(Discussion) Easier Installation Process for DSpace

tdonohue commented 1 month ago

Describe the bug

This is a discussion ticket related to finding ways to simplify the Installation process for DSpace 7/8/9. It's a place to gather public notes & recommendations while developers work on solutions that can be brought back into DSpace code and documentation.

Brainstorms

Some (unproven) approaches to simplifying the installation process may include:

Using Docker more heavily to containerize everything and automate installation. We should keep in mind this may not align with the needs of all institutions, as not everyone uses Docker or is comfortable with Docker.
Removing Apache Ant & only using Maven (as suggested by @pnbecker in a comment below) for installation of the backend.
Building Tools to help automate/manage configuration and/or validate configuration between frontend & backend. Such tools might minimize common installation issues.
Is there any way to potentially embed (or download) the User Interface using the Maven build process of the backend?
- Some examples (one and two) exist for embedding Angular apps into Spring Boot, but they are all specific to Client-Side Rendering (CSR) and do not use SSR via Node.js. This approach may not be possible/easy, since SSR requires running the frontend via Node.js rather than deploying it as a single-page application (SPA).
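
To make the "validate configuration between frontend & backend" idea a bit more concrete, here is a minimal sketch of such a check. It assumes the frontend's `rest` settings (`ssl`, `host`, `port`, `nameSpace`) and the backend's `dspace.server.url` have already been read into plain dictionaries. The function names are hypothetical, and the comparison is deliberately naive (it does not normalize an explicit default port in the backend URL, for example).

```python
def frontend_rest_url(rest: dict) -> str:
    """Rebuild the REST URL that the frontend's `rest` settings point at."""
    scheme = "https" if rest.get("ssl") else "http"
    default_port = 443 if scheme == "https" else 80
    port = int(rest.get("port", default_port))
    host = str(rest["host"]).lower()
    # Leave the port out when it is the default for the scheme.
    netloc = host if port == default_port else f"{host}:{port}"
    path = "/" + str(rest.get("nameSpace", "")).strip("/")
    return f"{scheme}://{netloc}{path}".rstrip("/")


def check_frontend_backend(rest: dict, backend: dict) -> list:
    """Return human-readable problems; an empty list means both sides agree."""
    server_url = backend.get("dspace.server.url", "").strip().rstrip("/")
    if not server_url:
        return ["backend: dspace.server.url is not set"]
    expected = frontend_rest_url(rest)
    if expected != server_url:
        return [
            f"frontend points at {expected} "
            f"but backend dspace.server.url is {server_url}"
        ]
    return []
```

A real tool would also need to check the reverse direction (the backend's idea of the UI URL) and CORS-related settings; this only shows the shape of the comparison.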

How can you help?

DSpace Committers & Developers are striving to improve the installation process of DSpace 7+, to make it easier to set up (less prone to configuration errors) and also easier to upgrade. We welcome brainstorms, solutions, shared code, and improved docs that align with these goals. DSpace is community built/maintained software (with no centralized development team). The more details community members can share, the more quickly we can locate/design solutions to benefit everyone.

pnbecker commented 1 month ago

I think it would already be a benefit if we could get rid of Ant. Having to use both Maven and Ant is something people seem to find hard to understand. Having production-ready Docker images would also be a huge win.

pnbecker commented 1 month ago

To reduce complexity in this area: With ant we currently filter the following files: default.context.xml, the log4j.properties, log4j2.xml and some RDF-Configuration. I think we have a good chance to get rid of this:

The RDF-Configuration is just being filtered to set a directory where the example configuration for a fuseki triple store is storing its files. We could remove this and let people set it manually if we documented that.
In default.context.xml we set the place where DSpace finds its configuration. Since DSpace 8 the embedded Tomcat uses a command-line parameter to get the same information. Is that already good enough? How do other Spring applications solve this?
For logging we set log.level and log.dir. I think it is fair to configure the log level directly in the XML files and not filter them. I have no idea yet how to solve log.dir, but as we move more and more toward Docker, we might log to stdout by default and document how to switch to storing log files.

What would be a concrete use case? It would make it easier to start CLI jobs from IDEs. (You can already do that now if you know how, but this would make it easier.)

mwoodiupui commented 1 month ago

I agree that we do not need to filter the RDF configuration (or any other) to customize examples.
"using a command-line parameter to get the same information." If that means setting a Java system property, this will break at sites which run multiple DSpace instances in a single Tomcat (as we do) because each instance needs a different value. The ServletContext is the proper place for the container to communicate with a single webapp.
Logging has a chicken/egg problem: DSpace could configure this programmatically, but then we could not log early startup issues because logging is not yet configured. I think this is another place where we should simply document that one must edit a single line in this file.

mwoodiupui commented 1 month ago

We may be trying to do too much because we cannot do enough. Filtering may not be powerful enough.

Perhaps we should view installation and configuration as two steps, and concentrate on promising that DSpace will start when installed according to directions but that one must then configure it to get a working repository. Logs will be written in /tmp or %TMPDIR% and you should change this. DSpace will run without a configuration path (or with a minimal configuration in WAR resources) but will not be usable until one is provided. There should be enough built in that a new instance can run well enough to tell you what is missing.
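
As an illustration of "run well enough to tell you what is missing", a startup check could look roughly like the sketch below. Which properties count as required is purely an assumption made for the example, and the function names are hypothetical.

```python
# Assumption: the set of "must be configured" properties is illustrative only.
REQUIRED = {
    "dspace.dir": "directory where DSpace is installed",
    "db.url": "JDBC URL of the database",
    "dspace.server.url": "public URL of the backend",
}


def missing_settings(config: dict) -> list:
    """List required properties that are absent or blank, in declaration order."""
    return [name for name in REQUIRED if not str(config.get(name, "")).strip()]


def startup_report(config: dict) -> str:
    """Build the message a freshly installed instance could show or log."""
    missing = missing_settings(config)
    if not missing:
        return "Configuration looks complete."
    lines = ["DSpace started with built-in defaults. Still to configure:"]
    lines += [f"  {name}: {REQUIRED[name]}" for name in missing]
    return "\n".join(lines)
```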

Configuration that is not necessary to get to a running state with a working administrative interface might be moved into the database and managed through administrative pages. This would create opportunity to validate nonessential configuration ("this property must be a valid {link ISO 8601 date}").
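
For instance, a validator for the ISO 8601 date case mentioned above might be sketched as follows. The property name used in the usage note is made up, and only calendar dates (not date-times or durations) are handled.

```python
from datetime import date
from typing import Optional


def validate_iso_date(name: str, value: str) -> Optional[str]:
    """Return an error message, or None when the value is a valid ISO 8601 date."""
    try:
        date.fromisoformat(value.strip())
    except ValueError:
        return f"{name} must be a valid ISO 8601 date (e.g. 2024-01-31), got {value!r}"
    return None
```

An administrative page could call `validate_iso_date("example.embargo.date", submitted_value)` before saving and show the returned message next to the field.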

We should still try to ensure that the configuration step is not needlessly onerous.

mwoodiupui commented 1 month ago

We should consider moving some configuration out of dspace.cfg/local.cfg. There are places where we do ugly things with property names to simulate tables, and even trees of linked structures. Those would be more comprehensible as XML: one could see the relationships in the nesting. Spring DI could do the parsing and inject ready-made structures into the classes that manage them.
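
To show what is hiding in those property names, the sketch below folds flat dotted keys into the nested structure they are simulating. The keys in the usage note are invented for illustration. In DSpace itself the proposal is XML parsed by Spring, so this is only meant to make the relationships visible.

```python
def nest(props: dict) -> dict:
    """Fold flat 'a.b.c = value' properties into nested dictionaries.

    Limitation of this sketch: a key may not be both a leaf and a parent
    (for example 'a' and 'a.b' together).
    """
    tree = {}
    for key, value in props.items():
        *parents, leaf = key.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree
```

With hypothetical input such as `{"source.1.name": "A", "source.1.url": "u", "source.2.name": "B"}`, the result groups everything about source 1 and source 2 under their own entries, which is the nesting an XML file would show directly.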

mwoodiupui commented 1 month ago

We should think about what must be configured to build, what must be configured to install, and what must be configured to run. Build time probably depends only on some Maven properties. Installation needs to know "where to copy the code" and "how to contact the database". I expect that everything else is runtime.
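
A first pass at that sorting exercise could be as small as the sketch below. The mapping is an assumption for illustration: it treats "where to copy the code" as `dspace.dir` and "how to contact the database" as the `db.*` connection properties, and lets everything unlisted fall through to runtime.

```python
# Assumption: this phase assignment is illustrative, not an agreed classification.
PHASE_OF = {
    "dspace.dir": "install",
    "db.url": "install",
    "db.username": "install",
    "db.password": "install",
}


def split_by_phase(props: dict, phase_of: dict = PHASE_OF) -> dict:
    """Group properties into build, install and run; unlisted ones are runtime."""
    phases = {"build": {}, "install": {}, "run": {}}
    for name, value in props.items():
        phases[phase_of.get(name, "run")][name] = value
    return phases
```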

mwoodiupui commented 1 month ago

I'm not so sure about doing away with Ant. Ant is procedural and we need support for this during installation. Ant is also cross-platform and we need that.

Maven works by example using its own procedures, and may not be what we need for installation. It is billed as a project comprehension and build tool, not installation support. Maven creates build artifacts.

pnbecker commented 1 month ago

Great thoughts @mwoodiupui. Reading that, I thought DSpace is rather unusual with its installer. Most other software provides tarballs, rpms and/or deb packages. Should we rethink our installation approach? Most comparable software would expect configuration at /etc/dspace and log to /var/log without asking. We could use these defaults and document how to change this.

mwoodiupui commented 1 month ago

Yes. Here I invented a new configuration property dspace.var and use it to locate the volatile directories separately from the code.

mwoodiupui commented 1 month ago

I have toyed with the idea of writing a graphical installer wizard, incorporating Ant as the script engine and Ivy to handle the build artifacts, but perhaps that is overkill. Maybe we only need to pack everything into an EAR that can be unzipped in the proper place.

tdonohue commented 1 month ago

Per the (brief) discussion in today's DevMtg, I just wanted to note that I fully support the idea of fixing our backend installation to no longer rely on Maven/Ant. In an ideal world, Maven would only be used if you needed to add in custom code...otherwise, you'd install DSpace without the need for Maven (and even possibly Ant).

I like the idea of finding a way to have "sane defaults" and either just unzip a package or have a basic installer (if easy to achieve). I will caution though that we do have people running DSpace on Windows in Production (not many, but you see those questions on lists occasionally). So, these "sane defaults" must work across multiple OSes, or have an easy way to tweak things like log location, etc (or have separate installers per OS). We cannot assume everyone is on Linux at all times.
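
One way to picture OS-aware "sane defaults" is the sketch below. The Linux paths follow the /etc/dspace and /var/log suggestion earlier in this thread (with a `dspace` subdirectory added as an assumption). The Windows location under `%PROGRAMDATA%` is purely an assumption, and `default_dirs` is a hypothetical helper.

```python
import os
import sys
from pathlib import PureWindowsPath


def default_dirs(platform: str = sys.platform, env=os.environ, overrides=None) -> dict:
    """Pick default config/log directories per OS, then apply explicit overrides."""
    if platform.startswith("win"):
        # Assumption: machine-wide application data is a reasonable Windows home.
        base = PureWindowsPath(env.get("PROGRAMDATA", r"C:\ProgramData")) / "dspace"
        dirs = {"config": str(base / "config"), "log": str(base / "log")}
    else:
        dirs = {"config": "/etc/dspace", "log": "/var/log/dspace"}
    # Anything the admin sets explicitly wins over the defaults.
    dirs.update(overrides or {})
    return dirs
```

Passing `overrides={"log": "/srv/dspace/log"}` is the "easy way to tweak things like log location" case: the defaults are only a starting point.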

As noted in today's meeting, discussion in this ticket is currently very specific to the backend. While it's great to hear ideas on simplifying the backend installation, we also need to find a way to simplify the frontend installation and simplify the configuration necessary to connect the frontend to the backend (or build a validation tool to help with that). All those are currently challenging. We especially see a lot of questions on support lists regarding getting the frontend & backend to "communicate" properly.

So, I'd also welcome ideas on installing both the frontend & backend with "sane defaults" connecting them. (I realize we may never make it a one-step install process, but simplification of the current process is necessary)

mwoodiupui commented 1 month ago

I would argue that the front-end installation process is absurdly simple: copy the dist/ directory tree to the place from which you want to run it.

The difficult part is what comes next: configuration, which must be coordinated with the back-end's configuration. We could use a tool for this, which would ask half a dozen questions and write configuration for both pieces.
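
A very reduced sketch of such a tool follows. It takes only two answers (the UI URL and the backend URL) and derives matching fragments for both sides from them, so the values cannot drift apart. The function name is hypothetical, the output file names and keys (`dspace.ui.url`, `dspace.server.url`, and the frontend `rest` block) are written from memory of the current configuration layout and should be treated as assumptions, and a real tool would ask more questions.

```python
from urllib.parse import urlsplit


def build_configs(ui_url: str, server_url: str) -> dict:
    """Return config fragments for backend and frontend from the same answers."""
    server = urlsplit(server_url)
    ssl = server.scheme == "https"
    port = server.port or (443 if ssl else 80)

    backend = (
        f"dspace.ui.url = {ui_url}\n"
        f"dspace.server.url = {server_url}\n"
    )
    frontend = (
        "rest:\n"
        f"  ssl: {str(ssl).lower()}\n"
        f"  host: {server.hostname}\n"
        f"  port: {port}\n"
        f"  nameSpace: {server.path or '/'}\n"
    )
    # Keys are target file names; the caller decides where to write them.
    return {"local.cfg": backend, "config.yml": frontend}
```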

mwoodiupui commented 1 month ago

A prerequisite for making the built back-end a distributable package is to carefully separate build-time, install-time and run-time configuration. Again, a "wizard" tool can help, by interviewing the admin. and writing separate configuration files for the three phases. The run-time file would be only a fragment sufficient to get the back-end to start successfully, and can be included by a more comprehensive set of runtime files.

We also need to be ruthless in minimizing the number of properties needed to make it runnable, so that these configuration builder tools will be simple and quick to use.

mwoodiupui commented 1 month ago

Since DSpace already depends on JavaScript on the host, why not write the tools in JavaScript? That should work cross-platform.

mwoodiupui commented 1 month ago

Remember that each piece runs in a container which provides an environment to the application. Configuration properties concerning "how the application is made to run" (ports, hostnames, "where's my database?") would be good candidates for this. Properties which concern the application's own functioning (not how it uses the rest of the system) should be separated from these and, if necessary, the environment should point to where they are stored.
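
A small sketch of that split: the container's environment carries only the "how am I made to run" values plus a pointer to where the application's own configuration lives. All variable names and defaults here are hypothetical, chosen only for the example.

```python
import os

# Assumption: hypothetical variable names; the last one only points at
# the application's own configuration rather than containing it.
DEPLOYMENT_DEFAULTS = {
    "DSPACE_HOST": "localhost",
    "DSPACE_PORT": "8080",
    "DSPACE_DB_URL": "jdbc:postgresql://localhost:5432/dspace",
    "DSPACE_CONFIG_DIR": "/etc/dspace",
}


def deployment_settings(env=os.environ) -> dict:
    """Read deployment-level settings from the environment, with defaults."""
    return {name: env.get(name, default) for name, default in DEPLOYMENT_DEFAULTS.items()}
```

Everything about the repository's own behavior would then be loaded from the directory named by the last entry, not from the environment itself.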
