IQSS / dataverse

Open source research data repository software
http://dataverse.org

Beyond Dataverse 5.0: make it pluggable (on a code level) #7050

Closed - poikilotherm closed this issue 1 month ago

poikilotherm commented 4 years ago

tl;dr: we should discuss how to make Dataverse more pluggable and open to third-party extensions.

While digging into why our WAR file is so big (it will be ~135MB for Dataverse 5.0, down from ~205MB for 4.20), I stumbled over a lot of things that could use a hand. This is mostly about refactoring old code, libraries and dependencies. Great - I like that :pick: . (See an older list, to be updated, at #5360.)

But one thing that is likely not shrinkable: Apache Tika. It's used for the great full-text indexing component provided by @qqmyers (thank you! :+1: ). But: that's a 45MB increase in WAR size for a feature which is completely optional. I have no figures on how many installations have actually enabled it and use it. For "us" (Jülich DATA) it makes images bigger and adds to deployment time :neutral_face: .

In many cases, great new features are accepted by IQSS and merged (yeah! :tada: ), even though they might not be used for Harvard. But is this a good approach? Maintenance effort is put onto people who focus on other features. Testing is necessary, and I'm not the only one around who has some grey hair from the status quo :face_with_head_bandage: (looking at you @4tikhonov, @skasberger, @kcondon, @donsizemore, @pdurbin and others).

Back in the day of #4106 people started to fork just to add support for some functionality that was rejected (for good) by IQSS. But it's a pity you're forced into forking, which is always a big tradeoff. :balance_scale:

Now we have great new working groups! :heart: Metadata! :heart: More of that! Yes please! :+1: And it's likely that there's more ingest stuff coming down the road. Lots of new features which don't fit under "External Tool" or "Integration" like the shiny previewers, but are code that needs access at a "Java level".

Those "extensions" or "plugins" should be easy to install for admins (no compiling and fiddling with Maven like DSpace) and developed independently from IQSS, offloading maintenance, testing and development. Ideally people would start to share their plugins. Or even start selling them. Lots of new options.

There are a few ways to do this :brain: . One is using a small framework like https://pf4j.org; there are others, too. I am eager for more input. What do y'all think? Especially: what does our beloved architect @scolapasta think about this? :bow:

This should start small and where we all see fit. It can grow as we go, trying to find the sweet spot between best community support and refactoring burden. Is there a chance to find some funding to enable this next-generation repository technology? (Yeah, I know this is nothing new from a technical perspective, but it seems to be from a community perspective.)

stevenmce commented 4 years ago

Keen to see this develop - there are possible implications for our "Request Data Access" extension/external tool/... here at ADA - ping @mdmADA

poikilotherm commented 4 years ago

I should add a disclaimer: I don't see this as a plugin thing using SPI, but rather as class loaders loading real JARs from the class path. Those can be managed and installed with their own lifecycle.
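For illustration, the class-loader approach could look roughly like this (all paths and class names are hypothetical; this is a sketch, not Dataverse code):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

public class PluginLoaderSketch {
    public static Object loadPlugin() throws Exception {
        // hypothetical location of an installed plugin JAR
        Path jar = Path.of("/usr/local/dataverse/plugins/fulltext-plugin.jar");
        URLClassLoader loader = new URLClassLoader(
                new URL[] { jar.toUri().toURL() },
                Thread.currentThread().getContextClassLoader());
        // hypothetical implementation class inside the plugin JAR
        Class<?> clazz = loader.loadClass("org.example.TikaFullTextIndexer");
        return clazz.getDeclaredConstructor().newInstance();
    }
}
```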

One could debate what makes more sense, as SPI-based plugins could be placed in java.ext.dirs, but I'm not sure how feasible that is. Either way, whether using SPI or other approaches, we would need a transition of the code infrastructure (a start has been made by @ekraffmiller for the metadata exporters, using an SPI).

SPI might be a bit more tedious - Google AutoService is already in use to ease the registration, but it adds more weight to the WAR as it depends on Guava. Other frameworks not using SPI might be lighter.
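For context, AutoService-style registration looks like this: the annotation processor writes the META-INF/services entry at compile time instead of a hand-maintained file (the names here are invented, for illustration only):

```java
import com.google.auto.service.AutoService;

// hypothetical extension point, for illustration only
interface ExamplePlugin {
    String name();
}

// AutoService generates META-INF/services/<fully qualified interface name>
// at compile time, so no hand-written registration file is needed
@AutoService(ExamplePlugin.class)
public class MyPlugin implements ExamplePlugin {
    @Override
    public String name() {
        return "my-plugin";
    }
}
```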

djbrooke commented 4 years ago

Thanks @poikilotherm for starting the discussion. I don't have thoughts about the implementation, but I'm excited about this for the reasons that you outline above. Looking forward to talking in more detail.

qqmyers commented 4 years ago

A concrete example might be useful. What would it take to make full-text indexing pluggable - in terms of code changes, as well as how we'd manage the plugin and what admins would have to do to install/run it?

poikilotherm commented 4 years ago

Examples. Alright, here we go.

Currently I see three SPI-based "extensions" within the codebase:

If we feel like using something different than SPI, with its tedious registration etc., that would be great: this is already decoupled code, ready to be moved into a submodule, a separate code base, whatever. If it should remain as an SPI component, there are still options to move things around. Great!

Full-text indexing on the other hand is tightly integrated with the IndexServiceBean.

To decouple:

  1. The business logic of full-text indexing would need to be moved to an independent Maven project. Ideally this would become a JAR of its own, no matter whether we use SPI or a plugin framework (class loader).
  2. An interface would have to be created, much like the SPIs in the other places. The index service bean logic would rely on that interface, not the implementation. If the plugin is not installed, or is disabled by configuration, it would simply not be used, failing gracefully (see the sketch after this list).
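A minimal sketch of point 2, assuming a hypothetical FullTextIndexer interface discovered via the plain JDK ServiceLoader (any other loading mechanism would slot in the same way; none of these names are actual Dataverse code):

```java
import java.io.InputStream;
import java.util.Optional;
import java.util.ServiceLoader;

// hypothetical extension point for full-text indexing
interface FullTextIndexer {
    String getTextStringToIndex(InputStream is) throws Exception;
}

public class IndexServiceBeanSketch {
    // empty if no plugin JAR on the class path provides an implementation
    private final Optional<FullTextIndexer> indexer =
            ServiceLoader.load(FullTextIndexer.class).findFirst();

    String fullTextFor(InputStream is) throws Exception {
        if (indexer.isEmpty()) {
            return ""; // plugin absent or disabled: skip full-text indexing gracefully
        }
        return indexer.get().getTextStringToIndex(is);
    }
}
```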

Down the road, decoupled code parts could be changed on their own. Maybe others would like to create their own implementation using other parsers, or add batch support. This is basic "Code Against Interfaces, Not Implementations", a principle much older than 25-year-old Java.

Another good example would be the PID providers mentioned above: moving them to an SPI-based infrastructure would be a great first step to decouple. A provider like EZID might then be removed from the codebase, making the WAR smaller, while remaining usable for those who need it.
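As a sketch, such a PID provider extension point could be as small as this (all names invented for illustration, not the actual Dataverse code):

```java
// hypothetical SPI for persistent identifier providers
public interface PidProvider {
    String protocol(); // e.g. "doi" or "hdl"
    String createIdentifier(String localId) throws Exception;
    void deleteIdentifier(String identifier) throws Exception;
}
```

An EZID implementation would then live in its own JAR, discovered the same way as the indexer sketch above.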

BTW, this would be a good chance to refactor the index and ingest service beans. Those methods are huge and, AFAIK, a great amount of their business logic is without any unit tests. I'm not sure how well the API tests cover them.

qqmyers commented 4 years ago

Tika is called in one place, and the interface is basically that Tika gets the file input stream and produces a string. The code to get Tika to manage that is a few lines, which is what I would assume would go into a Tika-based plugin for full-text indexing. Assuming a FullTextIndexer.getTextStringToIndex(InputStream is) interface, what else is needed? The few Tika-specific lines needed to implement that interface would go in a TikaFullTextIndexer class extending FullTextIndexer? What other code changes are needed to then handle finding and using that specific class and the Tika library from outside the main WAR - in the code and installers?
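Roughly, that could look like the following sketch (the interface and class names are the ones proposed above; the Tika calls are the standard AutoDetectParser API, but this is an illustration, not the actual Dataverse code):

```java
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Tika-specific implementation of the proposed interface
public class TikaFullTextIndexer implements FullTextIndexer {
    @Override
    public String getTextStringToIndex(InputStream is) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();         // Tika picks the right parser itself
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no output size limit
        parser.parse(is, handler, new Metadata());
        return handler.toString();
    }
}
```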

poikilotherm commented 4 years ago

@qqmyers I haven't looked very deeply into the full-text indexing code - I'm sure you know it by heart... :wink: Moving this behind an API should be fairly easy. Maybe it should be a bit more generic to support other use cases like metadata indexers, too. Some design upfront might be a good idea, but that would be beyond the scope of this issue IMHO.

The index service bean, or some other appropriate place, would need a loading mechanism for the plugin. That is just a few lines of code, executed for example during the startup of Dataverse.

FWIW: PF4J would provide a common code infrastructure to maintain both SPI-like extensions and plugins without having to juggle different coding styles.
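For illustration, startup loading with PF4J could look roughly like this (assuming the FullTextIndexer interface sketched earlier; for PF4J's getExtensions() to find it, the interface would additionally need to extend org.pf4j.ExtensionPoint):

```java
import java.util.List;
import org.pf4j.DefaultPluginManager;
import org.pf4j.PluginManager;

public class PluginBootstrap {
    public static void main(String[] args) {
        PluginManager pluginManager = new DefaultPluginManager(); // scans the "plugins" directory by default
        pluginManager.loadPlugins();  // discover plugin JARs/ZIPs
        pluginManager.startPlugins(); // run each plugin's start() hook

        // all installed implementations of the (hypothetical) extension point:
        List<FullTextIndexer> indexers = pluginManager.getExtensions(FullTextIndexer.class);
        System.out.println("Found " + indexers.size() + " full-text indexer plugin(s)");
    }
}
```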

The installers would be extended with a new option indicating that a JAR should be installed to enable full-text indexing. Where that JAR comes from (bundled in the ZIP, downloaded from GitHub/Maven Central/...) is another story and would depend on where the code base lives (Maven submodule in the main repo, independent side project, ...).

qqmyers commented 4 years ago

Metadata indexing just involves pulling the entries from the db and handing them to Solr, so I'm not sure there's ever any reason to call external code for that. The one extension for full-text indexing I can think of would be to also pass the mimetype and allow registration per mimetype (Tika appears to scan the file and send the bytes off to the right underlying parser itself, but I could imagine someone wanting a custom full-text indexer for a particular file format/mimetype that isn't covered by Tika).
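One possible shape for that per-mimetype registration (hypothetical, just extending the earlier interface sketch with the mimetype):

```java
import java.io.InputStream;

// hypothetical variant: implementations declare which mimetypes they handle
public interface MimeAwareFullTextIndexer {
    boolean supports(String mimeType);
    String getTextStringToIndex(InputStream is, String mimeType) throws Exception;
}
```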

In any case, given that Tika is one of, if not the, biggest contributors to WAR size, it seems like a good place to explore the concept and get some immediate value for installations not using full-text indexing. I'm happy to update the code to add an interface if you're able to handle the configuration aspects. Do we need a global decision on anything like using PF4J up front? Or could we do full-text indexing this way and adapt later if needed? (Not advocating for every SPI to work differently, but if changing isn't too hard, having a real example of PF4J in use might make it clearer how the other existing SPIs and new ones would be simpler if we changed.)

poikilotherm commented 4 years ago

FWIW: metadata exporters like DDI et al. could be made even more extensible, too (so not just more formats, but extending existing formats as well). See #4451 for a use case by @BPeuch.

pdurbin commented 2 years ago

Related:

cmbz commented 1 month ago

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.