fcrepo-exts / fcrepo-camel-toolbox

A collection of ready-to-use messaging applications with fcrepo-camel
Apache License 2.0

Cached triple values are stored in values.dat which seems to grow infinitely #187

Open oleksii-datsiuk opened 2 years ago

oleksii-datsiuk commented 2 years ago

When using the toolbox to index Fedora 6 RDF data in Solr, we found that LDPath processing caches all RDF data in the "ldcache" directory, and some of this data seems never to be removed. The most problematic file appears to be ldcache/triples/values.dat: it accumulates all data that was ever retrieved from Fedora and is never cleaned up.

I tried analyzing the sources of fcrepo-camel-toolbox, the Apache Marmotta libraries, and the openrdf-sesame library, and I have not found any code path through which the toolbox could remove anything from values.dat.

I found that LDPath processing in the toolbox is hardcoded to use LDCachingFileBackend, which is responsible for creating the whole ldcache folder. It seems it would be useful to be able to plug in a different caching backend: an in-memory cache that does not accumulate stale data and does not serialize anything to disk.

I have actually implemented such a backend, and I would like you to consider integrating it into the main codebase, so that we don't need to maintain our own customized build.

Here is my reasoning for why such an in-memory caching backend would work well, at least for the Solr-indexing use case. As far as I can see, the toolbox processes messages from an ActiveMQ queue (or topic) sequentially: it requests RDF data from Fedora, applies the LDPath transformation to it, and sends the resulting document to Solr.

So the behavior we need from LDPath is quite simple: take data from Fedora and transform it. We don't need to cache this data (especially not cumulatively in a single huge file). In our case the caching is even harmful, because it is possible for an object in Fedora to be removed and then recreated with the same ID but different content, and LDPath will not request fresh data from Fedora until the cached value expires (after expiration it does request fresh data, but the old data still remains in values.dat).

So, for the "just transform" use case, LDPath really doesn't have to cache anything. The Marmotta implementation does seem to need some cache anyway (at least to hold the values of the current document until the transformation finishes), but for that an in-memory cache is enough: a small number of entries, cached for a short time (perhaps a minute or so, while a single transformation request is in progress).
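
To illustrate, the kind of cache I have in mind can be sketched in plain Java. This is only an illustrative sketch with my own naming, not my actual implementation and not Marmotta's LDCachingBackend interface: entries expire after a short TTL, expired entries are dropped on access, and a prune() pass keeps the map small between requests.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a tiny in-memory cache with per-entry expiry.
// A real backend would adapt something like this to the caching
// interface that the LDPath processor expects.
class TtlCache<K, V> {

    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) {
            return null;
        }
        if (e.expiresAt < System.currentTimeMillis()) {
            map.remove(key);          // expired: drop it instead of serving stale data
            return null;
        }
        return e.value;
    }

    // Drop all expired entries; keeps memory use small between requests.
    void prune() {
        long now = System.currentTimeMillis();
        for (Iterator<Map.Entry<K, Entry<V>>> it = map.entrySet().iterator(); it.hasNext(); ) {
            if (it.next().getValue().expiresAt < now) {
                it.remove();
            }
        }
    }

    int size() {
        return map.size();
    }
}
```

Nothing is ever written to disk, so there is no values.dat equivalent to grow.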

A small number of cache entries also matters in our use case, because the RDF document contains a field with the full text (extracted by Tika) of a binary ZIP package that is also stored in the Fedora repository. In theory this field value could be hundreds of megabytes, so we don't want to keep many such values in memory. And if I understand correctly, the toolbox processes Fedora messages sequentially in a single thread, so a large cache isn't needed.

As mentioned above, I have implemented a caching backend that works for the described use case, and I suspect this solution could be useful for other people as well. If you agree, I can create a pull request with my changes. And if there is something wrong with my reasoning, please let me know.

whikloj commented 1 year ago

Hey @oleksii-datsiuk, sorry for the delay. Because Apache Marmotta has been archived, and because of other issues users have encountered (like this one), we are looking at removing the whole LDPath part of the camel-toolbox and going with XSLT to support this use case. If you are interested in helping out with this effort, we'd be happy to work with you on it.

oleksii-datsiuk commented 1 year ago

Hi @whikloj,

Thank you very much for the response.

I should mention that I have quite a bit of experience with XSLT (2.0) and would NOT recommend it when documents can be large (as in my case). We had serious performance and memory problems with XSLT when the input XML file was only 100-200 MB.

The XSLT engine loads the whole document into memory and parses it, which takes a great deal of RAM. I'm not sure whether XSLT 3.0 engines already support some kind of streaming (so that the whole input/output documents are not kept in memory). I had a hard time optimizing XSLT files so that they could process even 100-200 MB XML files in a reasonable amount of memory and time.

So I would definitely recommend a stream-based approach, at least for scenarios where complex transformations are not needed: read small portions of data from the input stream and immediately write the resulting data to the output stream (to disk or directly to the network). That way the whole document is never loaded into memory, and none of the complex processing that XSLT performs is needed.
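
To illustrate the key property (a sketch with hypothetical names; a real transformation would do more than a plain copy): only a small fixed-size buffer is ever held in memory, regardless of document size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative sketch of stream-based processing: data flows from the
// input to the output in small chunks, so memory use stays bounded no
// matter how large the document is.
class StreamingCopy {

    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];      // only 8 KB ever held in memory
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);         // write immediately, never accumulate
            total += n;
        }
        out.flush();
        return total;
    }
}
```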

To make this work in a single pass, it may be necessary to save each input triple to a separate file on disk and then concatenate them into a single output stream according to a mapping program (similar to LDPath), again without loading large amounts of data into memory. This file-based logic could be enabled only when the document size exceeds some threshold, so as not to slow down processing when there are large numbers of small documents.
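
A sketch of the threshold logic (hypothetical names again): buffer in memory up to a limit, transparently spill to a temporary file once the limit is crossed, and finally stream the content to the destination.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: small documents stay in memory, large ones are
// spilled to a temp file so they never occupy large amounts of heap.
class SpillBuffer extends OutputStream {

    private final int threshold;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path spillFile;               // non-null once we have spilled
    private OutputStream fileOut;

    SpillBuffer(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] { (byte) b }, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (fileOut == null && memory.size() + len > threshold) {
            spill();                      // crossing the threshold: go to disk
        }
        if (fileOut != null) {
            fileOut.write(b, off, len);
        } else {
            memory.write(b, off, len);
        }
    }

    private void spill() throws IOException {
        spillFile = Files.createTempFile("triples-", ".tmp");
        fileOut = new BufferedOutputStream(Files.newOutputStream(spillFile));
        memory.writeTo(fileOut);          // move what was buffered so far
        memory = null;
    }

    boolean spilled() {
        return spillFile != null;
    }

    // Stream the buffered content to the final destination and clean up.
    void copyTo(OutputStream out) throws IOException {
        if (fileOut != null) {
            fileOut.close();
            Files.copy(spillFile, out);
            Files.delete(spillFile);
        } else {
            memory.writeTo(out);
        }
    }
}
```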

If you think this approach makes sense, I can try to find some time to implement it.

whikloj commented 1 year ago

Hey @oleksii-datsiuk, just wondering if you saw the ldpath.fcrepo.cache.timeout setting. It defaults to 0, but you could certainly change it so that your cache cleans itself up.
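
For example, in the toolbox configuration (the value here is only an illustration; check the documentation for the exact unit of the timeout):

```properties
# Hypothetical example values; verify the unit against the toolbox docs.
ldpath.fcrepo.cache.timeout=86400
ldpath.cache.timeout=86400
```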

oleksii-datsiuk commented 1 year ago

Hi @whikloj. Yes, I saw two settings: ldpath.fcrepo.cache.timeout and ldpath.cache.timeout. But neither of them causes values.dat to ever be cleaned up. As far as I could tell from the code of the toolbox and its libraries, values.dat is never cleared. That's why I implemented an in-memory caching backend that avoids creating values.dat (and the other cache files) entirely. I could provide my changes as a pull request.