OPUS4 / application

OPUS 4 application.

Using Gearman for long fulltext extractions #455

Open j3nsch opened 2 years ago

j3nsch commented 2 years ago

If a PDF file is really big, extracting the full text for indexing might take several minutes. If we run a re-indexing of all documents on the command line, that isn't a problem. It just takes as long as it takes. However, if a new document is uploaded, the extraction happens immediately, and if that takes too long the request will time out and the user won't get a proper response. Therefore there is a time limit for the full text extraction that happens as part of a user request. That guarantees that there will be a response and that the metadata will end up in the index, but the PDF file won't be indexed in those cases.

TODO How can we fix this using Gearman? Can we?

For large files, we could push the full text extraction to Gearman as a job, index the metadata immediately, and keep going. In this case we also don't have to wait for a timeout: we just check how large the file is first and push the large ones into the background.
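To make the size check concrete, here is a minimal sketch in Python. It is not OPUS 4 code: the threshold value, the function names, and the in-memory queue that stands in for a Gearman client are all assumptions for illustration.

```python
import os
import queue

# Hypothetical threshold; this issue does not specify a value.
LARGE_FILE_BYTES = 10 * 1024 * 1024

# Stand-in for a Gearman client: a real setup would submit a background job here.
background_jobs = queue.Queue()


def extract_fulltext(path):
    """Placeholder extractor; a real one would parse the PDF."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8", errors="replace")


def index_document(doc_id, metadata, path, index, threshold=LARGE_FILE_BYTES):
    """Index the metadata now; extract inline only if the file is small."""
    entry = dict(metadata)
    if os.path.getsize(path) > threshold:
        # Large file: defer the extraction, the full text gets filled in later.
        background_jobs.put({"doc_id": doc_id, "path": path})
        entry["fulltext"] = None
    else:
        entry["fulltext"] = extract_fulltext(path)
    index[doc_id] = entry
    return entry
```

The request never waits on a large file in this sketch, so no timeout is needed for that path. Small files could still keep the existing time limit as a safety net.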

If the system works really well, we could also simply push all text extractions into the background. However, this won't become really useful until we are able to do atomic updates of documents, where we can update individual fields in the index and thereby fill the full text field later, without re-indexing the entire document. Of course, the rest of the data is usually small and could be indexed again. It might matter more if a document has multiple large files that could be added to the index separately. There are known extreme cases of documents with multiple files that add up to more than a gigabyte.
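As an illustration of what such an atomic update could look like, the sketch below builds an update document that touches only the full text field. It assumes a Solr-style index, where a `{"set": value}` modifier replaces one field and leaves the others alone. The field name `fulltext` and the function name are hypothetical, and nothing is sent to a server here.

```python
import json


def build_fulltext_update(doc_id, file_texts, field="fulltext"):
    """Build an atomic-update document that only touches the full text field.

    Assumes Solr-style update modifiers; `field` is a hypothetical name.
    """
    return {"id": str(doc_id), field: {"set": list(file_texts)}}


# One entry per file, so several large files could be added in one update.
payload = json.dumps([build_fulltext_update(146, ["text of file 1", "text of file 2"])])
```

If the index is Solr, atomic updates generally require the other fields to be stored or to have docValues, so the schema would need checking before relying on this.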

The Gearman job would have to perform the extraction and then trigger a re-indexing or update (at some point in the future) of the document. If we are running a full indexing, that could mean that jobs will trigger index requests while the full indexing is still running, but I don't see how that is different from multiple parallel requests during normal operation, so I think it should be fine.
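A rough sketch of what the job itself might do, again in Python and independent of any Gearman library: the job payload shape and the injected `extract` and `update_index` callables are assumptions, not existing OPUS 4 functions.

```python
def run_extraction_job(job, extract, update_index):
    """Handle one background job: extract the text, then request an index update.

    `job` is a dict like {"doc_id": ..., "path": ...}; `extract` and
    `update_index` are injected callables (hypothetical names).
    """
    try:
        text = extract(job["path"])
    except Exception as error:
        # Report the failure instead of crashing the worker;
        # the metadata is already in the index at this point.
        return {"doc_id": job["doc_id"], "status": "failed", "error": str(error)}
    update_index(job["doc_id"], text)
    return {"doc_id": job["doc_id"], "status": "indexed"}
```

Because the job only issues an ordinary index request for one document, it behaves like any other parallel request, which matches the reasoning above about a full indexing running at the same time.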

I think what has been described above would already solve a lot of problems and would be enough for this issue. However, what else could be built on top that would be useful? We should create additional issues for those ideas. For instance, we could use JavaScript to check on the extraction status of documents that have been uploaded in the Publish form. The extraction could start right after the upload, while the metadata is being entered, and the full text might be ready for indexing even before the new document is finally submitted. That is covered in issue #460.

Internal: https://tickets.zib.de/jira/browse/OPUSVIER-4412

j3nsch commented 2 years ago

@kaustabhbarman I have created this issue as a concrete example of using Gearman that would be very useful for OPUS 4. At the end of the description is the outline for another improvement that could be built on top of that. Please create an issue for that and update the last sentence of the description to point at the newly created issue. Thank you!

kaustabhbarman commented 2 years ago

#460 refers to this.