ENH: optimize translator discovery and reuse process for multiple translate operations

jsheunis commented 1 year ago

Comments by @mslw in https://github.com/psychoinformatics-de/rfd/pull/68:

A translator is expected to be exposed by an extension as an entry point (just as is the case for extractors), and catalog translate command is responsible for reading a json lines file, matching, and applying translators. Which is awesome for files with tens / hundreds of lines coming from different extractors, but terrible for files with tens of thousands lines all from the same translator (e.g. metalad core file-level translator that's used to produce listings of files with their sizes etc.).

However, translation with catalog translate is not only single-threaded, but also repeats translator matching process for every single input line (1 dataset file - 1 json line of extracted output), taking hours for the largest datasets.

We should improve the translator discovery process in datalad-catalog to e.g. only match a translator once and then somehow keep track of that and not have to reinstantiate it when a new record requiring the same translator comes along.

jsheunis commented 1 year ago

Ideas by @mslw:

A simple solution to cut down translation time massively would be to take the translator and call its translate() method directly, bypassing the discovery process. Another idea worth exploring is to forget the file traversal, extraction, and translation altogether, and try to get catalog-schema metadata from git annex output for the dataset.

jsheunis commented 1 year ago

Kind of a duplicate issue:

https://github.com/datalad/datalad-catalog/issues/264

mslw commented 1 year ago

I agree with #264 - I think it would be great to keep current behaviour (match per line) and have an option to fix the translator upfront (or from first line).

mslw commented 1 year ago

Another question aside from pickjing translators upfront - although we decided to make the translator's match() a class method, the translate code still instantiates a translator object:

https://github.com/datalad/datalad-catalog/blob/84ffaa7d2b50fffda8079f1ecebe22882e3b9f7a/datalad_catalog/translate.py#L99-L108

I understand that this was done because of concerns that match() may still need to rely on instance methods https://github.com/datalad/datalad-catalog/pull/246#issuecomment-1429677754 -- but I would still argue that in this case we lose all the benefit, and should either abandon the class method approach or expect all methods used for matching to be class methods (preferred).

Although, TBF, I should test whether there is any significant time to be gained :smile:

mslw commented 1 year ago

Although, TBF, I should test whether there is any significant time to be gained smile

Yeah, no, I get almost identical times when adjusting translators and only switching lines 103-105 to create instance only after match :roll_eyes: Gains are to be made elsewhere, probably...

jsheunis commented 12 months ago

Closed by https://github.com/datalad/datalad-catalog/pull/309

datalad / datalad-catalog

ENH: optimize translator discovery and reuse process for multiple translate operations #305