Currently, the DESY harvest is error-prone because we put all the incoming files (jsonl, pdf) in one bucket (`inspire-publishers-desy-incoming`). Then, after processing each json inside the jsonl, we move the documents referenced by that json to another bucket. If somebody mistakenly uploads the same jsonl twice, or two jsonls reference the same document, the second attempt to move an already-moved document fails.
A possible solution is to put all the files referenced by a jsonl into one directory together with that jsonl. The process would look like this:
- the curator uploads a directory to S3 that contains everything required to process the jsonl;
- we crawl the bucket periodically and list the directories in it; if there is a new one (not present in the output bucket), we start parsing the jsonl inside it;
- when we finish processing all the jsons in the jsonl (including adding documents, but in this scenario we no longer move pdfs to another bucket), we copy the whole directory to the output bucket and delete it from the incoming bucket (a sketch of this plumbing follows the list).
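A minimal sketch of the bucket-level plumbing, using boto3. The output bucket name (`inspire-publishers-desy-processed`) and the helper names are assumptions for illustration, not existing hepcrawl code; note that S3 has no atomic move, so "moving" a directory is a copy followed by a delete.

```python
import boto3

INCOMING = "inspire-publishers-desy-incoming"
PROCESSED = "inspire-publishers-desy-processed"  # assumed name of the output bucket

s3 = boto3.client("s3")


def list_directories(bucket):
    """Top-level 'directories' (common prefixes) in a bucket."""
    paginator = s3.get_paginator("list_objects_v2")
    prefixes = set()
    for page in paginator.paginate(Bucket=bucket, Delimiter="/"):
        for common_prefix in page.get("CommonPrefixes", []):
            prefixes.add(common_prefix["Prefix"])
    return prefixes


def list_keys(bucket, prefix):
    """All object keys under a given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def new_directories():
    """Directories uploaded by the curator but not processed yet."""
    return list_directories(INCOMING) - list_directories(PROCESSED)


def move_directory(prefix):
    """Copy every object under `prefix` to the processed bucket, then
    delete it from the incoming one (copy + delete, since S3 has no move)."""
    # Materialize the listing first so deletes don't interleave with paging.
    for key in list(list_keys(INCOMING, prefix)):
        s3.copy_object(
            Bucket=PROCESSED,
            Key=key,
            CopySource={"Bucket": INCOMING, "Key": key},
        )
        s3.delete_object(Bucket=INCOMING, Key=key)
```

Because `new_directories()` compares the incoming bucket against the processed one, re-uploading an already-processed directory is simply ignored rather than causing an error.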
Implementation notes:

- Add a new method to build the start requests instead of this one: it should check whether the whole directory is already in the processed bucket and, only if it is not, yield a request similar to this one, with the request URL modified accordingly (see the sketch after this list).
- When adding a document, don't move it to the processed bucket here.
- At the end of parsing, move the whole directory to the processed bucket there.
- Add the name of the jsonl file to the record's acquisition source.
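A hedged sketch of what the spider side could look like, reusing `s3`, `INCOMING`, `new_directories()`, `list_keys()`, and `move_directory()` from the snippet above. The class shape, the `parse_jsonl` callback, and the `source_file` key inside `acquisition_source` are illustrative assumptions, not the actual hepcrawl DESY spider API.

```python
import json
import os

import scrapy


class DesySpider(scrapy.Spider):
    name = "desy"

    def start_requests(self):
        # Only directories absent from the processed bucket are harvested,
        # so re-uploading the same directory cannot trigger a second run.
        for prefix in new_directories():
            for key in list_keys(INCOMING, prefix):
                if not key.endswith(".jsonl"):
                    continue
                url = s3.generate_presigned_url(
                    "get_object", Params={"Bucket": INCOMING, "Key": key}
                )
                yield scrapy.Request(
                    url,
                    callback=self.parse_jsonl,
                    meta={"s3_prefix": prefix,
                          "jsonl_name": os.path.basename(key)},
                )

    def parse_jsonl(self, response):
        jsonl_name = response.meta["jsonl_name"]
        for line in response.text.splitlines():
            record = json.loads(line)
            # Record which jsonl file the record came from; the exact key
            # inside acquisition_source is an assumption.
            record.setdefault("acquisition_source", {})["source_file"] = jsonl_name
            yield record
        # Documents are not moved one by one while parsing; the whole
        # directory is moved once, after the jsonl is fully processed.
        move_directory(response.meta["s3_prefix"])
```

Since the directory is moved only once, after the whole jsonl is parsed, a duplicate document reference inside a jsonl no longer races against a per-document move.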