Currently, the DESY harvest is error-prone because we put all the incoming files (jsonl, pdf) in one bucket (`inspire-publishers-desy-incoming`). Then, after processing each json inside the jsonl, we move the documents referenced by that json to another bucket. If somebody mistakenly uploads the same jsonl twice, or two jsonls reference the same document, the second attempt to move an already-moved document fails.
A possible solution is to put all the files referenced by a jsonl into one directory together with that jsonl. The process would look like this:
- the curator uploads a directory to S3 that contains everything required to process the jsonl;
- we crawl the bucket periodically and list the directories in it; if there is a new one (not present in the output bucket), we start parsing the jsonl inside it;
- when we finish processing all the jsons in the jsonl (including adding documents, but in this scenario we no longer move pdfs to another bucket), we copy the whole directory to the output bucket and delete it from the incoming bucket (a sketch of this plumbing follows the list).
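A minimal sketch of the bucket-level plumbing, using boto3. The output bucket name (`inspire-publishers-desy-processed`) and the helper names are assumptions for illustration, not existing hepcrawl code; note that S3 has no atomic move, so "moving" a directory is a copy followed by a delete.

```python
import boto3

INCOMING = "inspire-publishers-desy-incoming"
PROCESSED = "inspire-publishers-desy-processed"  # assumed name of the output bucket

s3 = boto3.client("s3")


def list_directories(bucket):
    """Top-level 'directories' (common prefixes) in a bucket."""
    paginator = s3.get_paginator("list_objects_v2")
    prefixes = set()
    for page in paginator.paginate(Bucket=bucket, Delimiter="/"):
        for common_prefix in page.get("CommonPrefixes", []):
            prefixes.add(common_prefix["Prefix"])
    return prefixes


def list_keys(bucket, prefix):
    """All object keys under a given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def new_directories():
    """Directories uploaded by the curator but not processed yet."""
    return list_directories(INCOMING) - list_directories(PROCESSED)


def move_directory(prefix):
    """Copy every object under `prefix` to the processed bucket, then
    delete it from the incoming one (copy + delete, since S3 has no move)."""
    # Materialize the listing first so deletes don't interleave with paging.
    for key in list(list_keys(INCOMING, prefix)):
        s3.copy_object(
            Bucket=PROCESSED,
            Key=key,
            CopySource={"Bucket": INCOMING, "Key": key},
        )
        s3.delete_object(Bucket=INCOMING, Key=key)
```

Because `new_directories()` compares the incoming bucket against the processed one, re-uploading an already-processed directory is simply ignored rather than causing an error.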
Implementation notes:

- Add a new method to build the start requests instead of this one: it should check whether the whole directory is already in the processed bucket and, only if it is not, yield a request similar to this one, with the request URL modified accordingly (see the sketch after this list).
- When adding a document, don't move it to the processed bucket here.
- At the end of parsing, move the whole directory to the processed bucket there.
- Add the name of the jsonl file to the record's acquisition source.
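A hedged sketch of what the spider side could look like, reusing `s3`, `INCOMING`, `new_directories()`, `list_keys()`, and `move_directory()` from the snippet above. The class shape, the `parse_jsonl` callback, and the `source_file` key inside `acquisition_source` are illustrative assumptions, not the actual hepcrawl DESY spider API.

```python
import json
import os

import scrapy


class DesySpider(scrapy.Spider):
    name = "desy"

    def start_requests(self):
        # Only directories absent from the processed bucket are harvested,
        # so re-uploading the same directory cannot trigger a second run.
        for prefix in new_directories():
            for key in list_keys(INCOMING, prefix):
                if not key.endswith(".jsonl"):
                    continue
                url = s3.generate_presigned_url(
                    "get_object", Params={"Bucket": INCOMING, "Key": key}
                )
                yield scrapy.Request(
                    url,
                    callback=self.parse_jsonl,
                    meta={"s3_prefix": prefix,
                          "jsonl_name": os.path.basename(key)},
                )

    def parse_jsonl(self, response):
        jsonl_name = response.meta["jsonl_name"]
        for line in response.text.splitlines():
            record = json.loads(line)
            # Record which jsonl file the record came from; the exact key
            # inside acquisition_source is an assumption.
            record.setdefault("acquisition_source", {})["source_file"] = jsonl_name
            yield record
        # Documents are not moved one by one while parsing; the whole
        # directory is moved once, after the jsonl is fully processed.
        move_directory(response.meta["s3_prefix"])
```

Since the directory is moved only once, after the whole jsonl is parsed, a duplicate document reference inside a jsonl no longer races against a per-document move.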