cern-sis / issues-inspire


Improve desy harvest #370

Closed MJedr closed 12 months ago

MJedr commented 1 year ago

Currently, the desy harvest is error-prone because we put all incoming files (jsonl, pdf) in one bucket (inspire-publishers-desy-incoming). After processing each JSON record in a jsonl file, we move the documents referenced in that record to another bucket. If somebody mistakenly adds the same jsonl file twice, or references the same document twice, the harvest fails with an error.
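The failure mode can be pictured with a small simulation. This is only a sketch under assumptions: plain dicts stand in for the two buckets, the `documents` and `key` field names are hypothetical, and the exact error raised by the real spider may differ.

```python
import json


def process_jsonl(jsonl_text, incoming, processed):
    """Move every document referenced by each record out of `incoming`.

    `incoming` and `processed` are plain dicts standing in for the two buckets.
    """
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for doc in record.get("documents", []):  # hypothetical field name
            key = doc["key"]  # hypothetical field name
            if key not in incoming:
                # The document was already moved by an earlier record or run.
                raise FileNotFoundError(f"{key} is no longer in the incoming bucket")
            processed[key] = incoming.pop(key)


incoming = {"paper.pdf": b"%PDF"}
processed = {}
jsonl = '{"documents": [{"key": "paper.pdf"}]}\n'

process_jsonl(jsonl, incoming, processed)  # first pass succeeds
try:
    process_jsonl(jsonl, incoming, processed)  # same jsonl submitted again
    failed = False
except FileNotFoundError:
    failed = True
```

Because every submission shares one flat namespace, the second pass cannot tell a duplicate submission from a genuinely missing file.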

A possible solution is to put all the files referenced in a jsonl file in one directory together with that jsonl file. The process should look like this:
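As an illustration of the directory-per-jsonl idea (a sketch under assumptions, not the steps of the actual harvest), the helper below computes the object keys for one submission. Naming the directory after the jsonl file's stem is an assumption, as are the `plan_submission` helper and the `documents` and `key` field names.

```python
import json
import posixpath


def plan_submission(jsonl_name, jsonl_text):
    """Return the object keys for one submission, all under one directory."""
    # Assumption: the directory is named after the jsonl file without its extension.
    directory = posixpath.splitext(jsonl_name)[0]
    keys = [posixpath.join(directory, jsonl_name)]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for doc in record.get("documents", []):  # hypothetical field name
            key = posixpath.join(directory, doc["key"])  # hypothetical field name
            if key not in keys:
                # A document referenced twice is stored only once.
                keys.append(key)
    return keys


layout = plan_submission(
    "batch1.jsonl",
    '{"documents": [{"key": "paper.pdf"}, {"key": "paper.pdf"}]}\n',
)
```

With this layout, a resubmitted jsonl file and its documents share one prefix, so a duplicate can be detected by checking whether that prefix already exists instead of failing halfway through a move.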

MJedr commented 1 year ago

Tech notes

In the desy spider: