cern-sis / issues-scoap3


OUP harvesting #93

Open ErnestaP opened 1 year ago

ErnestaP commented 1 year ago

Harvesting: add a DAG that fetches the OUP publisher files. Important information for solving the task:

Harvesting steps:

1. Fetch data from FTP (PDF, XML, PDF/A). Each type comes in a separate zip. P.S. This is very similar to Springer and IOP.
2. Save it in S3.
3. Download it from S3.
4. Split the XML. A downloaded XML file might consist of more than one article; splitting means one smaller XML per article.
5. Trigger runs of the processing DAG. Each run has to be triggered with exactly one article.

The response is XML; use ElementTree as the parsing lib.

Expected behavior:

Input: an XML document, which might consist of more than one article.
Output: triggered runs of the files processing DAG.
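The split step could look roughly like the sketch below, using ElementTree as requested. It is only an illustration: the `article` tag name, the `articles` wrapper element in the sample, and the `split_articles` helper are assumptions, not taken from actual OUP files, so the tag should be checked against a real delivery.

```python
import copy
import xml.etree.ElementTree as ET


def split_articles(xml_text, article_tag="article"):
    """Return one serialized XML string per article element.

    `article_tag` is an assumption; adjust it to the real OUP schema.
    """
    root = ET.fromstring(xml_text)
    parts = []
    # iter() also yields the root itself, so a single-article file works too.
    for elem in root.iter(article_tag):
        article = copy.deepcopy(elem)
        # Drop whitespace/text that followed the element in the parent file.
        article.tail = None
        parts.append(ET.tostring(article, encoding="unicode"))
    return parts


# Hypothetical sample input with two articles in one file.
SAMPLE = """<articles>
  <article><title>First</title></article>
  <article><title>Second</title></article>
</articles>"""

parts = split_articles(SAMPLE)
for part in parts:
    # In the DAG, this is where one processing run per article would be triggered.
    print(part)
```

Each returned string is a standalone XML document, so it can be passed as the payload of a single processing DAG run.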

Tests are mandatory.

Important: verify with Anne.