adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

Direct ingest of arXiv records should use the list of new records generated by Classic #255

Closed seasidesparrow closed 3 years ago

seasidesparrow commented 3 years ago

From email from @aaccomazzi on 2021-Jan-08:

BTW, one way to avoid re-ingest via direct ingest is to update the direct ingest pipeline to use this ingest list, where only new records should appear:
/proj/ads_abstracts/sources/ArXiv/log/2021-01-07/new_records.tsv

Rather than these:
/proj/ads_abstracts/sources/ArXiv/UpdateAgent/UpdateAgent.out.2021-01-07.gz

Kelly modified the myADS pipeline to use the former, but I don't believe DI was ever updated. This should fix the immediate problem. Nonetheless, we still need to get the deletions right sooner or later.

seasidesparrow commented 3 years ago

This should be straightforward to fix; see run.py: https://github.com/adsabs/ADSImportPipeline/blob/5a2d75f416997f525e8d9782b17463121a7cef85/run.py#L228
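
For illustration only, a minimal sketch of reading the daily new_records.tsv list instead of the UpdateAgent output. This is not the code in run.py: the helper names `new_records_path` and `read_new_records` are hypothetical, and the file layout (one record per line, tab-separated, identifier in the first column, optional `#` comment lines) is an assumption, since the thread does not describe the format.

```python
import csv

# Path pattern taken from the example in this thread; only the date varies.
NEW_RECORDS_TEMPLATE = "/proj/ads_abstracts/sources/ArXiv/log/{date}/new_records.tsv"


def new_records_path(date_str):
    """Build the path to the new-records list for a date like '2021-01-07'."""
    return NEW_RECORDS_TEMPLATE.format(date=date_str)


def read_new_records(tsv_path):
    """Return the first column of every data row in the TSV file.

    Assumption (not confirmed by the thread): the record identifier is in
    column 0, and blank lines or lines starting with '#' can be skipped.
    """
    records = []
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip empty rows and comment lines.
            if not row or not row[0].strip() or row[0].startswith("#"):
                continue
            records.append(row[0].strip())
    return records
```

Under these assumptions, `read_new_records(new_records_path("2021-01-07"))` would give the list of records to pass to direct ingest, so records already ingested are not queued again.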

seasidesparrow commented 3 years ago

Fixed in v1.1.16