adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

Possible bug: ArXiv direct ingest records are not being picked up on the correct entry date. #235

Closed seasidesparrow closed 2 years ago

seasidesparrow commented 4 years ago

After reviewing the relevant tasks, I can't see where we're passing an entry date into the Solr record for direct ingest (DI). The pubdate in the metadata payload will often predate the entry date. How (and where) can we change this so that we duplicate the way classic adds the entry date?

seasidesparrow commented 4 years ago

I think this line in aip.direct.ArXivDirect may be responsible:

https://github.com/adsabs/ADSImportPipeline/blob/9c02d896f4667edb274558355c6c318bd7a2b14b/aip/direct/ArXivDirect.py#L38

I think we should change entry_date from record['pubdate'] to now().

spacemansteve commented 4 years ago

Direct ingest must provide an entry date. The value to use for entry_date varies based on whether or not the classic pipeline has ingested this record.

If the classic pipeline has ingested this record, we want to reuse the classic pipeline's ingest time. If the classic pipeline has not seen this record, we want to use datetime.now().

There are two ways to determine if the classic pipeline has ingested this record. One is to try to instantiate an ADSRecords object (which will read classic's actual exported object from disk). The second way is to read from ADSImportPipeline's SQL table.

I think the better of these two options is the first, because the exported classic info is closer to the source and is considered authoritative. In the interest of time, I could live with a temporary solution that used ADSImportPipeline's SQL table.
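A minimal sketch of that decision, for illustration only. The two lookups are hypothetical stand-ins (plain dicts here) for "read classic's exported record" and "read ADSImportPipeline's SQL table"; neither `choose_entry_date` nor the lookup signature exists in the pipeline as far as this thread shows.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, Optional

# Hypothetical lookup type: given a bibcode, return the time classic ingested
# the record, or None if that source has never seen it.
Lookup = Callable[[str], Optional[datetime]]


def choose_entry_date(
    bibcode: str,
    lookups: Iterable[Lookup],
    now: Optional[datetime] = None,
) -> datetime:
    """Reuse classic's ingest time if any source knows the record, else use now.

    Lookups are tried in order, so the preferred (authoritative) source
    should come first and the fallback source second.
    """
    for lookup in lookups:
        found = lookup(bibcode)
        if found is not None:
            return found
    # No source has seen this record: treat it as new.
    return now if now is not None else datetime.now(timezone.utc)


# Stand-in data: in practice these would wrap the classic export and the SQL table.
classic_export = {"2020arXiv200100001A": datetime(2020, 1, 2, tzinfo=timezone.utc)}
sql_table = {"2020arXiv200100002B": datetime(2020, 1, 3, tzinfo=timezone.utc)}
sources = [classic_export.get, sql_table.get]

print(choose_entry_date("2020arXiv200100001A", sources))  # found in classic export
print(choose_entry_date("2020arXiv200100002B", sources))  # found only in SQL table
```

Ordering the lookups lets the temporary SQL-only solution and the preferred classic-export solution share the same calling code: only the list of sources changes.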

aaccomazzi commented 4 years ago

Need to push this to the forefront, as entry dates affect notifications created by myADS. Essentially we need to ensure that updated (not new) records coming from the arXiv nightly feed are not given a current entry_date by direct ingest.

seasidesparrow commented 4 years ago

I can make the change to AIP that I suggested above, but I'm not entirely sure it will address @spacemansteve's comment that followed.

aaccomazzi commented 4 years ago

I wonder how the citation capture pipeline deals with updates (vs. new records). I assume it holds a copy of its own data and can therefore look into it to see if the record was seen before. @marblestation can you please confirm? Either way, my feeling ATM is that the less we rely on classic, the better off we are.

In that light, I would prefer a solution in which a pipeline such as ADSImportPipeline can deal with a record which comes with an empty entry_date and does the following:

  1. if the record is already found in the database, reuse the entry date associated with it.
  2. if the record is not in the database, create entry_date from now()

The advantage of this would be the following:

  1. for records ingested directly by DirectIngest, we don't need to worry about setting a date, and the system will assign a sensible date under all circumstances
  2. for records ingested by a classic pipeline, an entry_date is by definition present (including classic arXiv ingest), so the behavior is backwards-compatible
  3. for publisher records, entry_date is still under the control of curators, which is what we want. But as we take pieces of classic away and pump more content in through direct ingest, things will continue to work as expected
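The two rules above could be sketched roughly as follows. This is an illustration under assumptions, not the pipeline's code: `resolve_entry_date` is a hypothetical helper, records are modeled as plain dicts with `bibcode` and `entry_date` keys, the database is modeled as a dict from bibcode to stored entry date, and the timestamp string format is assumed.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def resolve_entry_date(
    record: dict,
    stored_dates: Dict[str, str],
    now: Optional[datetime] = None,
) -> dict:
    """Return a copy of `record` with entry_date filled in.

    - A non-empty entry_date (classic or curator supplied) is kept as-is.
    - An empty entry_date is replaced by the stored date if the record is
      already known, otherwise by the current time.
    """
    resolved = dict(record)
    if resolved.get("entry_date"):
        return resolved  # backwards-compatible path: date came with the record

    existing = stored_dates.get(resolved["bibcode"])
    if existing is not None:
        # Updated (not new) record: reuse the date it was first given.
        resolved["entry_date"] = existing
    else:
        # New record: stamp it with the current time (format is an assumption).
        current = now if now is not None else datetime.now(timezone.utc)
        resolved["entry_date"] = current.strftime("%Y-%m-%dT%H:%M:%SZ")
    return resolved


# Stand-in for the pipeline database: bibcode -> previously assigned entry date.
db = {"2019arXiv190100001B": "2019-01-03T00:00:00Z"}

updated = resolve_entry_date({"bibcode": "2019arXiv190100001B", "entry_date": ""}, db)
print(updated["entry_date"])  # reuses the stored date for an already-known record
```

Because a record that arrives with an entry_date passes through unchanged, classic and curator-controlled records behave as before; only records that arrive without a date go through the lookup.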