adsabs / ADSDocMatchPipeline

Pipeline to match publisher document with preprint counterpart and vice versa
MIT License

Earthsci preprint matching.2023 nov02 #26

seasidesparrow closed 11 months ago

seasidesparrow commented 11 months ago

I think it comes down to deciding whether the parser should do system-level logic itself, or whether it should act on what the controlling code tells it to do. In this case, the pathnames are definite only at this moment in time, and we're in a transition period where the underlying system architecture or flow control may change. If we use a calling option like the one I've implemented to specify that logic, we're not reliant on the current architecture, and that would be my preference.

golnazads commented 11 months ago

I would discuss this with @aaccomazzi, since he was the one who explained to me that the input list needs to contain both arXiv and earth science records. Your design distinguishes between these, so you have to have one list for arXiv and a separate list for earth science to get them processed.

aaccomazzi commented 11 months ago

I'm pretty sure we had this conversation a few months back and from my perspective nothing has changed, so let me restate what I think we should do:

  1. Unify the format for the input list of records to be matched -- right now for arXiv we read the source DC XML but there is no reason why we could not use the classic metadata files, which have the same format for both eprint and published records. If we do that, we have a single read function that at some point can be swapped out in the future with a read from a database. The input in all cases can simply be the full path of the metadata files.
  2. Implement trivial logic that decides whether the input metadata is for an eprint or not -- right now this is a 2-line function simply checking the bibstem (extracted from the bibcode) and not the pathname, but to further future-proof this for the day when the bibcode/bibstem may not be there we would have a metadata field provide the same information from our ingest data model
  3. There is no 3: if the input is an eprint call the match_to_published function, and vice-versa

If I missed/misunderstood something let's have a chat.
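The routing described in points 2 and 3 could be sketched roughly as below. This is only an illustration, not the pipeline's actual code: `bibstem_of`, `is_eprint`, `dispatch`, and `match_to_eprint` are hypothetical names (only `match_to_published` is mentioned above), the metadata is assumed to be a dict with a `bibcode` key, and the set of eprint bibstems lists only `arXiv`, so the bibstems for earth science preprint servers would have to be added from the real configuration.

```python
from typing import Any, Callable, Dict, FrozenSet

# Assumption: only "arXiv" is listed; earth science preprint bibstems
# would need to be added from the real ingest configuration.
EPRINT_BIBSTEMS: FrozenSet[str] = frozenset({"arXiv"})


def bibstem_of(bibcode: str) -> str:
    """Return characters 5-9 of the bibcode (the bibstem), without dot padding."""
    return bibcode[4:9].rstrip(".")


def is_eprint(bibcode: str, eprint_bibstems: FrozenSet[str] = EPRINT_BIBSTEMS) -> bool:
    """Decide from the bibstem alone whether the record is an eprint."""
    return bibstem_of(bibcode) in eprint_bibstems


def dispatch(
    metadata: Dict[str, Any],
    match_to_published: Callable[[Dict[str, Any]], Any],
    match_to_eprint: Callable[[Dict[str, Any]], Any],
    eprint_bibstems: FrozenSet[str] = EPRINT_BIBSTEMS,
) -> Any:
    """Send eprints to the published-record matcher, and everything else the other way."""
    if is_eprint(metadata["bibcode"], eprint_bibstems):
        return match_to_published(metadata)
    return match_to_eprint(metadata)
```

If the bibcode is later replaced by a field from the ingest data model, only `is_eprint` would need to change in a layout like this; the caller and the input list format stay the same.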

golnazads commented 11 months ago

@aaccomazzi I did not know that the classic format can correctly read arXiv metadata. In that case we should remove the arXiv parser and use the classic reader for all input files. This way there is no need to identify what kind of metadata file the input is. That should solve the issue. Thank you. @seasidesparrow

golnazads commented 11 months ago

@seasidesparrow I have time today, so I can verify that the classic parser can read the arXiv file and extract all the information that the arXiv parser extracts. Let me know if you want me to do that.

seasidesparrow commented 11 months ago

I would need to make some changes on the classic side to have it output paths to the .abs files instead of the xml files, and then test it, so this isn't something I would want to deploy before next Tuesday at the earliest.

golnazads commented 11 months ago

@seasidesparrow Let me check it out. I will let you know what I find.

golnazads commented 11 months ago

@seasidesparrow I went ahead and implemented this and made a release; please check it out: https://github.com/adsabs/ADSDocMatchPipeline/releases/tag/v3.1.4 Now you can include earth science records among the arXiv records and submit them to the pipeline to get matched with publications, or include them among the pub records and submit them to get matched with arXiv records. You can also submit them separately. For example, if earth science records come in as eprints during the weekend, you can create an eprint.input list and process them then. Likewise, if they come in as pub records at any time, you can create pub.input and process them at once. Very flexible. Please let me know if there is any issue. Thank you. @aaccomazzi

seasidesparrow commented 11 months ago

Closed: superseded