adsabs / ADSDocMatchPipeline

Pipeline to match publisher document with preprint counterpart and vice versa
MIT License

Extend docmatching to include other preprint sources #24

Closed ehenneken closed 3 months ago

ehenneken commented 11 months ago

Currently, the pipeline needs to be able to match EarthArXiv (bibstem EaArX) and ESSOAr (bibstem esoar) records against publisher records and vice versa.

For these cases the situation is different, in the sense that these preprints are not in the Classic PRE database. They have doctype:eprint, but they live in the Classic GEN database. Given the bibcodes for these new preprints, the easiest way to retrieve their metadata (for the matching process) is through Solr queries, via the API.

Use the Pilot Project for EarthArXiv and ESSOAr to help implement this (e.g. 2021esoar.10508651L and 2022ApJ...938..138L).
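
As a rough illustration of the Solr-via-API retrieval described above, the sketch below builds a search request for a single bibcode against the public ADS search endpoint and returns the first matching document. The helper names (`build_query_url`, `fetch_metadata`) and the choice of returned fields are assumptions made for this example, not part of the pipeline; a valid ADS API token is needed for the actual request.

```python
import json
import urllib.parse
import urllib.request

# Public ADS search endpoint; queries are forwarded to Solr.
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"

# Fields assumed to be useful for matching (illustrative selection).
FIELDS = ["bibcode", "title", "author", "abstract", "doctype", "doi", "year"]


def build_query_url(bibcode, fields=FIELDS):
    """Build the search URL that asks for one record by bibcode."""
    params = {
        "q": f'bibcode:"{bibcode}"',
        "fl": ",".join(fields),
        "rows": 1,
    }
    return ADS_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_metadata(bibcode, token):
    """Return the metadata dict for `bibcode`, or None if nothing is found.

    `token` is a personal ADS API token (hypothetical parameter name).
    """
    request = urllib.request.Request(
        build_query_url(bibcode),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        docs = json.load(response)["response"]["docs"]
    return docs[0] if docs else None
```

For example, `fetch_metadata("2021esoar.10508651L", token)` would request the ESSOAr record from the pilot project; keeping URL construction separate from the network call makes the query easy to inspect without a token.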

golnazads commented 11 months ago

@aaccomazzi @ehenneken @seasidesparrow Is there any reason you do not want to read the metadata from Solr for all the records? I think this mixed approach is messy! Right now the pipeline expects a filename, as you know. I am guessing you are going to provide either a filename or a bibcode, or both. I understand you want to move away from the classic file system, so my question is: why not do that for all the records? That would be efficient and elegant, lol.

aaccomazzi commented 11 months ago

The solution is simple: make the input file uniform so that it consists of a list of filenames in classic format (which is currently what we use for published records). Then all the I/O is the same and the only logic to implement is:

record = read_metadata(input_file)
if is_eprint(record):
    matches = match_published(record)
else:
    matches = match_eprint(record)

Any objections? If not, I can easily make this happen.
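
For illustration only, the dispatch proposed above could look like the following in Python. This is a minimal sketch that assumes a record is a dict carrying a `doctype` key (the issue notes that these preprints have doctype:eprint); `find_matches` and the two stand-in matchers are hypothetical names, and the real matching functions would be supplied by the pipeline.

```python
def is_eprint(record):
    """Return True when the record is a preprint (doctype 'eprint')."""
    return record.get("doctype") == "eprint"


def find_matches(record, match_published, match_eprint):
    """Route a record to the matcher for the opposite kind of record.

    A preprint is matched against published records; anything else is
    matched against preprints.
    """
    if is_eprint(record):
        return match_published(record)
    return match_eprint(record)


# Stand-in matchers used only to show the routing (not the real logic).
def fake_match_published(record):
    return [("published-candidate-for", record["bibcode"])]


def fake_match_eprint(record):
    return [("eprint-candidate-for", record["bibcode"])]


# Example records; the doctype of the ApJ record is assumed here.
preprint = {"bibcode": "2021esoar.10508651L", "doctype": "eprint"}
article = {"bibcode": "2022ApJ...938..138L", "doctype": "article"}

preprint_matches = find_matches(preprint, fake_match_published, fake_match_eprint)
article_matches = find_matches(article, fake_match_published, fake_match_eprint)
```

Passing the matchers in as arguments keeps the routing independent of where the metadata was read from, whether a classic-format file or a Solr query.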

golnazads commented 11 months ago

That was my understanding at the beginning. But now I am thinking, don't you want to move away from the file system? Also, if you grab the metadata from Solr, the issue of missing records gets solved automatically. @aaccomazzi @ehenneken @seasidesparrow

aaccomazzi commented 11 months ago

Eventually, yes. But for the time being we don't have anything to replace it with, since we'd like to do the record matching as soon as possible and not wait until the records are in Solr.
Maybe this is something that should be considered soon, though, so that we at least have it on our radar when we build the architecture that manages different metadata assets. cc @kelockhart and @tjacovich

seasidesparrow commented 3 months ago

In production since early 2024.