Closed (ehenneken closed 3 months ago)
@aaccomazzi @ehenneken @seasidesparrow Is there any reason you do not want to read the metadata from solr for all the records? I think the current approach is messy! Right now the pipeline expects a filename, as you know. I am guessing you are going to provide either a filename or a bibcode, or both. I understand you want to move away from the classic file system, so my question is: why not do that for all the records? That would be efficient and elegant, lol.
The solution is simple: make the input file uniform so that it consists of a list of filenames in classic format (which is currently what we use for published records). Then all the I/O is the same and the only logic to implement is:
    record = read_metadata(input_file)
    if is_eprint(record):
        matches = match_published(record)
    else:
        matches = match_eprint(record)
Any objections? If not, I can easily make this happen.
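The dispatch proposed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Record` class, the doctype-based `is_eprint` check, and the `match_record` wrapper are hypothetical stand-ins, not the pipeline's actual code, and the two matcher functions are passed in as arguments so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    """Hypothetical minimal record; the real pipeline's records carry more fields."""
    bibcode: str
    doctype: str  # e.g. "eprint" or "article"


def is_eprint(record: Record) -> bool:
    # Assumption: an eprint can be recognized from its doctype alone.
    return record.doctype == "eprint"


def match_record(
    record: Record,
    match_published: Callable[[Record], List[str]],
    match_eprint: Callable[[Record], List[str]],
) -> List[str]:
    """Route a record to the matcher for the opposite kind of record."""
    if is_eprint(record):
        # An eprint is matched against published records.
        return match_published(record)
    # A published record is matched against eprints.
    return match_eprint(record)
```

With stub matchers, `match_record(Record("2021esoar.10508651L", "eprint"), lambda r: ["published"], lambda r: ["eprint"])` takes the first branch, so the I/O stays identical for both kinds of input and only the matcher differs.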
That was my understanding at the beginning. But now I am thinking, don't you want to move away from the file system? Also, if you grab the metadata from solr, the issue of missing records gets solved automatically. @aaccomazzi @ehenneken @seasidesparrow
Eventually, yes. But for the time being we don't have anything else to replace it with, since we'd like to do the record matching as soon as possible and not wait until the records are in solr.
Maybe this is something that should be considered soon, though, so that we at least have it on our radar when we build the architecture that manages different metadata assets.
cc @kelockhart and @tjacovich
In production since early 2024.
Currently, the pipeline needs to be able to match EarthArXiv (bibstem EaArX) and ESSOAr (bibstem esoar) records against publisher records and vice versa. For these cases the situation is different in the sense that these preprints are not in the Classic PRE database. They have doctype:eprint, but they live in the Classic GEN database. Given the bibcodes for these new preprints, the easiest way to retrieve their metadata (for the matching process) is through Solr queries, via the API. Use the Pilot Project for EarthArXiv and ESSOAr to help implement this (e.g. 2021esoar.10508651L and 2022ApJ...938..138L).