adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

Create Origin field in SOLR document #149

Open aaccomazzi opened 7 years ago

aaccomazzi commented 7 years ago

Right now there is no Origin field in our Solr documents, presumably because it is difficult to figure out which source files were used to generate the output Solr document according to the merging rules. The output of ADSCachedExports contains information about where the metadata/data came from, e.g.

  <record bibcode="2017SenIm..18...17Z" entry_date="2017-05-24">
    <metadata origin="SPRINGER" type="general" primary="True" alternate_journal="False">
      <creation_time>2017-05-31T23:55:39Z</creation_time>
      ...
    </metadata>
    <metadata origin="ADS metadata" type="properties" primary="False" alternate_journal="False">
    ...
    </metadata>
    <metadata origin="Springer" source="iss1.springer.xml" type="references" primary="False" alternate_journal="False">
    ...
    </metadata>

We have two options:

  1. Do the right thing and keep provenance of what gets merged. Then the origin field should be the set of origins which were used to build the merged document. This may be difficult because the merging process may not be coded in a way that makes provenance tracking possible.
  2. Be lazy and just collect the list of origins out of the ADS Exports document and use that instead
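
A minimal sketch of option 2, assuming the export record is available as an XML string shaped like the excerpt above. The function name `collect_origins` and the trimmed `SAMPLE` record are hypothetical and not part of the pipeline; the sketch only reads the `origin` attribute of each `<metadata>` block and does not normalize case, so "SPRINGER" and "Springer" stay distinct.

```python
import xml.etree.ElementTree as ET

# Trimmed-down record modelled on the excerpt above (block contents omitted).
SAMPLE = """<record bibcode="2017SenIm..18...17Z" entry_date="2017-05-24">
  <metadata origin="SPRINGER" type="general" primary="True" alternate_journal="False"/>
  <metadata origin="ADS metadata" type="properties" primary="False" alternate_journal="False"/>
  <metadata origin="Springer" source="iss1.springer.xml" type="references" primary="False" alternate_journal="False"/>
</record>"""


def collect_origins(record_xml):
    """Return the distinct origin values of a record's <metadata> blocks, in document order."""
    record = ET.fromstring(record_xml)
    origins = []
    for block in record.iter("metadata"):
        origin = block.get("origin")
        # Skip blocks without an origin and drop exact duplicates.
        if origin and origin not in origins:
            origins.append(origin)
    return origins


print(collect_origins(SAMPLE))  # ['SPRINGER', 'ADS metadata', 'Springer']
```

This lists every origin present in the export, whether or not its data survived the merge, which is exactly the trade-off that separates option 2 from option 1.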

romanchyla commented 6 years ago

How is this field going to be used/useful?

aaccomazzi commented 6 years ago

We have an output format which requires the "origin" field. We also have use cases (for our own curation) where we want to find records which contain data from a particular origin.

romanchyla commented 6 years ago

Added 'origin' to Solr as a multivalued, string, searchable, stored field. Whatever outputs the list of values can indicate their relative importance by sorting them (most important first); other than that, I don't see value in trying to encode more details.
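
One way the "most important first" ordering could be sketched, assuming the `primary` attribute from the export is used as the importance signal. That choice is an assumption for illustration, not something decided in this thread, and `order_origins` and the input dict shape are hypothetical.

```python
def order_origins(blocks):
    """Order origins so that those from primary blocks come first.

    `blocks` is a list of dicts with 'origin' and 'primary' keys, mirroring
    the attributes of the <metadata> elements (hypothetical input shape).
    """
    # sorted() is stable, so blocks keep their original order within each group.
    ranked = sorted(blocks, key=lambda b: b.get("primary") != "True")
    seen = set()
    ordered = []
    for block in ranked:
        origin = block.get("origin")
        if origin and origin not in seen:
            seen.add(origin)
            ordered.append(origin)
    return ordered


blocks = [
    {"origin": "ADS metadata", "primary": "False"},
    {"origin": "SPRINGER", "primary": "True"},
    {"origin": "Springer", "primary": "False"},
]
print(order_origins(blocks))  # ['SPRINGER', 'ADS metadata', 'Springer']
```

The resulting list could be written to the multivalued field as-is, since the order of values is the only extra information being encoded.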

romanchyla commented 6 years ago

OK, so the aip merger will need to be updated. I'm putting this on the back burner; the field is not too important, but it is a pain to get it out.

sb will get back to it