adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

Duplication of keywords in SOLR #63

Closed aaccomazzi closed 9 years ago

aaccomazzi commented 9 years ago

For the paper 2000A&AS..143...41K the SOLR field "keyword" has three times as many entries as it should:

    "keyword": [
        "methods data analysis",
        "-",
        "-",
        "-",
        "astrophysics",
        "Astrophysics",
        "METHODS: DATA ANALYSIS",
        "ASTRONOMICAL DATABASES: MISCELLANEOUS",
        "PUBLICATIONS: BIBLIOGRAPHY",
        "SOCIOLOGY OF ASTRONOMY",
        "Astrophysics",
        "methods data analysis",
        "-",
        "-",
        "-",
        "astrophysics"
    ],

The correct list should be:

    "keyword": [
        "METHODS: DATA ANALYSIS",
        "ASTRONOMICAL DATABASES: MISCELLANEOUS",
        "PUBLICATIONS: BIBLIOGRAPHY",
        "SOCIOLOGY OF ASTRONOMY",
        "Astrophysics"
    ],

The extra entries seem to be normalized versions from this set, plus a duplicate from arXiv.
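To make the surplus concrete, here is a minimal sketch that compares the two lists quoted above. It uses only the Python standard library; the variable names `observed`, `expected`, and `surplus` are illustrative and not taken from the pipeline code.

```python
from collections import Counter

# Keyword list currently stored in SOLR for 2000A&AS..143...41K (copied from the report).
observed = [
    "methods data analysis", "-", "-", "-", "astrophysics",
    "Astrophysics",
    "METHODS: DATA ANALYSIS",
    "ASTRONOMICAL DATABASES: MISCELLANEOUS",
    "PUBLICATIONS: BIBLIOGRAPHY",
    "SOCIOLOGY OF ASTRONOMY",
    "Astrophysics",
    "methods data analysis", "-", "-", "-", "astrophysics",
]

# Keyword list that the report says should be stored.
expected = [
    "METHODS: DATA ANALYSIS",
    "ASTRONOMICAL DATABASES: MISCELLANEOUS",
    "PUBLICATIONS: BIBLIOGRAPHY",
    "SOCIOLOGY OF ASTRONOMY",
    "Astrophysics",
]

# Counter subtraction keeps only positive counts, i.e. entries that
# occur more often in `observed` than in `expected`.
surplus = Counter(observed) - Counter(expected)

for keyword, extra in surplus.items():
    print(f"{extra} extra occurrence(s) of {keyword!r}")
```

For this record the surplus is 11 entries: two copies of the five-item lowercased/placeholder block plus one repeated "Astrophysics".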

vsudilov commented 9 years ago

Yep, keywords are a mess for sure. I could never figure out how to reproduce what is in the current document, or even if we want that.

It is trivial to change the relevant fields in SolrUpdater.py, but the decision as to what to index/store needs to be made beforehand.

@aaccomazzi could you take this on, or at the very least tell me what we want?

Note that there are at least 3 keyword-ish fields.

aaccomazzi commented 9 years ago

Fixed in c75d37a7a91a1dcb303e8eb1f9af65434abc26a5

aaccomazzi commented 9 years ago

Still see some repetition, e.g. for 2014arXiv1406.4542H, I see this:

    "keyword": [
        "Computer Science - Digital Libraries",
        "Astrophysics - Instrumentation and Methods for Astrophysics",
        "methods numerical",
        "-",
        "Computer Science - Digital Libraries",
        "Astrophysics - Instrumentation and Methods for Astrophysics"
    ],

vsudilov commented 9 years ago

@aaccomazzi, did you want to be the assignee for this? I can pass the list through set() if that would solve it.

aaccomazzi commented 9 years ago

Nope, something is wrong with the way the keywords are duplicated and include the normalized version, so they do not match keyword_schema any longer. Needs looking into.
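As a rough illustration of why de-duplication alone would not be enough, the sketch below removes repeats from the 2014arXiv1406.4542H list quoted earlier. The helper name `dedupe_keep_order` is hypothetical and is not part of SolrUpdater.py; it relies on `dict.fromkeys` preserving insertion order (Python 3.7+), whereas a plain `set()` would also discard the original ordering.

```python
# Keyword list reported for 2014arXiv1406.4542H.
keywords = [
    "Computer Science - Digital Libraries",
    "Astrophysics - Instrumentation and Methods for Astrophysics",
    "methods numerical",
    "-",
    "Computer Science - Digital Libraries",
    "Astrophysics - Instrumentation and Methods for Astrophysics",
]


def dedupe_keep_order(values):
    """Drop exact repeats while keeping the first occurrence of each value.

    Hypothetical helper for illustration only.
    """
    return list(dict.fromkeys(values))


deduped = dedupe_keep_order(keywords)
for keyword in deduped:
    print(keyword)
```

The exact repeats disappear, but "methods numerical" and the "-" placeholder are still in the result, so the list would still differ from what the source record provides.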

vsudilov commented 9 years ago

Assigned to Roman, fixed?

vsudilov commented 9 years ago

@romanchyla, @aaccomazzi, am I correct in thinking this has been solved? Can I close?

aaccomazzi commented 9 years ago

Fixed AFAICT