Closed by aaccomazzi 9 years ago
Yep, keywords are a mess for sure. I could never figure out how to reproduce what is in the current document, or even if we want that.
It is trivial to change the relevant fields in SolrUpdater.py, but the decision as to what to index/store needs to be made beforehand.
@aaccomazzi could you take this on, or at the very least tell me what we want?
Note that there are at least 3 keyword-ish fields.
Fixed in c75d37a7a91a1dcb303e8eb1f9af65434abc26a5
Still see some repetition, e.g. for 2014arXiv1406.4542H, I see this:
"keyword": [
"Computer Science - Digital Libraries",
"Astrophysics - Instrumentation and Methods for Astrophysics",
"methods numerical",
"-",
"Computer Science - Digital Libraries",
"Astrophysics - Instrumentation and Methods for Astrophysics"
],
@aaccomazzi, did you want to be the assignee for this? I can pass the list through set() if that would solve it.
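For reference, a minimal sketch of that deduplication, assuming the keywords arrive as a plain Python list. The function name `dedupe_keywords` is hypothetical and is not something taken from SolrUpdater.py. A bare set() would drop the repeats but also lose the original order, so the sketch uses dict.fromkeys, which keeps the first occurrence of each entry:

```python
def dedupe_keywords(keywords):
    """Drop exact duplicates, keeping the first occurrence of each keyword."""
    # dict keys preserve insertion order (Python 3.7+), unlike set()
    return list(dict.fromkeys(keywords))


# The list reported above for 2014arXiv1406.4542H
keywords = [
    "Computer Science - Digital Libraries",
    "Astrophysics - Instrumentation and Methods for Astrophysics",
    "methods numerical",
    "-",
    "Computer Science - Digital Libraries",
    "Astrophysics - Instrumentation and Methods for Astrophysics",
]
deduped = dedupe_keywords(keywords)
```

This only removes exact repeats: the stray "-" entry and anything that differs in spelling from its original would still be left in the list.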
Nope, something is wrong with the way the keywords are duplicated and include the normalized version, so they do not match keyword_schema any longer. Needs looking into.
Assigned to Roman, fixed?
@romanchyla, @aaccomazzi, am I correct in thinking this has been solved? Can I close?
Fixed AFAICT
For the paper 2000A&AS..143...41K the SOLR field "keyword" has three times as many entries as it should:
The correct list should be:
The extra entries seem to be normalized versions from this set, plus a duplicate from arXiv.
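One way to test that hypothesis would be to group the entries by a normalized key and keep only the first spelling seen for each key. The sketch below is an assumption throughout: `normalize` is a hypothetical stand-in (lowercase, punctuation replaced by spaces, whitespace collapsed) and not the pipeline's actual normalization, `collapse_normalized` is likewise a made-up helper, and the sample keywords are illustrative rather than taken from the record.

```python
import re


def normalize(keyword):
    """Hypothetical normalization: lowercase, punctuation to spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", keyword.lower())
    return " ".join(cleaned.split())


def collapse_normalized(keywords):
    """Keep the first spelling for each normalized key; skip entries that normalize to nothing."""
    seen = set()
    kept = []
    for kw in keywords:
        key = normalize(kw)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(kw)
    return kept


# Illustrative input only (not the actual keywords of any record)
sample = ["Methods: numerical", "methods numerical", "-", "Methods: numerical"]
collapsed = collapse_normalized(sample)
```

If the field shrinks back to the expected list under a filter like this, that would support the idea that the extras are normalized copies. It would not explain where in the pipeline they are being added.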