adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

UAT keywords are not being normalized in solr_adapter _keyword_norm #271

Open seasidesparrow opened 2 years ago

seasidesparrow commented 2 years ago

See 2022ApJ...927....1M:

    "keyword": ["1964", "1483", "1989", "1485", "2009", "1974", "1477", "1503", "1476", "1533", "1493", "2170", "Astrophysics - Solar and Stellar Astrophysics"],
    "keyword_norm": ["-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-"],
    "keyword_schema": ["UAT", "UAT", "UAT", "UAT", "UAT", "UAT", "UAT", "UAT", "UAT", "UAT", "UAT", "UAT", "arXiv"],

seasidesparrow commented 2 years ago

Normalization should be happening in adspy as part of Exports. The export code includes the line

    import ads.Keywords as kw_normalizer

and attempts the normalization via

    normalized_keywords = ', '.join(kw_normalizer.get_normalized_keywords(kws))

In Keywords, this is a straightforward function:

    def get_normalized_keyword(keyword):
        """Returns a normalized keyword."""
        normalized_keyword = None
        if keyword in KEYW2NORM:
            normalized_keyword = KEYW2NORM[keyword].strip()
        else:
            normalized_keyword = normalize_keyword(keyword).strip()
        if normalized_keyword in ASTKEYWORDS:
            return normalized_keyword.replace(', ', ' ')
        else:
            return None

so I wonder if the UAT data aren't being passed via config properly?
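A minimal sketch of how the lookup above could produce the all-"-" keyword_norm field. The table contents, the normalize_keyword stand-in, and the keyword_norm_field helper (including the idea that a None result is written out as "-") are assumptions for illustration, not the actual adspy or solr_adapter code:

```python
# Hypothetical stand-ins for the lookup tables in ads.Keywords; the real
# contents are not shown in this thread.
KEYW2NORM = {"sun: flares": "sun flares"}
ASTKEYWORDS = {"sun flares"}


def normalize_keyword(keyword):
    """Stand-in normalizer (assumption): lowercase and collapse whitespace."""
    return " ".join(keyword.lower().split())


def get_normalized_keyword(keyword):
    """Same logic as the function quoted above."""
    if keyword in KEYW2NORM:
        normalized_keyword = KEYW2NORM[keyword].strip()
    else:
        normalized_keyword = normalize_keyword(keyword).strip()
    if normalized_keyword in ASTKEYWORDS:
        return normalized_keyword.replace(", ", " ")
    return None


def keyword_norm_field(keywords, placeholder="-"):
    """Assumed adapter behaviour: write a placeholder when the lookup gives None."""
    return [get_normalized_keyword(kw) or placeholder for kw in keywords]
```

Under these assumptions, any keyword that is missing from both tables, such as a numeric UAT ID like "1964", falls through to None and would show up as "-", which matches the record above.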

seasidesparrow commented 2 years ago

The keyw2norm.pickle file in adspy/etc is dated January 2011, so it predates UAT.
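One way to confirm this would be to load the pickle and check whether the UAT IDs from the record appear in it at all. This is only a sketch: it assumes the file holds a dict-like mapping keyed by keyword, and missing_keywords is a hypothetical helper, not part of adspy.

```python
import pickle


def missing_keywords(pickle_path, keywords):
    """Return the keywords that have no entry in the pickled mapping."""
    with open(pickle_path, "rb") as fh:
        # encoding="latin1" lets Python 3 read pickles written by Python 2;
        # that the 2011 file needs this is an assumption.
        mapping = pickle.load(fh, encoding="latin1")
    return [kw for kw in keywords if kw not in mapping]
```

Calling it with the path to keyw2norm.pickle and a few IDs such as "1964" and "1483" would list whichever of them the mapping does not know about.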