Tokenization of chemical formulas and other oddities

A user points out that we don't index the + and - symbols found in chemical formulas (e.g. He+). The same applies to astronomical object names of the form J1234+5678. I went back to look at how classic handles these cases, and this is what I found in its indexing tokenizer test suite. The repeated strings in each OUTPUT are classic's way of creating multiple tokens to be indexed from the same input text. A / appended to a string marks it as an acronym.
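
To make the reported problem concrete, here is a minimal Python sketch (not our actual Solr analyzer chain, and not classic's code). It compares a naive alphanumeric pattern, which loses the signs, with a hypothetical sign-aware pattern. Both regexes and both function names are assumptions made up for illustration.

```python
import re

# Naive word pattern: runs of letters and digits only, so "+" and "-" are dropped.
NAIVE = re.compile(r"[A-Za-z0-9]+")

# Hypothetical pattern: keeps a sign that joins two alphanumeric runs
# (J1234+5678) or that trails a run (He+).
SIGN_AWARE = re.compile(r"[A-Za-z0-9]+(?:[+-][A-Za-z0-9]+)*[+-]?")


def naive_tokens(text):
    """Split on anything that is not a letter or digit."""
    return NAIVE.findall(text)


def sign_aware_tokens(text):
    """Split like naive_tokens, but keep embedded and trailing + / - signs."""
    return SIGN_AWARE.findall(text)


print(naive_tokens("He+ near J1234+5678"))
print(sign_aware_tokens("He+ near J1234+5678"))
```

Note that the sign-aware pattern would also keep hyphenated words such as X-ray as a single token, so it would have to be combined with whatever we decide for the hyphen cases below.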

INPUT:  chemical elements: Fe iv and Mg IX and Si III fex.
OUTPUT: CHEMICAL ELEMENTS FEIV MGIX SIIII FEX CHEMICAL ELEMENTS FE IV MG IX/ SI III/ FEX CHEMICAL ELEMENTS FEIV MGIX SIIII FEX CHEMICAL ELEMENTS FE IV MG IX SI III FEX

INPUT:  Comets and small planets: C/1999, P/2000, and S/2001 but not X/2002
OUTPUT: COMETS SMALL PLANETS C 1999 P 2000 S 2001 X 2002

INPUT:  Planetary objects: (1234) but not ( 5678) nor (GEMS)
OUTPUT: PLANETARY OBJECTS (1234) 5678 NOR (GEMS) PLANETARY OBJECTS 1234 5678 NOR GEMS/ PLANETARY OBJECTS (1234) 5678 NOR (GEMS) PLANETARY OBJECTS 1234 5678 NOR GEMS

INPUT:  Find X-ray bursts in galaxy
OUTPUT: FIND XRAY BURSTS GALAXY FIND X RAY BURSTS GALAXY
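
The X-ray case suggests that classic indexes a hyphenated word both joined and split. A rough Python sketch of that idea follows; `hyphen_variants` is a hypothetical helper, and the rule is only inferred from this one test pair. It does not cover the XMM-Newton case below, where classic keeps the hyphen in the combined token.

```python
def hyphen_variants(token):
    """Return the joined form followed by the split parts, all uppercased.

    Tokens without a hyphen come back as a single uppercased token.
    """
    parts = [p for p in token.upper().split("-") if p]
    if len(parts) < 2:
        return parts
    return ["".join(parts)] + parts


print(hyphen_variants("X-ray"))
print(hyphen_variants("galaxy"))
```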

INPUT:  This is a test for the XMM-Newton Acronym
OUTPUT: TEST XMM-NEWTON ACRONYM TEST XMM/ NEWTON ACRONYM TEST XMM-NEWTON ACRONYM TEST XMM NEWTON ACRONYM XMM-

INPUT:  Uppercase object with translations: HD 1234 and HD-1234
OUTPUT: UPPERCASE OBJECT TRANSLATIONS HD1234 HD1234 UPPERCASE OBJECT TRANSLATIONS HD/ 1234 HD/ 1234 UPPERCASE OBJECT TRANSLATIONS HD1234/ HD1234/ UPPERCASE OBJECT TRANSLATIONS HD 1234 HD 1234

INPUT:  Lowercase object with translations: hd 1234 and hd-1234
OUTPUT: LOWERCASE OBJECT TRANSLATIONS HD1234 HD1234 LOWERCASE OBJECT TRANSLATIONS HD 1234 HD 1234
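
Comparing the uppercase and lowercase HD cases, the acronym marker seems to depend on the case of the original input. My guess, inferred only from the examples above and not from classic's source, is that a token gets the / when it has at least two characters and all of its letters are uppercase. A small Python sketch of that guessed rule (`mark_acronym` is a hypothetical name):

```python
def mark_acronym(token):
    """Uppercase a token and append "/" if it looks like an acronym.

    Guessed rule: length >= 2 and every cased character is uppercase.
    str.isupper() is False for digit-only strings, so "1234" is not marked.
    """
    upper = token.upper()
    is_acronym = len(token) >= 2 and token.isupper()
    return upper + "/" if is_acronym else upper


for tok in ["HD", "hd", "IX", "iv", "Fe", "GEMS", "HD1234", "1234", "C"]:
    print(tok, "->", mark_acronym(tok))
```

This matches HD/ vs HD, IX/ and III/ vs IV, GEMS/ and HD1234/ in the repetitions that carry markers, but it should be checked against classic before we rely on it.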

We should review this list and decide which of these behaviors need to be implemented in Solr.

adsabs / montysolr

Tokenization of chemical formulas and other oddities #165