A user points out that we don't index the + and - symbols found in chemical formulas (e.g. He+). This also applies to astronomical object names of the form J1234+5678. So I went back to look at how classic deals with these cases, and this is what I found in its indexing tokenizer test suite. In the OUTPUT lines, the repetition of strings is classic's way of creating multiple tokens to be indexed from the same input text. A / appended to the end of a string indicates an acronym.
INPUT: chemical elements: Fe iv and Mg IX and Si III fex.
OUTPUT: CHEMICAL ELEMENTS FEIV MGIX SIIII FEX CHEMICAL ELEMENTS FE IV MG IX/ SI III/ FEX CHEMICAL ELEMENTS FEIV MGIX SIIII FEX CHEMICAL ELEMENTS FE IV MG IX SI III FEX
INPUT: Comets and small planets: C/1999, P/2000, and S/2001 but not X/2002
OUTPUT: COMETS SMALL PLANETS C 1999 P 2000 S 2001 X 2002
INPUT: Planetary objects: (1234) but not ( 5678) nor (GEMS)
OUTPUT: PLANETARY OBJECTS (1234) 5678 NOR (GEMS) PLANETARY OBJECTS 1234 5678 NOR GEMS/ PLANETARY OBJECTS (1234) 5678 NOR (GEMS) PLANETARY OBJECTS 1234 5678 NOR GEMS
INPUT: Find X-ray bursts in galaxy
OUTPUT: FIND XRAY BURSTS GALAXY FIND X RAY BURSTS GALAXY
INPUT: This is a test for the XMM-Newton Acronym
OUTPUT: TEST XMM-NEWTON ACRONYM TEST XMM/ NEWTON ACRONYM TEST XMM-NEWTON ACRONYM TEST XMM NEWTON ACRONYM XMM-
INPUT: Uppercase object with translations: HD 1234 and HD-1234
OUTPUT: UPPERCASE OBJECT TRANSLATIONS HD1234 HD1234 UPPERCASE OBJECT TRANSLATIONS HD/ 1234 HD/ 1234 UPPERCASE OBJECT TRANSLATIONS HD1234/ HD1234/ UPPERCASE OBJECT TRANSLATIONS HD 1234 HD 1234
INPUT: Lowercase object with translations: hd 1234 and hd-1234
OUTPUT: LOWERCASE OBJECT TRANSLATIONS HD1234 HD1234 LOWERCASE OBJECT TRANSLATIONS HD 1234 HD 1234
We should review this list and decide what needs to get implemented in SOLR.
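As a starting point for that discussion, here is a rough Python sketch of the general idea for the + and - cases only: keep the token as written, and for tokens with an internal + or - also emit a joined form and the separate parts. It does not reproduce classic's rules (no stopword removal, no acronym marker, no Roman numeral handling), and `TOKEN_RE`, `variants` and `tokenize` are hypothetical names made up for illustration, not anything from classic or SOLR.

```python
import re

# Assumed token shape (hypothetical): runs of letters/digits, optionally
# joined by '+' or '-', optionally ending in '+' or '-'.
# Examples: "He+", "J1234+5678", "X-ray".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[+-][A-Za-z0-9]+)*[+-]?")


def variants(token):
    """Return the forms to index for one token, original form first."""
    upper = token.upper()
    out = [upper]
    # Split on the symbols and drop empty pieces (e.g. after a trailing '+').
    parts = [p for p in re.split(r"[+-]", upper) if p]
    if len(parts) > 1:
        # Add the joined form, then each part, skipping duplicates.
        for v in ["".join(parts)] + parts:
            if v not in out:
                out.append(v)
    return out


def tokenize(text):
    """Return one list of variants per token found in the text."""
    return [variants(m.group(0)) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for forms in tokenize("He+ emission from J1234+5678 in X-ray"):
        print(forms)
```

With this sketch a trailing symbol is preserved as part of the token (He+ stays a single token), while an internal symbol produces the written, joined and split forms. Whether we want all three forms, or only some of them as classic does for X-ray, is one of the things to decide.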