CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

Compound model does not detect molecules starting with a number in tables #22

Open Spadet opened 2 years ago

Spadet commented 2 years ago

Hi,

I've been testing the AutoTableParser lately and ran into an issue with the Compound model. I created a parser for reduction potential with a nested Compound model, but some rows are not matched: when required=False is set for the compound, the value is detected but the associated compound is not. The Compound model on its own does not detect these molecules either. This is strange, since the AutoSentenceParser does detect them in sentences. I noticed that the affected molecules all begin with a number, so I put together a small CSV table to illustrate the problem, in case anyone wants to test:

Additives,Reduction potential (V)
Succinic anhydride,1.33
Vinylboronic acid pinacol ester,1.09
"4-Fluoro-1,3-dioxolan-2-one",1.14
"4-Vinyl-1,3-dioxolan-2-one",1.07

The last two molecules are not detected by my model, while the first two are.

If anyone has a suggestion, I would be glad to hear it! Thanks!

cjcourt commented 2 years ago

Hi @Spadet Can you check the tokenisation for those molecules? My first guess is that the tokeniser is splitting around the hyphen in these cases and then the Compound rules fail to match. It could also be that the compounds contain commas? @JurajMa?

JurajMa commented 2 years ago

Hi @Spadet Can you share your model classes, for both the Compound model and the other model? Is the compound detected when you define required=True in the Compound model?

Spadet commented 2 years ago

Hi, thanks to both of you for your answers!

> Hi @Spadet Can you check the tokenisation for those molecules? My first guess is that the tokeniser is splitting around the hyphen in these cases and then the Compound rules fail to match. It could also be that the compounds contain commas? @JurajMa?

Here are the elements of the table:

[Cell('1.33 🙃🙃🙃🙃 Succinic anhydride 🙃🙃🙃🙃 Reduction potential (V)', 0, 57),
 Cell('1.09 🙃🙃🙃🙃 Vinylboronic acid pinacol ester 🙃🙃🙃🙃 Reduction potential (V)', 0, 70),
 Cell('1.14 🙃🙃🙃🙃 4-Fluoro-1,3-dioxolan-2-one 🙃🙃🙃🙃 Reduction potential (V)', 0, 66),
 Cell('1.07 🙃🙃🙃🙃 4-Vinyl-1,3-dioxolan-2-one 🙃🙃🙃🙃 Reduction potential (V)', 0, 65),
 Cell('Succinic anhydride 🙃🙃🙃🙃 🙃🙃🙃🙃 \ufeffAdditives', 0, 40),
 Cell('Vinylboronic acid pinacol ester 🙃🙃🙃🙃 🙃🙃🙃🙃 \ufeffAdditives', 0, 53),
 Cell('4-Fluoro-1,3-dioxolan-2-one 🙃🙃🙃🙃 🙃🙃🙃🙃 \ufeffAdditives', 0, 49),
 Cell('4-Vinyl-1,3-dioxolan-2-one 🙃🙃🙃🙃 🙃🙃🙃🙃 \ufeffAdditives', 0, 48),
 Caption(id=None, references=[], text='')]

One element that fails to be recorded is:

Cell('1.07 🙃🙃🙃🙃 4-Vinyl-1,3-dioxolan-2-one 🙃🙃🙃🙃 Reduction potential (V)', 0, 65)

Once tokenized, I obtain this result:

[RichToken('1.07', 0, 4), RichToken('🙃🙃🙃🙃', 0, 0),
 RichToken('4', 0, 1), RichToken('-', 1, 2), RichToken('Vinyl', 2, 7), RichToken('-', 7, 8), RichToken('1,3', 8, 11), RichToken('-', 11, 12), RichToken('dioxolan', 12, 20), RichToken('-', 20, 21), RichToken('2', 21, 22), RichToken('-', 22, 23), RichToken('one', 23, 26),
 RichToken('🙃🙃🙃🙃', 0, 0),
 RichToken('Reduction', 0, 9), RichToken('potential', 10, 19), RichToken('(', 20, 21), RichToken('V', 21, 22), RichToken(')', 22, 23)]
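
To make the token dump easier to read, here is a small standard-library sketch (it does not use ChemDataExtractor) that places the compound-name tokens back at their reported offsets. The triples are copied from the output above; `rebuild` is a hypothetical helper written only for this illustration. It suggests that the name is split at every hyphen, that "1,3" stays a single token, and that the first token is the bare digit "4". Whether that leading digit is what makes the Compound rules fail is not established here.

```python
# (text, start, end) triples copied from the RichToken output above,
# restricted to the tokens that make up the compound name.
TOKENS = [
    ("4", 0, 1), ("-", 1, 2), ("Vinyl", 2, 7), ("-", 7, 8),
    ("1,3", 8, 11), ("-", 11, 12), ("dioxolan", 12, 20), ("-", 20, 21),
    ("2", 21, 22), ("-", 22, 23), ("one", 23, 26),
]


def rebuild(tokens):
    """Place each token at its offsets and return the resulting string."""
    length = max(end for _, _, end in tokens)
    chars = [" "] * length
    for text, start, end in tokens:
        # Each span must be exactly as long as the token text.
        assert end - start == len(text), f"bad span for {text!r}"
        chars[start:end] = text
    return "".join(chars)


name = rebuild(TOKENS)
leading_digit = TOKENS[0][0].isdigit()
print(name, len(TOKENS), leading_digit)
```

The offsets are contiguous, so no characters are lost in tokenisation; the name is simply spread over eleven tokens.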

Note: if I remove the "4-" at the beginning of the molecule name, it is detected correctly by the model. The commas should not be a problem, since the name is enclosed in quotation marks, right?

> Hi @Spadet Can you share your model classes, for both the Compound model and the other model? Is the compound detected when you define required=True in the Compound model?
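
As a quick check on the comma question, the sketch below reads the same table with Python's standard `csv` module rather than with ChemDataExtractor. The bytes start with a UTF-8 byte order mark on the assumption that the original file had one, since the parsed header cell shows '\ufeffAdditives'; `read_rows` is a hypothetical helper for this illustration only.

```python
import csv
import io

# The table from the issue, as bytes with a leading UTF-8 byte order mark
# (assumed, because the parsed header cell contains '\ufeff').
RAW = (
    b"\xef\xbb\xbfAdditives,Reduction potential (V)\n"
    b"Succinic anhydride,1.33\n"
    b"Vinylboronic acid pinacol ester,1.09\n"
    b'"4-Fluoro-1,3-dioxolan-2-one",1.14\n'
    b'"4-Vinyl-1,3-dioxolan-2-one",1.07\n'
)


def read_rows(raw, encoding):
    """Decode the bytes and return the CSV rows as lists of strings."""
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


rows = read_rows(RAW, "utf-8-sig")  # 'utf-8-sig' drops the byte order mark
rows_bom = read_rows(RAW, "utf-8")  # plain 'utf-8' keeps it in the first cell

for row in rows:
    print(row)
```

Every row comes back with exactly two fields, so the quoted commas stay inside the compound names, which matches the Cell output above. Decoding with 'utf-8-sig' would also remove the stray '\ufeff' from the header, although nothing in this thread ties that character to the detection problem.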

I used the base Compound model shipped with CDE 2.1.2. My other model is one I designed to retrieve reduction potentials:

compound = ModelType(Compound, required=True, contextual=True)
specifier_expression = (
    R('E(°|0)?red(uction|ox)?')
    | R('E(°|0)?') + R('red(uction|ox)?')
    | (I('electrochemical') | I('(ir)?reversible')) + I('reduction')
    | (I('redox') | I('reduction') | I('equilibrium')) + Optional(R('peak')) + R('(p|P)otentials?')
).add_action(join)
specifier = StringType(parse_expression=specifier_expression, required=True)
parsers = [AutoTableParser()]

When parsing the table with required=True, I retrieve the first two molecules (and their values), but not the last two. With required=False, I retrieve all four reduction potentials, but still only the first two molecules.
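
Since all four values are found, the specifier itself appears to match the column header. The sketch below illustrates this with plain `re` on the header tokens from the dump above. It is not ChemDataExtractor's matching code: it mimics only the last branch of specifier_expression, and it assumes that each pattern is tested against a single token and that `I(...)` is a case-insensitive word match. `matches_potential_branch` is a hypothetical name.

```python
import re

# Header tokens as they appear in the RichToken output.
HEADER_TOKENS = ["Reduction", "potential", "(", "V", ")"]


def matches_potential_branch(tokens):
    """Stand-in for: (redox | reduction | equilibrium) + optional 'peak'
    + '(p|P)otentials?', applied to the start of the token list."""
    i = 0
    # Assumed behaviour of I(...): case-insensitive match on one token.
    if i >= len(tokens) or not re.fullmatch(
        r"redox|reduction|equilibrium", tokens[i], re.IGNORECASE
    ):
        return False
    i += 1
    # Optional 'peak' token.
    if i < len(tokens) and re.fullmatch(r"peak", tokens[i]):
        i += 1
    return i < len(tokens) and re.fullmatch(r"(p|P)otentials?", tokens[i]) is not None


print(matches_potential_branch(HEADER_TOKENS))
```

Under these assumptions the header "Reduction potential (V)" satisfies the specifier, which is consistent with the values being extracted for every row and points back at the compound side of the record.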