CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

NmrSpectrum does not work well #25

Open maoliyun opened 1 year ago

maoliyun commented 1 year ago

Hey, I installed ChemDataExtractor2 on Windows 11 using pip. The installation finished without error. However, when I use NmrSpectrum on my own data, the output is incorrect and incomplete, whereas with the data used in the tests the output is better. Here is my code:

import chemdataextractor
from chemdataextractor import Document
from chemdataextractor.doc import Document, Heading, Paragraph
from chemdataextractor.model import Compound, NmrSpectrum, MeltingPoint, NmrPeak

doc = Document(
    Heading('1H NMR (300 MHz, CDCl3), 1.00 (t, J = 7.3 Hz, 3H), 1.50 (m, 2H), 1.77 (m, 2H), 2.42 (s, 3H), 2.83–2.71 (m, 3H), 6.36 (s, 1H), 7.26 (d, J = 8.7 Hz, 2H), 7.30–7.39 (m, 7H), 7.50 (d, J = 8.2 Hz, 2H), 7.54 (d, J = 8.7 Hz, 2H);'),
    Paragraph('1H NMR (300 MHz, CDCl3), 8.00 (d, J = 9.5 Hz, H-4 or H-5), 7.87 (d, J = 8.0 Hz, NH), 7.61 ( d, J = 8.5Hz, 2H of C6H4CN), 7.56 (d, J = 9.0Hz, 2H of C6H4CF3);'),
    Paragraph('1H NMR (CDCl3 with 0.05% v/v TMS, 400 MHz): δH 7.10 (2H, d, J = 8.9 Hz, H2′ and H6′), '
              '7.03-7.07 (3H, m, H3′′, H4′′ and H5′′), 6.83-6.85 (2H, m, H2′′ and H6′′), '
              '6.66 (2H, d, J = 8.9 Hz, H3′ and H5′), 6.42 (1H, d, J = 1.8 Hz, H5), 6.26 (1H, d, J = 1.7 Hz, H7), '
              '5.18 (1H, s, H1′′′), 5.01 (1H, d, J = 6.6 Hz, H1), 4.52 (1H, s, H2′′′), 4.27 (1H, d, J = 14.2 Hz, H3), '
              '4.15 (1H, br d, J = 11.2 Hz, H4′′′), 4.05 (1H, t, J = 11.2 Hz, H3b′′′), 3.88 (1H, J = 14.3, 6.8 Hz, H2), '
              '3.86 (3H, s, OCH38), 3.69 (3H, s, OCH34′), 3.64 (3H, s, COOCH32), 3.49 (3H, br s, H5′′′ and H6′′′), '
              '3.43-3.47 (1H, overlapped, H3a′′′), 3.45 (3H, s, OCH32′′′).'),
    models=[NmrSpectrum])

doc[1].records.serialize()

Output:

[{'Compound': {'names': ['CDCl3']}},
 {'Compound': {'names': ['H']}},
 {'Compound': {'names': ['NH']}},
 {'Compound': {'names': ['C6H4CN']}},
 {'Compound': {'names': ['C6H4CF3']}},
 {'NmrSpectrum': {'nucleus': '1H', 'solvent': 'CDCl3', 'frequency': '300', 'frequency_units': 'MHz', 'peaks': [{'NmrPeak': {'shift': '8.00'}}], 'compound': {'Compound': {}}}}]

Can you offer some advice?

ViktorWeissenborn commented 1 year ago

Hello, I encountered problems similar to those described by maoliyun above. I wanted to try out some of the models that come with CDE2 by default, but neither the NmrSpectrum nor the NmrPeak model seems to work. The code example below reproduces a case similar to maoliyun's, but in my case only results from the Compound model are shown.

from chemdataextractor.doc import Document,Heading, Paragraph
from chemdataextractor.model import Compound, NmrSpectrum

nmr_heading = ("4.22. (E)-3-Hexylidene-4-methylenedihydrofuran-2,5-dione 40")

nmr_para1 = ("1H NMR (300 MHz, CDCl3), 8.00 (d, J = 9.5 Hz, H-4 or H-5), "
            "7.87 (d, J = 8.0 Hz, NH), 7.61 ( d, J = 8.5Hz, 2H of C6H4CN), "
            "7.56 (d, J = 9.0Hz, 2H of C6H4CF3);")

nmr_para2 = ("1H NMR (CDCl3 with 0.05% v/v TMS, 400 MHz): "
            "δH 7.10 (2H, d, J = 8.9 Hz, H2′ and H6′), "
            "7.03-7.07 (3H, m, H3′′, H4′′ and H5′′), "
            "6.83-6.85 (2H, m, H2′′ and H6′′), 6.66 (2H, d, J = 8.9 Hz, H3′ and H5′), "
            "6.42 (1H, d, J = 1.8 Hz, H5), 6.26 (1H, d, J = 1.7 Hz, H7), "
            "5.18 (1H, s, H1′′′), 5.01 (1H, d, J = 6.6 Hz, H1), "
            "4.52 (1H, s, H2′′′), 4.27 (1H, d, J = 14.2 Hz, H3), "
            "4.15 (1H, br d, J = 11.2 Hz, H4′′′), "
            "4.05 (1H, t, J = 11.2 Hz, H3b′′′), "
            "3.88 (1H, J = 14.3, 6.8 Hz, H2), "
            "3.86 (3H, s, OCH38), 3.69 (3H, s, OCH34′), "
            "3.64 (3H, s, COOCH32), 3.49 (3H, br s, H5′′′ and H6′′′), "
            "3.43-3.47 (1H, overlapped, H3a′′′), 3.45 (3H, s, OCH32′′′).")

doc = Document(Heading(nmr_heading), 
               Paragraph(nmr_para1), 
               Paragraph(nmr_para2),
               models=[Compound, NmrSpectrum])

nmr_records = doc.records.serialize()

def main():
    for entry in nmr_records: 
        print(entry)

if __name__ == '__main__':
    main()

As described, the output consists only of records from the Compound model:

{'Compound': {'names': ['(E)-3-Hexylidene-4-methylenedihydrofuran-2,5-dione']}}
{'Compound': {'names': ['CDCl3']}}
{'Compound': {'names': ['H']}}
{'Compound': {'names': ['NH']}}
{'Compound': {'names': ['C6H4CN']}}
{'Compound': {'names': ['C6H4CF3']}}
{'Compound': {'names': ['TMS']}}
{'Compound': {'names': ['H2 ′']}}
{'Compound': {'names': ['H6']}}
{'Compound': {'names': ['3H']}}
{'Compound': {'names': ['H3']}}
{'Compound': {'names': ['H4 ′']}}
{'Compound': {'names': ['H5']}}
{'Compound': {'names': ['H1']}}
{'Compound': {'names': ['H2']}}
{'Compound': {'names': ['OCH38']}}
{'Compound': {'names': ['OCH34']}}
{'Compound': {'names': ['COOCH32']}}
{'Compound': {'names': ['OCH32']}}
{'Compound': {'labels': ['v']}}
{'Compound': {'labels': ['3H']}}

Is this the correct way to call the NmrSpectrum model? I am using a MacBook with CDE 2.1.2, which installed without any problems.
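
To check quickly whether any NmrSpectrum record came back at all, it may help to group the serialized output by model name. This is only a minimal sketch, assuming the records have the shape printed above (a list of single-key dicts); `group_by_model` is a hypothetical helper and not part of ChemDataExtractor.

```python
from collections import defaultdict


def group_by_model(serialized_records):
    """Group serialized records by their top-level model name.

    Hypothetical helper: assumes each record is a dict whose single key
    is the model name, as in the output printed above.
    """
    grouped = defaultdict(list)
    for record in serialized_records:
        for model_name, fields in record.items():
            grouped[model_name].append(fields)
    return dict(grouped)


# A few records copied from the output above
sample = [
    {'Compound': {'names': ['CDCl3']}},
    {'Compound': {'names': ['TMS']}},
    {'Compound': {'labels': ['v']}},
]

grouped = group_by_model(sample)
print(sorted(grouped))           # model names that produced records
print('NmrSpectrum' in grouped)  # False here: no NMR record was returned
```

With the full output above, the only key would be 'Compound', which matches the symptom described.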

OBrink commented 1 year ago

@ti250, do you have any idea why the NMR parser is not working here? The examples in the code from @ViktorWeissenborn are taken from https://github.com/CambridgeMolecularEngineering/chemdataextractor2/blob/master/tests/test_parse_nmr.py.

@maoliyun, have you found a solution?

OBrink commented 1 year ago

Just dumping information here to narrow down the problem:

When running the parser manually, everything seems to work fine:

from chemdataextractor.doc.text import Sentence
from chemdataextractor.parse.nmr import nmr
from lxml import etree

nmr_para2 = ("1H NMR (CDCl3 with 0.05% v/v TMS, 400 MHz): "
            "δH 7.10 (2H, d, J = 8.9 Hz, H2′ and H6′), "
            "7.03-7.07 (3H, m, H3′′, H4′′ and H5′′), "
            "6.83-6.85 (2H, m, H2′′ and H6′′), 6.66 (2H, d, J = 8.9 Hz, H3′ and H5′), "
            "6.42 (1H, d, J = 1.8 Hz, H5), 6.26 (1H, d, J = 1.7 Hz, H7), "
            "5.18 (1H, s, H1′′′), 5.01 (1H, d, J = 6.6 Hz, H1), "
            "4.52 (1H, s, H2′′′), 4.27 (1H, d, J = 14.2 Hz, H3), "
            "4.15 (1H, br d, J = 11.2 Hz, H4′′′), "
            "4.05 (1H, t, J = 11.2 Hz, H3b′′′), "
            "3.88 (1H, J = 14.3, 6.8 Hz, H2), "
            "3.86 (3H, s, OCH38), 3.69 (3H, s, OCH34′), "
            "3.64 (3H, s, COOCH32), 3.49 (3H, br s, H5′′′ and H6′′′), "
            "3.43-3.47 (1H, overlapped, H3a′′′), 3.45 (3H, s, OCH32′′′).")

s = Sentence(nmr_para2)
result = next(nmr.scan(s.tokens))[0]
etree.tostring(result, encoding='unicode')

-->

'<nmr><nucleus>1H</nucleus><solvent>CDCl3 with 0.05 % v/v TMS</solvent><frequency><value>400</value><units>MHz</units></frequency><peaks><peak><shift>7.10</shift><number>2H</number><multiplicity>d</multiplicity><coupling><value>8.9</value><units>Hz</units></coupling><assignment>H2 ′</assignment><assignment>H6 ′</assignment></peak><peak><shift>7.03-7.07</shift><number>3H</number><multiplicity>m</multiplicity><assignment>H3 ′ ′</assignment><assignment>H4 ′ ′</assignment><assignment>H5 ′ ′</assignment></peak><peak><shift>6.83-6.85</shift><number>2H</number><multiplicity>m</multiplicity><assignment>H2 ′ ′</assignment><assignment>H6 ′ ′</assignment></peak><peak><shift>6.66</shift><number>2H</number><multiplicity>d</multiplicity><coupling><value>8.9</value><units>Hz</units></coupling><assignment>H3 ′</assignment><assignment>H5 ′</assignment></peak><peak><shift>6.42</shift><number>1H</number><multiplicity>d</multiplicity><coupling><value>1.8</value><units>Hz</units></coupling><assignment>H5</assignment></peak><peak><shift>6.26</shift><number>1H</number><multiplicity>d</multiplicity><coupling><value>1.7</value><units>Hz</units></coupling><assignment>H7</assignment></peak><peak><shift>5.18</shift><number>1H</number><multiplicity>s</multiplicity><assignment>H1 ′ ′ ′</assignment></peak><peak><shift>5.01</shift><number>1H</number><multiplicity>d</multiplicity><coupling><value>6.6</value><units>Hz</units></coupling><assignment>H1</assignment></peak><peak><shift>4.52</shift><number>1H</number><multiplicity>s</multiplicity><assignment>H2 ′ ′ ′</assignment></peak><peak><shift>4.27</shift><number>1H</number><multiplicity>d</multiplicity><coupling><value>14.2</value><units>Hz</units></coupling><assignment>H3</assignment></peak><peak><shift>4.15</shift><number>1H</number><multiplicity>br d</multiplicity><coupling><value>11.2</value><units>Hz</units></coupling><assignment>H4 ′ ′ ′</assignment></peak><peak><shift>4.05</shift><number>1H</number><multiplicity>t</multiplicity><coupling><value>11.2</value><units>Hz</units></coupling><assignment>H3b ′ ′ ′</assignment></peak><peak><shift>3.88</shift><number>1H</number><coupling><value>14.3 , 6.8</value><units>Hz</units></coupling><assignment>H2</assignment></peak><peak><shift>3.86</shift><number>3H</number><multiplicity>s</multiplicity><assignment>OCH38</assignment></peak><peak><shift>3.69</shift><number>3H</number><multiplicity>s</multiplicity><assignment>OCH34 ′</assignment></peak><peak><shift>3.64</shift><number>3H</number><multiplicity>s</multiplicity><assignment>COOCH32</assignment></peak><peak><shift>3.49</shift><number>3H</number><multiplicity>br s</multiplicity><assignment>H5 ′ ′ ′</assignment><assignment>H6 ′ ′ ′</assignment></peak><peak><shift>3.43-3.47</shift><number>1H</number><note>overlapped</note><assignment>H3a ′ ′ ′</assignment></peak><peak><shift>3.45</shift><number>3H</number><multiplicity>s</multiplicity><assignment>OCH32 ′ ′ ′</assignment></peak></peaks></nmr>'
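
To make this easier to compare with what doc.records.serialize() returns, the XML string can be flattened into a plain list of peaks. The following is a sketch using only the standard library. It assumes the tag names are the ones visible in the output above; `peaks_from_xml` is a hypothetical helper, and the sample string is a shortened copy of two peaks from that output.

```python
import xml.etree.ElementTree as ET


def peaks_from_xml(xml_string):
    """Turn the parser's serialized <nmr> element into a list of peak dicts.

    Hypothetical helper: tag names are taken from the output shown above.
    Fields missing from a peak (e.g. no multiplicity) come back as None.
    """
    root = ET.fromstring(xml_string)
    peaks = []
    for peak in root.iter('peak'):
        peaks.append({
            'shift': peak.findtext('shift'),
            'number': peak.findtext('number'),
            'multiplicity': peak.findtext('multiplicity'),
            'assignments': [a.text for a in peak.findall('assignment')],
        })
    return peaks


# Shortened copy of the output above (two of the nineteen peaks)
sample_xml = (
    '<nmr><nucleus>1H</nucleus><peaks>'
    '<peak><shift>6.42</shift><number>1H</number><multiplicity>d</multiplicity>'
    '<coupling><value>1.8</value><units>Hz</units></coupling>'
    '<assignment>H5</assignment></peak>'
    '<peak><shift>3.88</shift><number>1H</number>'
    '<coupling><value>14.3 , 6.8</value><units>Hz</units></coupling>'
    '<assignment>H2</assignment></peak>'
    '</peaks></nmr>'
)

peaks = peaks_from_xml(sample_xml)
for p in peaks:
    print(p)
```

Running the same helper on the full string from etree.tostring(result, encoding='unicode') gives one dict per peak, so the number of peaks found by the parser can be checked with len().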

OBrink commented 1 year ago

Dumping more information to narrow down the problem further. This is not a general problem with all spectrum parsers. The UV-vis parser works fine:

from chemdataextractor.doc import Document, Heading, Paragraph
from chemdataextractor.model import UvvisSpectrum

doc = Document(Paragraph('λabs/nm 320, 380, 475, 529;'), models = [UvvisSpectrum])
doc.records.serialize()

-->

[{'UvvisSpectrum': {'peaks': [{'UvvisPeak': {'value': '320', 'units': 'nm'}},
    {'UvvisPeak': {'value': '380', 'units': 'nm'}},
    {'UvvisPeak': {'value': '475', 'units': 'nm'}},
    {'UvvisPeak': {'value': '529', 'units': 'nm'}}],
   'compound': {'Compound': {}}}}]

OBrink commented 1 year ago

I found the solution for the problem where no NMR data was returned at all by doc.records.serialize (#38).

Unfortunately, this does not solve @maoliyun's original problem. When I run his code snippet, I can now reproduce that only the first peak is parsed.
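
One way to make the symptom visible is to pull the shifts out of the serialized NmrSpectrum records. This is a sketch that assumes the record shape shown in @maoliyun's output; `parsed_shifts` is a hypothetical helper, not a ChemDataExtractor function.

```python
def parsed_shifts(serialized_records):
    """Collect the 'shift' of every NmrPeak in the serialized records.

    Hypothetical helper: assumes the nested dict layout shown in the
    output of doc[1].records.serialize() earlier in this thread.
    """
    shifts = []
    for record in serialized_records:
        spectrum = record.get('NmrSpectrum')
        if spectrum is None:
            continue  # skip Compound and other non-NMR records
        for peak in spectrum.get('peaks', []):
            shifts.append(peak['NmrPeak'].get('shift'))
    return shifts


# Records copied (shortened) from the output reported in the first comment
records = [
    {'Compound': {'names': ['CDCl3']}},
    {'NmrSpectrum': {'nucleus': '1H', 'solvent': 'CDCl3', 'frequency': '300',
                     'frequency_units': 'MHz',
                     'peaks': [{'NmrPeak': {'shift': '8.00'}}],
                     'compound': {'Compound': {}}}},
]

print(parsed_shifts(records))  # ['8.00']
```

The paragraph in question lists four peaks (8.00, 7.87, 7.61 and 7.56), so a single shift here means the other three were dropped.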