CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

Missing keywords under doc.elements in newest version 2.2.1 #44

Closed ViktorWeissenborn closed 8 months ago

ViktorWeissenborn commented 11 months ago

Hello (:

I gave the new version a brief try by using doc.elements on the following Elsevier publication:

https://doi.org/10.1016/j.chemosphere.2014.05.068

In a direct comparison of versions 2.1.2 and 2.2.1, the newest version seems to produce a cleaner text extraction, but in my example it did not extract "Keywords" as a Heading object or the corresponding keywords as Paragraph objects. Being able to extract the keywords from the document was quite useful.

I am not sure whether this is a bug or intended behaviour.

Below is part of the doc.elements output from both versions for comparison:

Version 2.1.2:

...
Paragraph(id='tm005', references=[], text='QSAR models for ...'),
Paragraph(id='ms005', references=[], text='Handling Editor: I. Cousins'),
Paragraph(id='sp0010', references=[], text='Although some researches ...')

Heading(id='st050', references=[], text='Keywords'),    <-- "KEYWORDS" IN HEADING OBJECT
Paragraph(id='k0005', references=[], text='Ozonation process'),    <-- KEYWORD IN PARAGRAPH OBJECT
Paragraph(id='k0010', references=[], text='Organic pollutants'),
Paragraph(id='k0015', references=[], text='QSAR'),
Paragraph(id='k0020', references=[], text='Fukui indices'),
Paragraph(id='k0025', references=[], text='Quantum chemistry'),
Paragraph(id='k0030', references=[], text='Reaction pathway'),
Paragraph(id='s0005', references=[], text='1'),

Heading(id='st005', references=[], text='Introduction'),
Paragraph(id='p0005', references=['De Witte et al., 2010; Krasner et al., 2013', 'Ning and Graham, 2008; Tachibana et al., 2011'], text='With the development of modern industry, a variety ...'),
...

Version 2.2.1:

Paragraph(id='tm005', references=[], text='QSAR models for ...'),
Paragraph(id='ms005', references=[], text='Handling Editor: I. Cousins'),
Paragraph(id='sp0010', references=[], text='Although some researches ...'),
<-- MISSING KEYWORDS
Heading(id='st005', references=[], text='Introduction'),
Paragraph(id='p0005', references=[], text='With the development of modern industry, a variety ...'),
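For anyone who wants to check several documents for the same regression, scanning the element list for a "Keywords" heading and collecting the paragraphs after it may be enough. The sketch below is only an illustration, not part of ChemDataExtractor: Heading and Paragraph are small stand-in classes so the snippet runs without the library, and keywords_from_elements is a hypothetical helper name. With the real library the list would come from doc.elements.

```python
from dataclasses import dataclass


# Stand-ins for chemdataextractor's Heading and Paragraph elements, so the
# sketch runs without the library installed. Only `.text` is used.
@dataclass
class Heading:
    text: str


@dataclass
class Paragraph:
    text: str


def keywords_from_elements(elements):
    """Return the texts of the paragraphs that follow a 'Keywords' heading.

    Hypothetical helper: elements are matched by class name and `.text`.
    Collection stops at the next heading; the result is empty if there is
    no 'Keywords' heading at all.
    """
    keywords = []
    collecting = False
    for el in elements:
        kind = type(el).__name__
        if kind == "Heading":
            if collecting:
                break  # next heading reached, keyword block is over
            collecting = el.text.strip().lower() == "keywords"
        elif collecting and kind == "Paragraph":
            keywords.append(el.text)
    return keywords


# Shortened element lists modelled on the two outputs above.
v212 = [
    Paragraph("Handling Editor: I. Cousins"),
    Heading("Keywords"),
    Paragraph("Ozonation process"),
    Paragraph("QSAR"),
    Heading("Introduction"),
    Paragraph("With the development of modern industry, a variety ..."),
]
v221 = [
    Paragraph("Handling Editor: I. Cousins"),
    Heading("Introduction"),
    Paragraph("With the development of modern industry, a variety ..."),
]

print(keywords_from_elements(v212))
print(keywords_from_elements(v221))
```

Note that in the full 2.1.2 output the section number '1' (id 's0005') also sits between the last keyword and the "Introduction" heading, so a real script would probably need an extra filter, for example on the element id.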

Dingyun-Huang commented 10 months ago

Hi Viktor,

The behaviour change is probably because v2.2.1 ignores elements matching the CSS selector "ce|keywords", which is listed in chemdataextractor.reader.elsevier.ElsevierXmlReader.ignore_css. I will double-check with the team whether this is intended and what the reason is.
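As a rough illustration of what undoing this could look like, the sketch below removes one selector from a comma-separated selector string. It assumes that ignore_css is such a plain string, which is not confirmed in this thread; drop_selector is a hypothetical helper, and the example value uses made-up placeholder selectors around "ce|keywords" rather than the real v2.2.1 contents.

```python
def drop_selector(ignore_css, selector):
    """Return ignore_css with one selector removed from its comma-separated list.

    Assumes selectors are separated by top-level commas only; a selector that
    itself contains a comma (e.g. inside :not(...)) would be split wrongly.
    """
    parts = [p.strip() for p in ignore_css.split(",")]
    return ", ".join(p for p in parts if p and p != selector)


# Placeholder value for illustration; NOT the real ignore_css from v2.2.1.
example_ignore_css = "ns|example-a, ce|keywords, ns|example-b"
print(drop_selector(example_ignore_css, "ce|keywords"))
```

If the assumption holds, a subclass of ElsevierXmlReader could set its ignore_css to the shortened string and be passed to Document.from_file through the readers argument. That is untested here, and the actual change may involve more than this one attribute.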

ViktorWeissenborn commented 10 months ago

Hey Dingyun!

That would be great! Thanks for your efforts! I would really appreciate it if you could let me know once you have checked whether this is intended. I'm looking forward to hearing from you (:

Dingyun-Huang commented 10 months ago

Hi Viktor,

I haven't had a response yet, but there is a way to work around it with a one-line change in the installed package. Could you tell me how you installed chemdataextractor2, please? (conda, pip, a virtual environment, git clone, etc.)

ViktorWeissenborn commented 8 months ago

Hey Dingyun!

Sorry for the late response. I installed CDE2 via conda. Thanks for creating the parse-keywords-elsevier branch, this has already helped me a lot!

Dingyun-Huang commented 8 months ago

Hi Viktor,

Glad to hear that the workaround worked for you. Yes, the quick hack is to go to the site-packages directory of your conda environment, find chemdataextractor, and apply the changes from the parse-keywords-elsevier branch. If you're happy with the solution, I will close this issue. Hopefully, we can merge this branch in the near future.

Dingyun
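For readers unsure where the conda environment keeps the package, the following sketch prints the directory of an installed package using only the standard library. package_dir is a hypothetical helper name; the import name chemdataextractor is taken from the comment above.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory containing an installed package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


# Run inside the activated conda environment; prints None if the package is absent.
print(package_dir("chemdataextractor"))
```

Edits made directly in site-packages are overwritten when the package is reinstalled or updated, so it is worth keeping a note of what was changed.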

ViktorWeissenborn commented 8 months ago

Hey Dingyun,

Yes, I am very happy with the changes. You can close the issue if you want. Thanks again! (:

greetings Viktor