CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

Implementation of an Abstract Class #46

Closed (ViktorWeissenborn closed this 2 months ago)

ViktorWeissenborn commented 8 months ago

Hello (:

In ChemDataExtractor there are different document classes like the Title class, Heading class, Paragraph class and so on.

For me it would be very handy to also have an "Abstract" class that gives me the abstract of an article as easily as the Heading class gives me a heading and the Title class gives me the title. Currently the abstract of an article ends up in a Paragraph object and is therefore hard to identify as the abstract: it is often unclear whether the text of a Paragraph object under doc.elements belongs to the abstract or to an ordinary paragraph elsewhere in the document. In Elsevier XML documents, for example, the abstract is clearly marked with its own XML tags.

Would there be an "easy" or "quick" way to implement an Abstract class into ChemDataExtractor?

If so, let me know. I would be happy to take care of it myself, but I am not really sure where to start or how many dependent classes, functions and variables would need to be changed...

kind regards Viktor

Dingyun-Huang commented 8 months ago

Hi Viktor,

Unfortunately, there is no easy way to implement an Abstract class comprehensively, because the structures of raw documents from each publisher differ a lot.

But Wiley, Springer Nature, and Elsevier (and maybe others) each have their own API for retrieving the abstracts and metadata of their papers. So if you want the abstract, you can grab the DOIs using chemdataextractor and then use the publisher's API to retrieve it.
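A minimal sketch of the first step, assuming you already have the article text as a plain string. It uses an ordinary regular expression rather than chemdataextractor's own metadata handling, and the function name and example DOIs are hypothetical. The retrieval call is left out on purpose, since each publisher's API has its own endpoint and key requirements.

```python
import re

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is a common heuristic, not a full DOI grammar.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return the unique DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Sentence punctuation directly after a DOI gets matched too; drop it.
        doi = match.group(0).rstrip(".,;:")
        if doi not in found:
            found.append(doi)
    return found


# Hypothetical example DOIs, for illustration only.
sample = (
    "See https://doi.org/10.1016/j.example.2020.01.001, "
    "and doi:10.1002/example.201900001."
)
print(find_dois(sample))
```

Each DOI found this way can then be passed to whichever publisher API you have access to.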

Dingyun

ViktorWeissenborn commented 8 months ago

Ah okay, that makes sense. But let's say I only want to implement an Abstract class for Elsevier documents; would this still be a problem?

greetings Viktor

Dingyun-Huang commented 3 months ago

For Elsevier, yes, that is doable! Elsevier XMLs have a distinct XML tag for abstracts and graphical abstracts. The files you'd want to change are reader.elsevier and scrape.pub.elsevier.
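To get a feel for what the reader would need to pick up, here is a rough standalone sketch that pulls abstract text out of an Elsevier-style XML string with the Python standard library. It does not touch chemdataextractor itself. The sample document, the namespace URI, and the element names (abstract, simple-para) are assumptions for illustration, so check them against your own files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for an Elsevier full-text XML file.
# The namespace URI and element names are assumptions; verify on real files.
SAMPLE_XML = """<article xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:title>Example title</ce:title>
  <ce:abstract>
    <ce:section-title>Abstract</ce:section-title>
    <ce:abstract-sec>
      <ce:simple-para>First abstract sentence.</ce:simple-para>
      <ce:simple-para>Second abstract sentence.</ce:simple-para>
    </ce:abstract-sec>
  </ce:abstract>
  <ce:para>Body text that is not part of the abstract.</ce:para>
</article>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_abstracts(xml_text, abstract_tag="abstract", para_tag="simple-para"):
    """Return one whitespace-normalised string per abstract element found."""
    root = ET.fromstring(xml_text)
    abstracts = []
    for element in root.iter():
        if local_name(element.tag) != abstract_tag:
            continue
        # Collect only the paragraph elements nested inside this abstract.
        paragraphs = [
            " ".join("".join(p.itertext()).split())
            for p in element.iter()
            if local_name(p.tag) == para_tag
        ]
        abstracts.append(" ".join(paragraphs))
    return abstracts


print(extract_abstracts(SAMPLE_XML))
```

Matching on the local tag name keeps the sketch independent of the exact namespace prefix. An Abstract class inside chemdataextractor would need the equivalent selection logic in the Elsevier reader.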