CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

abstract.records[i].serialize() #56

Closed loilisxka closed 3 months ago

loilisxka commented 3 months ago

Hi, I'm trying to use CDE to extract compound names from the literature. However, when I use the Compound model, the program fails to extract names such as "low-density polyethylene". Everything else works as expected. I would like to cover these names by modifying the source code. Where should I modify it? Looking forward to your reply.

Dingyun-Huang commented 3 months ago

Could you please post your script, the console output, and the behaviour you expect from the code?

loilisxka commented 3 months ago

OK. My script file is in the attachment. The target text is the abstract of a paper. The script captures some chemical entities but misses others. We want it to capture all of them, such as low-density polyethylene.

Dingyun-Huang commented 3 months ago

Would you mind posting these on github.com? Attached files do not go through when you reply to the GitHub notification email.

You can use a code snippet like this:

from chemdataextractor.doc import Document

Or you can directly attach your file.

loilisxka commented 3 months ago

Okay, sorry about that. I'll re-upload my code:

from chemdataextractor import Document
from chemdataextractor.parse import R, I, W, Optional, merge
from chemdataextractor.reader import ElsevierXmlReader, HtmlReader, PlainTextReader
from chemdataextractor.model.model import Compound
from chemdataextractor.doc import Paragraph

reader = PlainTextReader()

abstract = Document("Molded seal devices made of crystalline polymers are widely used in high-pressure hydrogen equipment. \
A method for evaluating high-pressure hydrogen permeability was recently reported; however, the evaluation \
cost is extremely high. To select suitable crystalline polymers for molded hydrogen seals or barrier devices, \
a high-pressure hydrogen permeability prediction method using the polymer structure and its conven- tional \
properties is required. In this study, we measured the pressure dependency of the hydrogen permeability of \
lowd-density polyethylene (LDPE), high-density polyethylene (HDPE), and polyamide11 (PA11). We constructed \
the permeation model for crystalline polymers in terms of the tortuosity induced by their higher-order \
structures and free volume change in the amor- phous region evaluated using PVT method for measuring the \
relationship between pres- sure (p), specific volume (v) and temperature (T) in the molten-solid state of a \
polymer. The results of the pressure dependency of hydrogen permeability were reproduced by the developed \
permeation model.")

abstract.models = [Compound]

print(abstract.cems[0])
print("The keywords of Abstract:")
print(abstract.records)
for i in range(len(abstract.records)):
    print(abstract.records[i].serialize())

I would like to know how to modify my script or the source code so that all chemical entities are identified. If the source code needs to be modified, which part should I change?

Dingyun-Huang commented 3 months ago

In short, you can modify the parsing phrases in chemdataextractor.parse.cem_factory to include adjectives like 'low density' in your text.

loilisxka commented 3 months ago

Thanks. I would still like to know more about cem_factory. Which parameters are used to add adjectives, and how does it work? Which stage's output is passed to cem_factory for processing?

Dingyun-Huang commented 3 months ago

A parser object has a root property that builds its set of parsing rules, whose elements come from cem_factory; see, e.g., line 199 in chemdataextractor.parse.cem:

@property
def root(self):

You can add new parsing rules to the root phrase. For instance, build something like W('low') + W('density') + original_rule and add it to the expression returned by root.
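
To make the idea concrete, here is a minimal sketch that does not depend on the library. It is a toy stand-in, not ChemDataExtractor's API: `word`, `seq`, `either`, `optional`, and `scan` are hypothetical helpers that only mimic the role of parse elements such as W, `+`, `|`, and Optional, and `original_rule` is a placeholder for whatever rule root already returns. The token list assumes that "low-density" is split into three tokens; check what the tokenizer actually produces for your text.

```python
from typing import Callable, List, Optional

# A rule takes (tokens, start index) and returns the end index of a match, or None.
Rule = Callable[[List[str], int], Optional[int]]


def word(text: str) -> Rule:
    """Match a single token, ignoring case (hypothetical analogue of W/I)."""
    def match(tokens: List[str], i: int) -> Optional[int]:
        if i < len(tokens) and tokens[i].lower() == text.lower():
            return i + 1
        return None
    return match


def seq(*rules: Rule) -> Rule:
    """Match all rules one after another (hypothetical analogue of `+`)."""
    def match(tokens: List[str], i: int) -> Optional[int]:
        for rule in rules:
            i = rule(tokens, i)
            if i is None:
                return None
        return i
    return match


def either(*rules: Rule) -> Rule:
    """Return the first alternative that matches (hypothetical analogue of `|`)."""
    def match(tokens: List[str], i: int) -> Optional[int]:
        for rule in rules:
            end = rule(tokens, i)
            if end is not None:
                return end
        return None
    return match


def optional(rule: Rule) -> Rule:
    """Match the rule if possible, otherwise match nothing."""
    def match(tokens: List[str], i: int) -> Optional[int]:
        end = rule(tokens, i)
        return i if end is None else end
    return match


def scan(rule: Rule, tokens: List[str]) -> List[str]:
    """Collect non-overlapping matches of the rule, scanning left to right."""
    found, i = [], 0
    while i < len(tokens):
        end = rule(tokens, i)
        if end is not None and end > i:
            found.append(" ".join(tokens[i:end]))
            i = end
        else:
            i += 1
    return found


# Placeholder for the rule that root already returns.
original_rule = either(word("polyethylene"), word("polyamide11"))

# New rule: density adjective, optional hyphen token, 'density', then the original rule.
density_prefix = seq(either(word("low"), word("high")), optional(word("-")), word("density"))
extended_rule = seq(density_prefix, original_rule)

# The longer alternative goes first so it wins over the bare name.
root = either(extended_rule, original_rule)

tokens = ("low - density polyethylene ( LDPE ) , high - density polyethylene "
          "( HDPE ) , and polyamide11 ( PA11 )").split()
matches = scan(root, tokens)
print(matches)
```

With the placeholder rule alone, only the bare names are matched; with the extended rule placed first in root, the density adjective becomes part of the match. In the real parser the same ordering concern may apply, so it is worth testing the new rule both before and after the existing one.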

loilisxka commented 3 months ago

Thank you very much, this helps me a lot. The code successfully captured the chemical entity.