CambridgeMolecularEngineering / chemdataextractor2

ChemDataExtractor Version 2.0

Extracting specific reasonable set of information #62

Open OPPOSITEFOOLS opened 1 month ago

OPPOSITEFOOLS commented 1 month ago

Hello,

I am currently working on a project to extract catalyst information, including the catalyst's name, mass, and temperature, as well as the performance of the experiment that uses the catalyst. I have checked the documentation, and I understand how to use ChemDataExtractor to extract values that are close to the words I specify.

There are two main problems. The first concerns properties such as catalyst mass, as in this passage:

"Sr-La2O3 nanofibers were synthesized by incipient wetness impregnation. In a typical procedure, different amounts of Sr(NO3)2 were dissolved in nanopure water, and then added in a drop-wise manner to a certain amount of La(OH)3 nanofibers prepared above until a paste was obtained. The samples were then dried at 65 °C and calcinated at 800 °C for 4 h to obtain the Sr-La2O3 nanofibers. K-La2O3, Ba-La2O3, and Ca-La2O3 nanofibers were prepared by similar method with Sr-La2O3 nanofibers. The catalytic measurements were carried out in a laboratory quartz fixed bed reactor (9 mm inner diameter, 14 mm outer diameter, 50 cm length). Each catalyst was pressed under 30 MPa, then crushed and sieved to a size range of 40-80 mesh. For loading the reactor, 0.2 g of the sieved catalyst was mixed with 0.8 g quartz sand, and then placed in the quartz reactor tube between two pieces of quartz wool."

In this example, the actual catalyst mass value is close to the word "catalyst", but the name of the catalyst is much further away from the value. How should the compound name be associated with the value when the two are far apart?

Second, another sentence says 'The catalyst powder (150 mg) was heated a He flow from 60 to 800 °C at a heating rate of 10 °C /min and keep for 60 min at 800 °C'. This part is not important for what I am trying to do, but the parser would also take it as a mass. Is there any way to determine which one is useful? Should I add more regex requirements?

I think my requirements are closer to a language-model/ML task. Will the Snowball approach mentioned in the documentation work?

Dingyun-Huang commented 1 month ago

> Is there any way to determine which one is useful?

This is a really subjective condition. You will have to define explicitly what "useful" means for you before you can work out a solution.
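As an illustration of making "useful" explicit, here is a minimal sketch in plain Python (standard library only, not the ChemDataExtractor API). It assumes that a useful mass is one mentioned in a reactor-loading sentence and followed shortly by the word "catalyst", and that sentences about heating ramps should be skipped. The cue phrases, the 30-character window, and the function name `reactor_catalyst_masses` are all hypothetical choices that would need tuning on a real corpus.

```python
import re

# Number followed by a mass unit, e.g. "0.2 g" or "150 mg".
MASS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|kg|g)\b")

# Hypothetical cue phrases; adjust them to your own corpus.
KEEP_CUES = ("loading the reactor", "fixed bed")
DROP_CUES = ("heating rate", "temperature-programmed")


def reactor_catalyst_masses(sentences, window=30):
    """Return (value, unit) pairs for masses that look like reactor loadings."""
    found = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Skip sentences that describe a characterization/heating step.
        if any(cue in lowered for cue in DROP_CUES):
            continue
        # Keep only sentences that mention a reactor-loading cue.
        if not any(cue in lowered for cue in KEEP_CUES):
            continue
        for match in MASS_PATTERN.finditer(sentence):
            # Require the word "catalyst" shortly after the value.
            following = sentence[match.end():match.end() + window].lower()
            if "catalyst" in following:
                found.append((float(match.group(1)), match.group(2)))
    return found


sentences = [
    "For loading the reactor, 0.2 g of the sieved catalyst was mixed "
    "with 0.8 g quartz sand.",
    "The catalyst powder (150 mg) was heated a He flow from 60 to 800 °C "
    "at a heating rate of 10 °C /min.",
]
print(reactor_catalyst_masses(sentences))  # prints [(0.2, 'g')]
```

Rules like these are brittle, so they are best treated as a first filter whose criteria you write down and then refine.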

> In this example, the actual catalyst mass value is close to the word "catalyst", but the name of the catalyst is much further away from the value. How should the compound name be associated with the value when the two are far apart?

Take a look at the multi-turn Q&A method described here: https://www.nature.com/articles/s41597-023-02511-6. Alternatively, block the unwanted chemical names using a blocklisting method: https://www.nature.com/articles/s41597-023-02897-3.
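To show the general idea of blocklisting (this is a rough sketch, not the implementation from the linked paper and not the ChemDataExtractor API), the snippet below filters a list of already-extracted records against a set of unwanted names. The record layout, the `drop_blocked` helper, the blocklist entries, and the example records are all made up for illustration.

```python
# Hypothetical blocklist of names that are not catalysts in this corpus.
BLOCKLIST = {"quartz sand", "quartz wool", "nanopure water"}


def drop_blocked(records, blocklist=BLOCKLIST):
    """Keep only records whose compound name is not on the blocklist."""
    return [
        record
        for record in records
        if record["compound"].strip().lower() not in blocklist
    ]


# Made-up records in an assumed {"compound", "mass", "unit"} layout.
records = [
    {"compound": "Sr-La2O3", "mass": 0.2, "unit": "g"},
    {"compound": "quartz sand", "mass": 0.8, "unit": "g"},
]
kept = drop_blocked(records)
print([record["compound"] for record in kept])  # prints ['Sr-La2O3']
```

Matching here is exact after lowercasing, so name variants (for example "quartz-sand") would each need their own blocklist entry or a looser comparison.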