CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

Subscript of formulas. #5

Closed tiagobotari closed 5 years ago

tiagobotari commented 6 years ago

Is it necessary to keep information about subscript or superscript?

zhugeyicixin commented 6 years ago

I think it is not necessary in most cases. But there are a few situations where subscript/superscript information would be helpful. For example, "(0.70−x)BiFeO3–0.30BaTiO3–xBi(Zn0.5Ti0.5)O3" is (0.70−x) "BiFeO3", 0.30 "BaTiO3" and x "Bi(Zn0.5Ti0.5)O3", while "4−x" in "SrCaBi4−xErxTi5O18" is the amount of the element Bi. Do you think it is possible to keep using plain text but also save the information about subscripts or superscripts as a reference? If some complex cases are going to be dealt with, Olga might add this information to assist the materials parsing?

tiagobotari commented 6 years ago

Good point, Tanjin. I think it is not difficult to include a '_' for the subscript. Thanks.

OlgaGKononova commented 6 years ago

I agree with Tanjin: this will help to parse formulas, though I will have to rewrite the parser. You can use LaTeX notation: "_{}" for subscript and "^{}" for superscript. Don't forget to enclose whatever is sub-/superscripted in the braces. Thanks.
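As an illustration of the notation proposed above, here is a minimal sketch that rewrites HTML `<sub>`/`<sup>` tags as "_{}" and "^{}". The helper name `to_latex_scripts` is hypothetical and not part of LimeSoup, and the sketch assumes the tags carry no attributes and are not nested.

```python
import re

# Hypothetical helper, not part of LimeSoup. Assumes plain <sub>/<sup>
# tags without attributes and without nesting.
_SUB = re.compile(r"<sub>(.*?)</sub>", flags=re.IGNORECASE | re.DOTALL)
_SUP = re.compile(r"<sup>(.*?)</sup>", flags=re.IGNORECASE | re.DOTALL)


def to_latex_scripts(html_fragment: str) -> str:
    """Rewrite <sub>x</sub> as _{x} and <sup>x</sup> as ^{x}."""
    text = _SUB.sub(r"_{\1}", html_fragment)
    text = _SUP.sub(r"^{\1}", text)
    return text
```

With this sketch, a fragment such as `SrCaBi<sub>4-x</sub>Er<sub>x</sub>Ti<sub>5</sub>O<sub>18</sub>` would come out as `SrCaBi_{4-x}Er_{x}Ti_{5}O_{18}`, which keeps "4-x" visibly attached to Bi.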

vtshitoyan commented 6 years ago

I have a feeling we won't be able to make this universal for all material mentions. E.g. many papers would just write SiO2 without the subscript, especially in abstracts (or is this just my impression?). So even if it is helpful for the extra 5% of the cases, it might be worth considering the complications and inconsistencies it introduces.

shaunrong commented 6 years ago

Since everything is python processed already, why not just use pymatgen.Composition objects for all the formulas?

Note that it has de/serialization methods from_dict and as_dict that can conveniently interface with MongoDB.

For off-stoichiometry formulas, this solution might need some redevelopment effort on top of the existing Composition class. I'm not sure if Shyue has plans for it, but we can always ask him before heading in this direction.

OlgaGKononova commented 6 years ago

@shaunrong This is already done in MaterialParser. There is a method based on pymatgen.Composition to extract the composition from non-trivial formulas, where pymatgen.Composition fails. The only benefit of keeping sub-/superscript information in formulas is that it will help to distinguish a compound from a mixture, as Tanjin mentioned.

zhugeyicixin commented 6 years ago

I agree with Vahe that the mentions of materials in different papers don't follow a uniform format. So can we still use plain text to get the best consistency, while saving the sub-/superscripts as a supplementary field? Then we can still use our current parser, and only use the sub-/superscripts if we want to deal with those difficult cases. For example, "SiO2" written without a subscript -> {'text': 'SiO2', 'comment': None}, while "SiO2" written with a subscript 2 -> {'text': 'SiO2', 'comment': 'SiO_{2}'}.
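A minimal sketch of that supplementary-field idea, assuming the annotated form uses the "_{}" / "^{}" notation suggested earlier in the thread. The names `FormulaRecord` and `make_record` are hypothetical; only the `text` and `comment` keys come from the example above.

```python
import re
from typing import Optional, TypedDict


class FormulaRecord(TypedDict):
    text: str                # plain string, usable by the current parser
    comment: Optional[str]   # annotated form, or None if nothing was scripted


# Matches one non-nested _{...} or ^{...} marker.
_SCRIPT = re.compile(r"[_^]\{([^{}]*)\}")


def make_record(annotated: str) -> FormulaRecord:
    """Hypothetical helper: split an annotated formula into text + comment."""
    plain = _SCRIPT.sub(r"\1", annotated)
    return {"text": plain, "comment": annotated if plain != annotated else None}
```

Under these assumptions, `make_record("SiO2")` gives `{'text': 'SiO2', 'comment': None}` and `make_record("SiO_{2}")` gives `{'text': 'SiO2', 'comment': 'SiO_{2}'}`, so the plain `text` field stays consistent across papers.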

shaunrong commented 6 years ago

Perhaps I didn't express myself clearly enough, and my thoughts are pretty preliminary. I'm not even sure it's worth doing. But maybe it's food for thought:

My key point is that, whichever format we use to write superscripts/subscripts, storing materials this way still falls into the raw text category. Maybe it's a better idea to encapsulate materials in a richer data structure, whether we customize a Composition or use other more complicated classes (like @OlgaGKononova did in her MaterialParser), that can encode information beyond the raw formula string. Maybe we can attach a property map to it. And since Python essentially implements every class with a dictionary, it shouldn't be too difficult to de/serialize this class with MongoDB. It's similar to what Tanjin said in the previous comment: essentially he's expanding a raw string into richer encoded information contained in a dictionary, which can be treated as an object in Python code.

E.g. for the composite, the object stores the composite ratios and three pointers to three material objects, i.e. BiFeO3, BaTiO3, Bi(Zn0.5Ti0.5)O3. This approach may be beneficial because we are planning to associate every material with a property graph. A richer encoded material object can open the door for us to more conveniently associate that property graph with the natural-language context of a paragraph.
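One way such a richer object could look, as a rough sketch using only the standard library. The class names `Material`, `Component` and `Composite` and their fields are assumptions for illustration; they are not the MaterialParser or pymatgen API. Ratios are kept as strings because they may be symbolic, e.g. "0.70-x".

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Material:
    formula: str
    # Hypothetical property map attached to the material.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Component:
    ratio: str          # kept as text, since it may be symbolic
    material: Material


@dataclass
class Composite:
    components: List[Component]

    def as_dict(self) -> Dict[str, Any]:
        """Nested plain dict, suitable for storing as a document."""
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Composite":
        return cls(components=[
            Component(ratio=c["ratio"], material=Material(**c["material"]))
            for c in d["components"]
        ])


composite = Composite([
    Component("0.70-x", Material("BiFeO3")),
    Component("0.30", Material("BaTiO3")),
    Component("x", Material("Bi(Zn0.5Ti0.5)O3")),
])
```

The `as_dict` / `from_dict` pair mirrors the naming mentioned earlier in the thread, so a stored document could be turned back into an object without re-parsing the formula string.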

hhaoyan commented 5 years ago

This issue has been inactive for a long time. Currently, formulas are represented using a plain string. If new features are required, please reopen it or open a new issue.