CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

Feedback on the Springer Parser #20

Closed vtshitoyan closed 5 years ago

vtshitoyan commented 5 years ago

Here is some feedback from Tanjin who analyzed the results for the Springer Parser based on a few papers.

  1. Many blanks are inserted, especially around subscripts and superscripts. This makes it difficult to parse chemical formulas correctly. E.g.: "Pb(Zr x Ti 1− x )O 3"; "Pb 0.97 Nd 0.02 (Zr 0.55 Ti 0.45 )O 3 (PNZT)"; "ScTaO 4"; "Ar + ion"; "Mg 2 Ni"; "7.49 × 10 3 kg/m 3"; "1.5 J/cm 2"; "CuK α"; "k -space".

  2. Paragraphs in the same section are not separated. E.g.: the Introduction section of the paper 10.1007/s00339-013-8138-9.

  3. References are not removed.

  4. Some text is missing in sections that have sub-sections. E.g.: the Methods section is missing for the paper 10.1007/bf01142064.

  5. I am not sure whether we need to keep the formulas in the same format. E.g., some formulas start and end with "$$", while others use "\(" and "\)" as the boundary. Formula 1: $$ \sigma_{\text{wh}} = \sqrt { \sigma_{\text{sat}}^{2} - \left( {\sigma_{\text{sat}}^{2} - \sigma_{0}^{2} } \right)\exp ( - r(\varepsilon - \varepsilon_{0} ))} $$ Formula 2: \( {\dot{{\varepsilon }}} \)

I think we should at least address the first 4 points. Happy to discuss this further.
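The inserted blanks in point 1 can be reproduced with a small sketch. It assumes, without confirmation from the parser code, that the blanks come from joining text nodes with a space separator. The `TextCollector` class, the `extract_text` helper, and the HTML fragment are all hypothetical and use only the Python standard library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.chunks.append(data)


def extract_text(html, separator=""):
    """Join all text nodes of `html` with `separator` (hypothetical helper)."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return separator.join(collector.chunks)


# Hypothetical markup for Pb(ZrxTi1-x)O3; "\u2212" is the minus sign.
fragment = "Pb(Zr<sub><i>x</i></sub>Ti<sub>1\u2212<i>x</i></sub>)O<sub>3</sub>"

spaced = extract_text(fragment, separator=" ")  # blanks at every tag boundary
joined = extract_text(fragment)                 # no blanks inserted
print(spaced)
print(joined)
```

Joining with an empty separator keeps the formula contiguous, although the subscript markup itself is still lost in plain text.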

OlgaGKononova commented 5 years ago

On #5: usually, when LaTeX markup is embedded in HTML/XML text, it is accompanied by a tag containing the plain text. So the easiest way is to replace each LaTeX span with the string under the plain-text tag.
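A minimal sketch of that substitution, assuming a well-formed fragment in which each equation wrapper holds a LaTeX child and a plain-text child. The class names `equation`, `tex`, and `plain`, the `flatten` helper, and the sample fragment are hypothetical and are not Springer's actual markup.

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Return the text of `elem`, using the plain-text child of equation spans."""
    if elem.get("class") == "equation":
        plain = elem.find("span[@class='plain']")
        # Fall back to the full text if no plain-text alternative exists.
        source = plain if plain is not None else elem
        return "".join(source.itertext())
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical fragment: one equation with a LaTeX and a plain-text version.
fragment = (
    r'<p>The strain rate <span class="equation">'
    r'<span class="tex">\( {\dot{{\varepsilon }}} \)</span>'
    r'<span class="plain">epsilon_dot</span>'
    r'</span> was held constant.</p>'
)

text = flatten(ET.fromstring(fragment))
print(text)
```

The LaTeX source is dropped only where a plain-text alternative is present, so equations without one keep their original text.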

eddotman commented 5 years ago

Would it be possible to provide a couple of examples of the raw HTML/XML for the inserted blanks? That might make it easier and quicker to narrow down that issue.

shaunrong commented 5 years ago

Do we still have DOIs for issues 1, 3, and 5, so that proper unit tests can be developed for the fixes? @zhugeyicixin @vtshitoyan

zhugeyicixin commented 5 years ago

> Do we still have dois for issue 1, 3 and 5? So proper unit test can be developed for the fix. @zhugeyicixin @vtshitoyan

Yes. Here are some example DOIs.

For 1, "10.1007/s00339-013-8138-9" and "10.1007/bf02663182".
For 3, "10.1007/s00339-013-8138-9" and "10.1007/s10853-011-5258-5".
For 5, "10.1007/s10853-015-9171-1".

shaunrong commented 5 years ago

@zjensen262 please refer to these DOIs for the unit tests in #26.

IAmGrootel commented 5 years ago

@zhugeyicixin thanks for the list of DOIs; they were very helpful in isolating and resolving the issues. I think I have fixed the first four issues mentioned above, and I am going through some additional papers to verify.

For some reason we cannot download the HTML version of 10.1007/bf02663182. Could you send us the HTML file for this DOI if you have it on hand?

zhugeyicixin commented 5 years ago

@IAmGrootel Hi Alex, here is the HTML file for 10.1007/bf02663182.

paper_10.1007_bf02663182.txt

IAmGrootel commented 5 years ago

Thanks! I just submitted a pull request. Let me know if there are further issues.

hhaoyan commented 5 years ago

@zhugeyicixin Please close this issue if you think the problems have been solved. Thanks!