CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

[AllParsers] Special HTML symbols in parser. #35

Closed hhaoyan closed 5 years ago

hhaoyan commented 5 years ago

For all parsers, pay attention to special HTML symbols in parsed metadata. For example, DOI 10.1002/adsc.201190008 (Wiley) has "Advanced Synthesis &amp; Catalysis", which should be "Advanced Synthesis & Catalysis". This is to be solved in the Wiley parser @zjensen262

@zhugeyicixin Could you find journals in Springer that have similar problems?

zhugeyicixin commented 5 years ago

Sure, I'll do it.

hhaoyan commented 5 years ago

@zjensen262 I saw your format_text implementation in the latest pull request. However, I think a better approach than regular expressions is to use an existing library, as described in https://stackoverflow.com/a/2087446/2310794.
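
A minimal sketch of the library-based approach, assuming Python 3.4 or later, where the standard library provides `html.unescape`. The helper name `clean_metadata` is hypothetical and is not the `format_text` function from the pull request.

```python
import html


def clean_metadata(value: str) -> str:
    """Decode HTML character references (named or numeric) in a metadata string.

    Hypothetical helper for illustration; not part of LimeSoup.
    """
    return html.unescape(value)


# Named entity, as in the Wiley journal title from this issue.
print(clean_metadata("Advanced Synthesis &amp; Catalysis"))
# Named and numeric references are both handled by the same call.
print(clean_metadata("Monatshefte f&uuml;r Chemie"))
print(clean_metadata("&#8776;"))
```

Delegating to the standard library avoids maintaining a hand-written table of entities, which is where regular-expression approaches tend to miss cases.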

zhugeyicixin commented 5 years ago

I checked the Springer parser and wrote a new test function (see #37). There are no HTML special characters in the parsed result. However, we might want to watch for special characters that are not human-readable. Currently, the test function issues a warning if a character is not in one of the "normal sets" (the exact definition needs discussion): (1) ASCII, (2) extended Latin, (3) Greek letters.
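
A rough sketch of such a check, under assumptions: the code point ranges below are one possible reading of the three "normal sets", and the function names are hypothetical rather than the ones used in the actual test.

```python
import warnings


def is_normal_char(ch: str) -> bool:
    """Return True if ch falls in one of the assumed 'normal' Unicode ranges."""
    cp = ord(ch)
    return (
        cp <= 0x007F                 # (1) ASCII (Basic Latin block)
        or 0x0080 <= cp <= 0x024F    # (2) Latin-1 Supplement through Latin Extended-B
        or 0x0370 <= cp <= 0x03FF    # (3) Greek and Coptic block
    )


def unusual_chars(text: str) -> list:
    """Return the distinct characters outside the normal sets, sorted by code point."""
    found = sorted({ch for ch in text if not is_normal_char(ch)})
    if found:
        # Warn instead of failing, mirroring the warning-only behavior described above.
        warnings.warn("Unusual characters found: %r" % found)
    return found


print(unusual_chars("Monatshefte für Chemie"))   # nothing outside the normal sets
print(unusual_chars("Pfl\ufffdgers Archiv"))     # contains U+FFFD
```

Note that a range-based check like this also flags legitimate symbols outside the three blocks, so it is only suitable as a warning.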

zhugeyicixin commented 5 years ago

As an example of the weird special characters, here are some journal names (taken from scraping rather than parsing, and used only as examples) that are hard to read.

Pfl�gers Archiv - European Journal of Physiology
Pfl�gers Archiv European Journal of Physiology
Fresenius' Zeitschrift f�r Analytische Chemie
Monatshefte f�r Chemie/Chemical Monthly
Zeitschrift f�r Physik D Atoms, Molecules and Clusters
Monatshefte f�r Chemie Chemical Monthly
Monatshefte f�r Chemie
Monatshefte f�r Chemie / Chemical Monthly
Zeitschrift f�r Physik B Condensed Matter
Zeitschrift f�r Physik B Condensed Matter and Quanta
Zeitschrift f�r Analytische Chemie
Archiv f�r Mikrobiologie
Langenbecks Archiv f�r Chirurgie
Fresenius Zeitschrift f�r Analytische Chemie
Zeitschrift f�r Physik
Naunyn-Schmiedebergs Archiv f�r Experimentelle Pathologie und Pharmakologie
Zeitschrift f�r Lebensmittel-Untersuchung und -Forschung
Archiv f�r Elektrotechnik
Internationales Archiv f�r Arbeitsmedizin
Archiv f�r Toxikologie
Zeitschrift f�r Rheumatologie
Zeitschrift f�r Physik A Atoms and Nuclei
Archiv f�r Klinische und Experimentelle Dermatologie
Zeitschrift f�r Physik A Atomic Nuclei
Journal of Orofacial Orthopedics / Fortschritte der Kieferorthop�die
Naunyn-Schmiedebergs Archiv f�r Pharmakologie und Experimentelle Pathologie
Naunyn-Schmiedeberg's Archiv f�r Experimentelle Pathologie und Pharmakologie
≪UML≫ 2000 - The Unified Modeling Language
B - Ba … Cu - Zr
≪UML≫ 2001 - The Unified Modeling Language. Modeling Languages, Concepts, and Tools
·Nature
«Nature

hhaoyan commented 5 years ago

This is due to the file encoding. Perhaps the HTML file is encoded in ISO 8859-1 but you opened it as UTF-8. The weird symbols stand in for non-ASCII letters from European alphabets, such as the German "ü" and "ä".
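
The suspected mismatch can be reproduced in a few lines. This is only an illustration of the mechanism, assuming the scraped bytes were ISO 8859-1 and were decoded as UTF-8 with replacement of invalid bytes; it is not taken from the scraper code.

```python
# A title containing a non-ASCII letter, stored as ISO 8859-1 bytes.
title = "Monatshefte für Chemie"
raw = title.encode("iso-8859-1")   # "ü" becomes the single byte 0xFC

# Decoding those bytes as UTF-8: 0xFC is not valid UTF-8, so with
# errors="replace" it is turned into U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the title.
right = raw.decode("iso-8859-1")

print(wrong)
print(right)
print(right == title)
```

The same happens with "ä" (byte 0xE4) in "Kieferorthopädie", which matches the pattern in the journal names listed above.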

hhaoyan commented 5 years ago

The test function get_non_ascii_latin in soup_tester.py currently checks for non-ASCII or non-Latin characters. Since these problems are not caused by the parser, consider removing it from the unit tests. Maybe open an issue in the scraper repo instead.

hhaoyan commented 5 years ago

An update: maybe check for special characters only; see https://en.wikipedia.org/wiki/Specials_(Unicode_block)

Many non-ASCII or non-Latin characters are actually useful, such as '≈' and '∞'.

Rolling back the test function...
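
A narrower check along these lines could look like the sketch below. It assumes the Specials block is U+FFF0 to U+FFFF, as given on the linked page; the function name `find_specials` is hypothetical and this is not the code from the resolving commit.

```python
# Unicode "Specials" block, which includes U+FFFD (the replacement character).
SPECIALS_START = 0xFFF0
SPECIALS_END = 0xFFFF


def find_specials(text: str) -> list:
    """Return (index, character) pairs for characters in the Specials block."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if SPECIALS_START <= ord(ch) <= SPECIALS_END
    ]


# A mis-decoded title is flagged at the position of the replacement character.
print(find_specials("Pfl\ufffdgers Archiv"))
# Useful symbols such as '≈' and '∞' lie outside the block and are not flagged.
print(find_specials("x ≈ ∞"))
```

Restricting the test to this block catches decoding damage without rejecting legitimate mathematical or non-Latin characters.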

hhaoyan commented 5 years ago

Resolved in 5feae369f0245d102c3497f43884996fab7a55db.