CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

Formulas in gif format #7

Open OlgaGKononova opened 6 years ago

OlgaGKononova commented 6 years ago

I found that some ECS papers use GIF images for formulas and numbers. For example, http://jes.ecsdl.org/content/157/3/J69.full contains:

<span class="inline-formula" id="inline-formula-38"><img class="math mml" alt="Formula" src="J69/embed/mml-math-38.gif"

Can we check how many such cases there are and do something about it? Thank you.

shaunrong commented 6 years ago

Nice catch, @OlgaGKononova! TY.

This can be fixed quickly by writing an OCR ingredient that targets these GIFs and converts them to strings using pytesseract.
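As a rough illustration of what such an ingredient might do, here is a minimal sketch that finds `<img>` tags with `alt="Formula"` and replaces each with recognized text. The function name `replace_formula_images` and the `ocr` callback are hypothetical and not part of LimeSoup. The OCR step is a stub here; in practice it could be something like `pytesseract.image_to_string(Image.open(path))` after downloading the GIF, which is not shown.

```python
import re
from typing import Callable

# Any <img ...> tag, plus helpers to inspect its attributes.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
ALT_FORMULA = re.compile(r'\balt\s*=\s*"Formula"', re.IGNORECASE)
SRC_ATTR = re.compile(r'\bsrc\s*=\s*"([^"]*)"', re.IGNORECASE)


def replace_formula_images(html: str, ocr: Callable[[str], str]) -> str:
    """Replace each formula <img> tag with the text returned by ocr(src).

    `ocr` is a hypothetical callback: it receives the image's src value
    and returns the recognized string (empty if nothing was recognized).
    """

    def _substitute(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if not ALT_FORMULA.search(tag) or src is None:
            return tag  # not a formula image; leave it untouched
        text = ocr(src.group(1)).strip()
        return text if text else tag  # keep the tag when OCR finds nothing

    return IMG_TAG.sub(_substitute, html)


def fake_ocr(src: str) -> str:
    # Stand-in for a real OCR call; the mapping below is made-up test data.
    return {"J69/embed/mml-math-38.gif": "x = 2"}.get(src, "")
```

Passing the OCR step as a callback keeps the HTML handling testable without Tesseract installed; tags that OCR cannot read are left in place so they can be reviewed later.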

@tiagobotari let me know if you can take care of this issue. If not, I will push an ingredient component.

hhaoyan commented 5 years ago

This MongoDB query counts the affected ECS papers:

db.getCollection('Paper_Raw_HTML').find({Publisher: 'ECS', Paper_Raw_HTML: /img alt="Formula"/}).count()

3019 papers have this issue. Their DOIs are attached:

broken_dois.txt
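One thing that may be worth checking: the snippet quoted at the top of this thread has `class="math mml"` between `img` and `alt`, so the literal pattern `img alt="Formula"` would not match that exact markup. Whether the stored raw HTML differs from the rendered page is not stated here. The sketch below compares the query's pattern with a looser one on a few made-up records. The `LOOSE` pattern and the `count_affected` helper are assumptions for illustration only; the field names come from the query above.

```python
import re

# Same pattern as the MongoDB query: "alt" must directly follow "img".
STRICT = re.compile(r'img alt="Formula"')
# Looser variant (an assumption): alt="Formula" anywhere inside an <img> tag.
LOOSE = re.compile(r'<img\b[^>]*\balt="Formula"')


def count_affected(papers, pattern):
    """Count ECS papers whose raw HTML matches the given pattern."""
    return sum(
        1
        for paper in papers
        if paper.get("Publisher") == "ECS"
        and pattern.search(paper.get("Paper_Raw_HTML", ""))
    )


# Tiny made-up records, only to exercise the two patterns.
sample = [
    {"Publisher": "ECS", "Paper_Raw_HTML": '<img alt="Formula" src="a.gif">'},
    {"Publisher": "ECS",
     "Paper_Raw_HTML": '<img class="math mml" alt="Formula" src="b.gif">'},
    {"Publisher": "ECS", "Paper_Raw_HTML": "<p>no formula images</p>"},
    {"Publisher": "RSC", "Paper_Raw_HTML": '<img alt="Formula" src="c.gif">'},
]

strict_count = count_affected(sample, STRICT)
loose_count = count_affected(sample, LOOSE)
```

If the two counts differ on the real collection, the figure above would be a lower bound on the number of affected papers.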