clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0

Try out scienceparse #21

Closed: kwalcock closed this issue 2 years ago

maxaalexeeva commented 2 years ago

In case it helps, Becky and Marco wrote a bunch of science-parse-related code in this project: https://github.com/ml4ai/automates/blob/master/automates/text_reading/src/main/scala/org/clulab/aske/automates/scienceparse/ScienceParseDocument.scala

kwalcock commented 2 years ago

Thanks. I didn't know about that. I'll take a look.

kwalcock commented 2 years ago

It looks like the automates project uses scienceparse via a local web service running in a Docker container. For this project, I'd like to stick with this design, which doesn't require Docker and uses updated scienceparse code as a library. It has a nasty 2GB+ download the first time it gets called, but that's about the size of the Docker image anyway, and the test has been disabled so that only interested parties incur the download.