bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Run on CSD3 #24

Closed: kpu closed this issue 4 years ago

kpu commented 4 years ago

You have the instructions to get a login on CSD3. Please produce a working version there.

Presumably depends on #22.

lpla commented 4 years ago

The cmake version on CSD3 is quite old (2.8.12.2). Downloading a newer binary release fixes this: I downloaded the latest, 3.16.5, put it in my ~/bin/, and added that directory to my $PATH. mvn is also needed and is not installed on CSD3, so I installed it the same way.
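As a quick way to confirm that the right tools are picked up from $PATH, the check could be sketched in Python as below. This is only an illustration: the helper names `parse_cmake_version` and `check_build_tools` are made up for this example, and the `min_cmake` default is a placeholder, since the thread does not state which cmake version pdf-extract actually requires.

```python
import re
import shutil
import subprocess


def parse_cmake_version(output):
    """Return the version from `cmake --version` output as a tuple of ints."""
    match = re.search(r"cmake version (\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError("no cmake version found in: %r" % output)
    return tuple(int(part) for part in match.group(1).split("."))


def check_build_tools(min_cmake=(3, 0)):
    """List problems with the cmake and mvn found on PATH.

    min_cmake is a placeholder (assumption), not a documented requirement.
    """
    problems = []
    for tool in ("cmake", "mvn"):
        if shutil.which(tool) is None:
            problems.append(tool + " not found on PATH")
    if shutil.which("cmake") is not None:
        result = subprocess.run(
            ["cmake", "--version"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        version = parse_cmake_version(result.stdout)
        # Tuples compare element by element, so (2, 8, 12, 2) < (3, 0).
        if version < min_cmake:
            found = ".".join(str(n) for n in version)
            problems.append("cmake " + found + " is older than the assumed minimum")
    return problems


if __name__ == "__main__":
    for problem in check_build_tools():
        print(problem)
```

An empty result means both tools were found and cmake is at least the assumed minimum; it does not guarantee that the build itself succeeds.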

Once these dependencies are in place, you can install the pdf-extract Python wrapper in a virtual environment:

python3 -m venv paracrawlenv
source paracrawlenv/bin/activate
pip install git+https://github.com/bitextor/python-pdfextract.git@cld3-installer

And that's it. I tested it through bitextor-warc2htmlwarc.py and in the Python interpreter:

python3
>>> from pdfextract.extract import Extractor
>>> with open("/home/cs-pla1/forcada16j.pdf", "rb") as pdf_file:
...     pdfcontent = pdf_file.read()
...
>>> extractor = Extractor(pdf=pdfcontent)
>>> extractor.getHTML()
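The same smoke test could be wrapped in a small script so it can be rerun on other PDFs. This is a rough sketch, not part of python-pdfextract: `pdf_to_html` and its `extractor_factory` parameter are hypothetical names introduced here, and the handling of both `str` and `bytes` is a precaution because the return type of `getHTML()` is not stated in this thread.

```python
import sys


def pdf_to_html(pdf_path, html_path, extractor_factory=None):
    """Convert one PDF to HTML and return the number of bytes written.

    extractor_factory is a hypothetical hook for swapping in a stand-in
    extractor; by default the Extractor class used above is imported.
    """
    if extractor_factory is None:
        # Imported lazily so this file loads even where pdfextract is absent.
        from pdfextract.extract import Extractor
        extractor_factory = Extractor

    with open(pdf_path, "rb") as pdf_file:
        pdf_content = pdf_file.read()

    html = extractor_factory(pdf=pdf_content).getHTML()
    # Assumption: the result may be text or bytes, so accept either.
    data = html if isinstance(html, bytes) else str(html).encode("utf-8")

    with open(html_path, "wb") as html_file:
        html_file.write(data)
    return len(data)


if __name__ == "__main__" and len(sys.argv) == 3:
    written = pdf_to_html(sys.argv[1], sys.argv[2])
    print("wrote", written, "bytes to", sys.argv[2])
```

Saved under a name of your choice, it would be run inside the activated environment with the input PDF path and the output HTML path as its two arguments.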

If this process looks right to you, please close the issue.

jelmervdl commented 4 years ago

There might be another way: newer versions of cmake are available on CSD3 through module load cmake (use module avail cmake to list them). I've been using that to compile most of the C++ code.

I've also seen a version of Maven (3.5, I think) available the same way.

dionwiggins commented 4 years ago

Closing as already running on CSD3.