kpu closed this issue 4 years ago
The cmake version on CSD3 is quite old (2.8.12.2). Downloading a newer binary fixes this issue (I downloaded the latest, 3.16.5, put it in my ~/bin/ and added that directory to my $PATH). Also, mvn is needed and is not installed on CSD3, so I did the same for it.
Once these dependencies are in place, you can install the pdf-extract Python wrapper:
python3 -m venv paracrawlenv
source paracrawlenv/bin/activate
pip install git+https://github.com/bitextor/python-pdfextract.git@cld3-installer
And that's it. I tested it through bitextor-warc2htmlwarc.py and the Python interpreter:
python3
>>> from pdfextract.extract import Extractor
>>> file = open("/home/cs-pla1/forcada16j.pdf","rb")
>>> pdfcontent = file.read()
>>> extractor = Extractor(pdf=pdfcontent)
>>> extractor.getHTML()
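For anyone who wants to repeat that check from a script rather than the interpreter, here is a minimal sketch. It only uses the two calls shown above (Extractor(pdf=...) and getHTML()); the function name pdf_to_html and the extractor_cls parameter are hypothetical, added so the function can be exercised with a stand-in class where pdfextract is not installed.

```python
from pathlib import Path


def pdf_to_html(pdf_path, extractor_cls=None):
    """Read a PDF file as bytes and return the HTML that the extractor produces.

    extractor_cls is a hypothetical hook for testing; by default the
    Extractor class from pdfextract is used, as in the session above.
    """
    if extractor_cls is None:
        # Imported lazily so this module still loads where pdfextract is missing.
        from pdfextract.extract import Extractor as extractor_cls
    pdf_bytes = Path(pdf_path).read_bytes()  # reads and closes the file
    extractor = extractor_cls(pdf=pdf_bytes)
    return extractor.getHTML()


# Example call (the path is whatever PDF you want to test with):
#   html = pdf_to_html("/path/to/some.pdf")
```

Reading the file through Path.read_bytes() also avoids leaving an open file handle around, which the interactive version does.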
If you find this process right, please close the issue.
There might be another way: newer versions of cmake are available on CSD3 using module load cmake (or module avail cmake to list them). I've been using that to compile most of the C++ code.
I've also seen a version of maven (version 3.5, I think) available through the same method.
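Whichever route is used (a binary in ~/bin/ or module load), it can help to confirm which cmake and mvn actually end up on the PATH before installing the wrapper. A rough sketch, assuming both tools print a dotted version number in response to --version; the helper names parse_version and tool_version are made up for this example.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def tool_version(name):
    """Return the version of the tool found on PATH, or None if it is not found."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return parse_version(result.stdout + result.stderr)


# Example use:
#   for tool in ("cmake", "mvn"):
#       print(tool, tool_version(tool))
```

Version tuples compare element by element, so parse_version("cmake version 2.8.12.2") sorts below parse_version("cmake version 3.16.5").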
Closing as already running on CSD3.
You have the instructions to get a login on CSD3. Please produce a working version on there.
Presumably depends on #22.