Open YazmineAbbaszadegan opened 1 year ago
We train with arXiv papers that have both PDFs and LaTeX source code. I'm not sure I understand your question here.
The binary.jar is the pdffigures2 executable. You have to follow their build instructions to produce it, and then you can use it in the next steps.
Thank you, Lukas. What if our PDFs don't have LaTeX source code?
I struggled with Step 3 (building JAR for pdffigures2), so I'm posting what I did in case it can help others as well:
curl -fL https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup
- you might need to relaunch your terminal or run source ~/.profile to apply the PATH updates
- to run sbt assembly, I cloned this fork: https://github.com/yasithdev/pdffigures2/tree/master
sbt assembly
- this should create a file called pdffigures2.jar in your working folder.
As for Step 2:
apt-get install -y latexml
latexmlc --dest=path.html path.tex
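For a whole directory of papers, the latexmlc call above can be wrapped in a loop. This is a minimal sketch, assuming the .html output should mirror the folder layout of the .tex input, as Step 2 of the data-generation instructions asks for. The function name convert_tree and the LATEXMLC override variable are hypothetical, not part of the project. Note that it converts every .tex file it finds, including files that are only \input by a main file, so you may want to narrow the find pattern.

```shell
# Hypothetical helper: convert every .tex file under $1 into an .html file
# under $2, keeping the same relative folder structure.
# LATEXMLC is an assumed override for the converter; it defaults to latexmlc.
# The body runs in a subshell, so its variables do not leak into the caller.
convert_tree() (
    src="${1%/}"
    dest="${2%/}"
    converter="${LATEXMLC:-latexmlc}"
    find "$src" -type f -name '*.tex' | while IFS= read -r tex; do
        rel="${tex#"$src"/}"             # path relative to the source root
        out="$dest/${rel%.tex}.html"     # same layout, .html extension
        mkdir -p "$(dirname "$out")"
        # Keep going if one paper fails; report it on stderr instead.
        # stdin is redirected so the converter cannot swallow the file list.
        "$converter" --dest="$out" "$tex" < /dev/null || echo "failed: $tex" >&2
    done
)

# Example call (directory names are placeholders):
# convert_tree latex_uncompressed html_out
```

Failures are only logged, so a single broken paper does not stop the batch.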
Thanks for your solution. I get an error when running latexmlc --dest=path.html path.tex: the output doesn't have any images.
latexmlc (LaTeXML version 0.8.7)
processing started Thu Nov 2 11:46:11 2023
Warning:missing_file:algorithmic Can't find package algorithmic at algorithmic.sty.ltxml; line 26
Warning:unexpected:\end{document} Attempt to end document with open groups, environments or conditionals at main.tex; line 459 col 0
Warning:not_parsed:UNKNOWN.POSTSUBSCRIPT>OPEN MathParser failed to match rule 'Anything' at main.tex; line 149 col 20
Warning:not_parsed:UNKNOWN.POSTSUBSCRIPT>OPEN MathParser failed to match rule 'Anything' at main.tex; line 149 col 82
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 18
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 27
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 36
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 46
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 59
Warning:not_parsed:UNKNOWN.POSTSUBSCRIPT>OPEN MathParser failed to match rule 'Anything' at main.tex; line 349 col 0
Conversion complete: 10 warnings; 1 missing file[algorithmic.sty] (See /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/main.latexml.log)
Warning:missing_file:algorithmic Can't find package algorithmic at algorithmic.sty.ltxml; line 26
latexmlc (LaTeXML version 0.8.7)
recursive processing started Thu Nov 2 11:46:21 2023
recursive Conversion complete: No obvious problems
Status:conversion:0
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Autoen_Tra.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Autoen_Tra.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Autoen_Tra.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Patchify_images.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Patchify_images.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Patchify_images.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/AggreWin.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/AggreWin.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/AggreWin.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Parameter_network.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Parameter_network.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Parameter_network.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/newglobal.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/newglobal.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/newglobal.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Dualhyper.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Dualhyper.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Dualhyper.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Ablationstudy.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Ablationstudy.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Ablationstudy.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Comparisoon.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Comparisoon.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Comparisoon.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/pics_kimio.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/pics_kimio.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/pics_kimio.pdf
Post-processing complete: 9 warnings; 9 errors (See /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/main.latexml.log)
Status:conversion:2
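The "Exception 499: not authorized ... @ error/constitute.c/ReadImage/412" lines in the log above come from ImageMagick, which LaTeXML calls to convert figures, rather than from LaTeXML itself. On many Linux distributions ImageMagick's policy.xml disables the PDF coder, and that produces this kind of message. Below is a minimal sketch for checking whether that is the case here. The policy file locations and the pdf_blocked helper are assumptions, not something taken from this thread, and the check is line-based, so it ignores rules inside single-line XML comments only.

```shell
# Hypothetical helper: succeed if the given ImageMagick policy file has an
# active rule that denies all rights on a coder pattern that includes PDF.
pdf_blocked() {
    grep -v '<!--' "$1" 2>/dev/null \
        | grep 'rights="none"' \
        | grep -q 'pattern="[^"]*PDF'
}

# Assumed common locations; your distribution may use a different path.
for policy in /etc/ImageMagick-6/policy.xml /etc/ImageMagick-7/policy.xml; do
    if pdf_blocked "$policy"; then
        echo "PDF reading is disabled in $policy"
    fi
done
```

If such a rule shows up, the usual options are to relax it for the PDF pattern (for example rights="read|write") or to convert the figures to PNG before running latexmlc. Check your system's security requirements before changing the policy.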
Thank you, Lukas. What if our PDFs don't have LaTeX source code?
If your PDFs don't have LaTeX source, what method did you implement for training?
Step 2 assumes we have a .tex file for each of our scanned PDFs, which is then converted to the XML format. I don't see the logic of a scanned PDF having a .tex file, since a scan contains no digital text.
---> Step 2 - A directory containing the .html files (processed .tex files by LaTeXML) with the same folder structure
Step 3 and its links do not explain binary.jar well.
---> Step 3 - A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"
Comment: it would be great if there were a notebook demonstrating how to do Steps 2 and 3 of the data generation.
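Until such a notebook exists, the Step 3 requirement comes down to pointing PDFFIGURES_PATH at the JAR that sbt assembly produces. This is a minimal sketch, assuming the JAR is named like pdffigures2*.jar and sits somewhere under the pdffigures2 checkout. The find_pdffigures_jar helper, the PDFFIGURES_CHECKOUT variable, and the ./pdffigures2 default directory are hypothetical.

```shell
# Hypothetical helper: print the first pdffigures2 JAR found under a directory.
find_pdffigures_jar() {
    find "$1" -type f -name 'pdffigures2*.jar' 2>/dev/null | head -n 1
}

# PDFFIGURES_CHECKOUT is an assumed variable pointing at the cloned repo.
jar="$(find_pdffigures_jar "${PDFFIGURES_CHECKOUT:-./pdffigures2}")"
if [ -n "$jar" ]; then
    export PDFFIGURES_PATH="$jar"
    echo "PDFFIGURES_PATH=$PDFFIGURES_PATH"
else
    echo "no pdffigures2 JAR found; run sbt assembly first" >&2
fi
```

The exported path is relative if the checkout path is relative, so pass an absolute directory if later steps run from a different working folder.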