facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License

Data set generator Step2 and 3 #121

Open YazmineAbbaszadegan opened 1 year ago

YazmineAbbaszadegan commented 1 year ago

Step 2 assumes we have a .tex file for each of our scanned PDFs, which is then converted to the XML format. I don't see the logic of a scanned PDF having a .tex file, since a scanned PDF contains no digital text.

---> step 2- A directory containing the .html files (processed .tex files by LaTeXML) with the same folder structure
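
One way to read the "same folder structure" requirement is sketched below: each .tex file maps to an .html file at the same relative path under a separate output root. The function names `html_dest` and `latexmlc_command` and the directory names are hypothetical and not part of nougat; the `latexmlc --dest=... file.tex` invocation is the one quoted later in this thread.

```python
from pathlib import Path


def html_dest(tex_path, tex_root, html_root):
    """Map a .tex file under tex_root to the mirrored .html path under html_root."""
    relative = Path(tex_path).relative_to(tex_root)
    return Path(html_root) / relative.with_suffix(".html")


def latexmlc_command(tex_path, tex_root, html_root):
    """Build (but do not run) the latexmlc command for one .tex file."""
    dest = html_dest(tex_path, tex_root, html_root)
    return ["latexmlc", f"--dest={dest}", str(tex_path)]


if __name__ == "__main__":
    # Hypothetical layout: one folder per arXiv ID under "latex_uncompressed".
    cmd = latexmlc_command(
        "latex_uncompressed/2309.10799/main.tex", "latex_uncompressed", "html"
    )
    print(" ".join(cmd))
```

To actually convert a tree, one could loop over `Path(tex_root).rglob("*.tex")`, create each destination's parent directory, and pass the command list to `subprocess.run`.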

Step 3 and its links do not explain binary.jar well.

---> step3 -A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"

comment: it would be great if there was a notebook demonstrating how to get step2 and 3 for data generation

lukas-blecher commented 1 year ago

We train with arXiv papers that have both PDFs and LaTeX source code. I'm not sure I understand your question here.

The binary.jar is the executable of pdffigures2. You have to follow their build instructions to get it, and then you can use it in the next steps.

YazmineAbbaszadegan commented 1 year ago

Thank you, Lukas. What if our PDFs don't have LaTeX source code?

OrianeN commented 12 months ago

I struggled with Step 3 (building JAR for pdffigures2), so I'm posting what I did in case it can help others as well:

  1. Install Scala (I followed the instructions at https://docs.scala-lang.org/getting-started/index.html#using-the-scala-installer-recommended-way): curl -fL https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup - you might need to relaunch your terminal or run source ~/.profile to apply the PATH updates
  2. git clone pdffigures2 - as there is currently an issue with the sbt assembly, I cloned this fork: https://github.com/yasithdev/pdffigures2/tree/master
  3. Run sbt assembly - this should create a file called pdffigures2.jar in your working folder.
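
Once the JAR is built, a quick sanity check of the environment variable can catch path mistakes before the dataset scripts run. The sketch below is an assumption-laden helper, not part of nougat: the name `resolve_pdffigures_jar` is hypothetical, and it only verifies that `PDFFIGURES_PATH` (the variable named in Step 3) points at an existing .jar file.

```python
import os
from pathlib import Path


def resolve_pdffigures_jar(env=None):
    """Return the JAR path from PDFFIGURES_PATH, or raise if it looks wrong."""
    env = os.environ if env is None else env
    raw = env.get("PDFFIGURES_PATH")
    if not raw:
        raise RuntimeError("PDFFIGURES_PATH is not set")
    jar = Path(raw).expanduser()
    if jar.suffix != ".jar":
        raise RuntimeError(f"PDFFIGURES_PATH should point to a .jar file: {jar}")
    if not jar.is_file():
        raise FileNotFoundError(f"No file found at {jar}")
    return jar


if __name__ == "__main__":
    try:
        print("Using pdffigures2 JAR:", resolve_pdffigures_jar())
    except (RuntimeError, FileNotFoundError) as err:
        print("Problem with PDFFIGURES_PATH:", err)
```

The check does not start Java or validate the JAR's contents; it only confirms that the exported path resolves to a file.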

OrianeN commented 12 months ago

As for Step 2:

linxid commented 11 months ago

Thanks for your solution. I get an error when running latexmlc --dest=path.html path.tex: the output doesn't have images.

latexmlc (LaTeXML version 0.8.7)
processing started Thu Nov  2 11:46:11 2023
Warning:missing_file:algorithmic Can't find package algorithmic at algorithmic.sty.ltxml; line 26
Warning:unexpected:\end{document} Attempt to end document with open groups, environments or conditionals at main.tex; line 459 col 0
Warning:not_parsed:UNKNOWN.POSTSUBSCRIPT>OPEN MathParser failed to match rule 'Anything' at main.tex; line 149 col 20
Warning:not_parsed:UNKNOWN.POSTSUBSCRIPT>OPEN MathParser failed to match rule 'Anything' at main.tex; line 149 col 82
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 18
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 27
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 36
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 46
Warning:not_parsed:UNKNOWN.UNKNOWN>OPEN MathParser failed to match rule 'Anything' at main.tex; line 236 col 59
Warning:not_parsed:UNKNOWN.POSTSUBSCRIPT>OPEN MathParser failed to match rule 'Anything' at main.tex; line 349 col 0
Conversion complete: 10 warnings; 1 missing file[algorithmic.sty] (See /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/main.latexml.log)
Warning:missing_file:algorithmic Can't find package algorithmic at algorithmic.sty.ltxml; line 26
latexmlc (LaTeXML version 0.8.7)
recursive processing started Thu Nov  2 11:46:21 2023
recursive Conversion complete: No obvious problems
Status:conversion:0
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Autoen_Tra.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Autoen_Tra.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Autoen_Tra.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Patchify_images.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Patchify_images.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Patchify_images.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/AggreWin.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/AggreWin.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/AggreWin.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Parameter_network.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Parameter_network.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Figures/Parameter_network.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/newglobal.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/newglobal.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/newglobal.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Dualhyper.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Dualhyper.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Dualhyper.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Ablationstudy.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Ablationstudy.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Ablationstudy.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Comparisoon.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Comparisoon.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/Comparisoon.pdf
Error:imageprocessing:Read Image processing operation Read (/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/pics_kimio.pdf) returned Exception 499: not authorized `/home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/pics_kimio.pdf' @ error/constitute.c/ReadImage/412
Warning:expected:image Couldn't get usable image for /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/latex_uncompressed/2309.10799/pics_kimio.pdf
Post-processing complete: 9 warnings; 9 errors (See /home/ma-user/work/yimingcai/docagent/datasets/pdf_transform/main.latexml.log)
Status:conversion:2
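
The "not authorized" messages in the log above come from ImageMagick (they reference its constitute.c), and that wording usually indicates that ImageMagick's security policy forbids reading the file type, here PDF. This is a likely cause rather than a confirmed diagnosis for this run. A minimal sketch for checking a policy file follows; the function name `blocked_coders` and the sample policy text are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def blocked_coders(policy_xml):
    """Return coder names whose rights are "none" in an ImageMagick policy.xml."""
    blocked = set()
    for policy in ET.fromstring(policy_xml).iter("policy"):
        if policy.get("domain") != "coder":
            continue
        if policy.get("rights", "").strip().lower() != "none":
            continue
        # A pattern may be a single name ("PDF") or a brace list ("{PS,EPS,PDF}").
        for name in policy.get("pattern", "").strip("{}").split(","):
            if name.strip():
                blocked.add(name.strip().upper())
    return blocked


# Illustrative policy text, not copied from any particular system.
SAMPLE = """<policymap>
  <policy domain="coder" rights="none" pattern="PDF" />
  <policy domain="coder" rights="read|write" pattern="PNG" />
  <policy domain="resource" name="memory" value="256MiB" />
</policymap>"""

if __name__ == "__main__":
    print("PDF blocked:", "PDF" in blocked_coders(SAMPLE))
```

To check a real installation, pass the bytes of the system's policy.xml to `blocked_coders`; its location varies by distribution and ImageMagick version. If PDF is listed, relaxing that policy entry is a system configuration decision with security implications.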

KartavyaBagga commented 10 months ago

thank you lukas. what if our pdfs dont have latex source code?

If your PDFs don't have LaTeX source, what method did you implement for training?