Dataset creation: Do we expect the .tex files to be just a single file for each corresponding PDF?

facebookresearch / nougat

Implementation of Nougat: Neural Optical Understanding for Academic Documents

https://facebookresearch.github.io/nougat/

MIT License


Dataset creation: Do we expect the .tex files to be just a single file for each corresponding PDF? #198

Closed by sgdescent 7 months ago

sgdescent commented 7 months ago

Hey,

Thank you for your work, the results are pretty cool!

I am trying to reproduce your work for a personal project. For dataset generation, the source archives downloaded from arXiv often contain many .tex files. I could merge them into a single .tex file with a simple script, so my question is: do you expect only a single .tex file per PDF, converted to HTML with LaTeXML?
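For illustration, the kind of merge script meant here could look roughly like the sketch below. It is only a rough idea under stated assumptions, not something taken from the Nougat pipeline: the helper names `find_main` and `flatten` are hypothetical, the main file is assumed to be the first .tex file containing `\documentclass`, and only `\input{...}` and `\include{...}` with paths relative to the main file's directory are handled. Comments are stripped so that commented-out inputs are ignored, which would also damage a literal `%` inside verbatim blocks.

```python
import re
from pathlib import Path

# Matches \input{name} and \include{name}; LaTeX lets the .tex extension be omitted.
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def strip_comment(line: str) -> str:
    """Remove an unescaped % comment (simplification: ignores verbatim and \\\\%)."""
    return re.sub(r"(?<!\\)%.*", "", line)


def find_main(root: Path) -> Path:
    """Assumption: the main file is the first .tex file that contains \\documentclass."""
    for path in sorted(root.rglob("*.tex")):
        if r"\documentclass" in path.read_text(encoding="utf-8", errors="replace"):
            return path
    raise FileNotFoundError(f"no .tex file with \\documentclass under {root}")


def flatten(tex_path: Path, root: Path, seen=None) -> str:
    """Return the text of tex_path with \\input/\\include files inlined recursively."""
    seen = set() if seen is None else seen
    tex_path = tex_path.resolve()
    if tex_path in seen:  # guard against circular includes
        return ""
    seen.add(tex_path)

    def inline(match: re.Match) -> str:
        child = root / match.group(1).strip()
        if child.suffix != ".tex":
            child = child.with_name(child.name + ".tex")
        if not child.is_file():
            return match.group(0)  # leave unresolved references untouched
        return flatten(child, root, seen)

    lines = tex_path.read_text(encoding="utf-8", errors="replace").splitlines()
    return "\n".join(INPUT_RE.sub(inline, strip_comment(line)) for line in lines)
```

With a hypothetical extracted source directory `paper_src`, this would be used as `main = find_main(Path("paper_src"))` followed by `merged = flatten(main, main.parent)`.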

Best, Saksham

lukas-blecher commented 7 months ago

We are using Engrafo to convert LaTeX project directories to a single HTML file.

sgdescent commented 7 months ago

Thanks, this was really helpful!