facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License

Pretraining Objectives? #207

Open sgdescent opened 4 months ago

sgdescent commented 4 months ago

Hey, I was looking into the paper as I want to replicate the work. In the data preparation step it is mentioned that PMC and IDL data are used for pre-training the model. Are both data sources used for simple next-token prediction to build OCR capabilities, or is there something I am missing?

lukas-blecher commented 4 months ago

We initialize the encoder and decoder weights with pretrained model weights. Then we train on a data mix that includes PMC with simple non-markup targets for layout diversity. So yes, it is a simple OCR pretraining objective. In the following step, the IDL data is removed from training.
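
The "simple OCR pretraining objective" mentioned above is ordinary teacher-forced next-token prediction with a cross-entropy loss on plain-text targets. A minimal sketch, assuming a generic image encoder and autoregressive text decoder (this is not the repository's actual training loop; `encoder`, `decoder`, and their signatures are placeholders):

```python
# Minimal sketch of the simple OCR pretraining objective: teacher-forced
# next-token prediction on plain-text targets. Placeholder encoder/decoder
# callables stand in for the pretrained image encoder and text decoder.
import torch.nn.functional as F

def ocr_pretrain_loss(encoder, decoder, page_images, target_ids, pad_id):
    """page_images: (B, C, H, W) rendered pages from the pretraining mix.
    target_ids: (B, T) tokenized plain-text transcription (no markup)."""
    visual_embeds = encoder(page_images)               # (B, S, D) visual features
    decoder_input = target_ids[:, :-1]                 # decoder sees tokens < t
    labels = target_ids[:, 1:]                         # and predicts token t
    logits = decoder(decoder_input, encoder_hidden_states=visual_embeds)  # (B, T-1, V)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # flatten batch and time
        labels.reshape(-1),
        ignore_index=pad_id,                           # skip padding positions
    )
```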

sgdescent commented 4 months ago

Thanks a ton, this is helpful!

sgdescent commented 4 months ago

@lukas-blecher One more thing: from what I understood, the data mix you mention for pretraining contains PMC and IDL data with these simple non-markup targets, and when you train the model fully you use PMC + arXiv with full markup targets generated by the dataset generation code?

lukas-blecher commented 4 months ago

Yes, that sounds right, with most of the weight on arXiv since it is the cleanest source. PMC's math is not always in a parsable format (e.g. images), or the inline math is just italic text.
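
"Most of the weight on arXiv" can be read as sampling training pages from the sources with unequal probability. A small sketch with made-up proportions (the actual ratios are not stated in this thread):

```python
# Weighted sampling over data sources for the full-markup training stage.
# The weights below are illustrative assumptions, not the paper's ratios;
# the point is simply that arXiv gets most of the probability mass.
import random

SOURCES = ["arxiv", "pmc"]
WEIGHTS = [0.9, 0.1]  # assumed proportions

def sample_source() -> str:
    """Pick which corpus the next training page is drawn from."""
    return random.choices(SOURCES, weights=WEIGHTS, k=1)[0]
```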

sgdescent commented 4 months ago

@lukas-blecher Thank you for your response, but I am still unclear about the whole procedure for pretraining the model:

  1. If the loss function is the same, why two stages? If the loss function is not the same, what are the objectives for the 1st stage and for the 2nd stage?
  2. If there are two stages, what is the training schedule? In the paper you only mention training the model for 3 epochs. Is that for pretraining (stage 1) or for training (stage 2)?
  3. For pretraining, which sources does your data contain: is it arXiv + PMC + IDL? And are the papers used in pretraining (for example, from arXiv) used again in training?
  4. For pretraining, you mentioned using only non-markup data. Does that mean you mask out the markup when computing the loss? If not, do you have a simple script that keeps only pages without markup for training, and is that script run on the .mmd files generated by the dataset generation script? (A hypothetical version of such a filter is sketched after this list.)
  5. Finally, for training (stage 2), are some PMC files removed based on a criterion applied to the .mmd file, for the cases where PMC's math is not parsable?
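
To make question 4 concrete, here is a hypothetical page filter of the kind being asked about: it keeps only .mmd pages that contain no markup tokens. The regex heuristic is an assumption for illustration and is not confirmed to be what the authors did:

```python
# Hypothetical filter that keeps only markup-free .mmd pages. The heuristic
# (reject a page containing LaTeX commands, math delimiters, or Markdown
# tokens) is an illustrative assumption, not the authors' preprocessing.
import re
from pathlib import Path

MARKUP_TOKENS = re.compile(r"\\[a-zA-Z]+|\\\(|\\\[|\$\$?|[#*_`]")

def is_plain_text_page(mmd_text: str) -> bool:
    """True if the page contains none of the markup tokens above."""
    return MARKUP_TOKENS.search(mmd_text) is None

def plain_text_pages(mmd_dir: str):
    """Yield paths of .mmd files that survive the filter."""
    for path in sorted(Path(mmd_dir).glob("*.mmd")):
        if is_plain_text_page(path.read_text(encoding="utf-8")):
            yield path
```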

sgdescent commented 4 months ago

Also one more question: I looked into the IDL data and the JSON object returned for each file. Do you convert it into a text document by parsing the JSON response?
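
For reference, converting an OCR-style JSON response into a plain-text document usually just means concatenating the recognized text page by page. A sketch under assumed field names ("pages", "words", and "text" are guesses, not the actual IDL schema or the authors' conversion script):

```python
# Hedged sketch: flatten an OCR-style JSON response into plain text.
# The keys "pages", "words", and "text" are assumptions about the schema.
import json

def json_to_text(json_path: str) -> str:
    with open(json_path, encoding="utf-8") as f:
        doc = json.load(f)
    pages = []
    for page in doc.get("pages", []):                                # assumed key
        words = [w.get("text", "") for w in page.get("words", [])]   # assumed keys
        pages.append(" ".join(words))
    return "\n\n".join(pages)                                        # blank line between pages
```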