sgdescent opened 4 months ago
We initialize the encoder and decoder weights with pretrained model weights. Then we train on a data mix, including PMC, with simple non-markup targets for layout diversity. So yes, it is a simple OCR pretraining objective. In the following step the IDL data is removed from training.
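For anyone replicating this: the "simple OCR pretraining objective" described above is just teacher-forced next-token prediction on the plain-text target. A minimal sketch of one such step, with toy stand-ins for the pretrained encoder and decoder (all sizes and module choices here are illustrative assumptions, not the actual model code):

```python
import torch
import torch.nn as nn

VOCAB, D = 1000, 64  # hypothetical vocab size and hidden dim

class TinyOCRModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins; in practice these come from pretrained checkpoints.
        self.encoder = nn.Linear(D, D)                 # plays the vision encoder
        self.embed = nn.Embedding(VOCAB, D)
        self.decoder = nn.TransformerDecoderLayer(d_model=D, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(D, VOCAB)

    def forward(self, image_feats, token_ids):
        memory = self.encoder(image_feats)             # (B, S, D) image features
        tgt = self.embed(token_ids)                    # teacher forcing: gold tokens in
        mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        h = self.decoder(tgt, memory, tgt_mask=mask)   # causal cross-attention decode
        return self.lm_head(h)                         # (B, T, VOCAB) logits

model = TinyOCRModel()
imgs = torch.randn(2, 16, D)                           # fake patch features
tokens = torch.randint(0, VOCAB, (2, 12))              # plain-text (non-markup) targets
logits = model(imgs, tokens[:, :-1])                   # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
loss.backward()                                        # one pretraining step
```

The shifted-target cross-entropy is the whole objective; layout diversity comes purely from the data mix, not from the loss.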
Thanks a ton, this is helpful!
@lukas-blecher One more thing: from what I understood, the data mix you mention has PMC and IDL data with simple non-markup targets, and when you then train the model fully you use PMC + arXiv with full markup targets generated by the dataset generation code?
Yes, that sounds right, with most of the weight on arXiv, since it is the cleanest source. PMC's math is not always in a parsable format (e.g. images), or the inline math is just italic text.
@lukas-blecher Thank you for your response, but I am still unclear about the overall pretraining procedure.
Hey, I was looking into the paper as I want to replicate the work. In the data preparation step it is mentioned that PMC and IDL data are used for pre-training the model. Are both data sources used for simple next-token prediction to build OCR capabilities? Is there something I am missing?