processing of the pre-training dataset IIT CDIP 1.0

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Apache License 2.0


processing of the pre-training dataset IIT CDIP 1.0 #82

Open kenneys-bot opened 9 months ago

kenneys-bot commented 9 months ago

Can you please provide the code used to process the pre-training dataset IIT CDIP 1.0? I am trying to retrain the weights for use with a new encoder. Any help from the developers would be greatly appreciated.

kenneys-bot commented 9 months ago

This is for GeoLayoutLM.