Marplex / mcdse

Multilingual model for OCR-free document retrieval
MIT License

Training dataset (sample) #1

Closed by maxjeblick 8 hours ago

maxjeblick commented 3 days ago

Thanks a lot for open sourcing the repo, including the training code. Do you plan to publish the training datasets? If not, would it be possible to release a small sample/dummy dataset that is in the format used for training?

Marplex commented 8 hours ago

Hi @maxjeblick, I think I will publish the training data in the next version. I'm planning to make a bigger and better dataset.

If you want to train on your own dataset, the training script expects two Parquet files: train.parquet and corpus.parquet.
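As a rough sketch, a small pre-flight check can confirm that both files are in place before launching a training run. Only the two file names come from this thread. The helper name `missing_training_files`, and the assumption that both files sit in the same directory, are hypothetical and not taken from the training script; the check also says nothing about the columns each file needs.

```python
from pathlib import Path

# File names mentioned in this thread. Everything else in this snippet
# (helper name, single-directory layout) is an assumption for illustration.
EXPECTED_FILES = ("train.parquet", "corpus.parquet")


def missing_training_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    # Use the directory given on the command line, or the current one.
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_training_files(data_dir)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Both files found.")
```

Running this against a dataset directory lists whichever of the two files is absent, which may catch a misnamed file before the training script fails on it.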

maxjeblick commented 8 hours ago

Thanks for the clarification!