Training dataset (sample)

Marplex / mcdse

Multilingual model for OCR-free document retrieval

MIT License


Training dataset (sample) #1

Closed by maxjeblick 8 hours ago

maxjeblick commented 3 days ago

Thanks a lot for open-sourcing the repo, including the training code. Do you plan to publish the training datasets? If not, would it be possible to release a small sample/dummy dataset in the format used for training?

Marplex commented 8 hours ago

Hi @maxjeblick, I think I will publish the training data with the next version. I'm planning to make a bigger and better dataset.

If you want to train on your own dataset, the training script expects two parquet files: train.parquet and corpus.parquet.

train.parquet has two columns: q and pos. q is the query, pos is the id of the corresponding image (its docid in corpus.parquet).
corpus.parquet has two columns: docid and image. docid is the id of the image, image is the image itself :)
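
For anyone wanting a dummy dataset in this shape, here is a minimal sketch using pandas. Only the file names and the four column names come from the description above; everything else is an assumption. In particular, the thread does not say how the image column is serialized, so the sketch stores placeholder bytes, and the integer ids, the example queries, and the helper names `missing_positives` and `write_dataset` are hypothetical.

```python
from pathlib import Path

import pandas as pd

# Hypothetical placeholder for encoded image bytes. The thread does not say
# how the training script expects images to be serialized, so adapt this.
FAKE_IMAGE = b"\x89PNG placeholder"

# corpus.parquet: one row per document image.
corpus = pd.DataFrame(
    {
        "docid": [0, 1, 2],
        "image": [FAKE_IMAGE, FAKE_IMAGE, FAKE_IMAGE],
    }
)

# train.parquet: one row per query, pointing at its image by id.
train = pd.DataFrame(
    {
        "q": ["total revenue in 2023", "who signed the contract?"],
        "pos": [0, 2],
    }
)


def missing_positives(train_df, corpus_df):
    """Return the `pos` ids in train_df that have no matching `docid`."""
    known = set(corpus_df["docid"])
    return sorted(set(train_df["pos"]) - known)


def write_dataset(out_dir):
    """Write both frames as parquet (needs pyarrow or fastparquet installed)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    train.to_parquet(out / "train.parquet", index=False)
    corpus.to_parquet(out / "corpus.parquet", index=False)
```

Calling `missing_positives(train, corpus)` before `write_dataset("data")` is a cheap way to catch queries whose `pos` does not point at any row of the corpus.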

maxjeblick commented 8 hours ago

Thanks for the clarification!