maxjeblick closed this issue 8 hours ago
Hi @maxjeblick, I think I will publish the training data in the next version. I'm planning to make a bigger and better dataset.
If you want to train with your own dataset, the training script expects two parquet files: train.parquet and corpus.parquet.
train.parquet has two columns: q and pos. q is the query, pos is the id of the image.
corpus.parquet has two columns: docid and image. docid is the id of the image, image is the image :)
Thanks for the clarification!
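For anyone who wants a dummy dataset in the layout described above, here is a minimal sketch using pandas and Pillow. The column names (q, pos, docid, image) come from the reply; everything else is an assumption: the thread does not say how the image column is encoded or what type the ids have, so this stores PNG-encoded bytes and string ids. The helper name make_dummy_frames and the placeholder queries are hypothetical. Check the training script's data loader before relying on this.

```python
import io

import pandas as pd
from PIL import Image


def make_dummy_frames(n_images=3):
    """Build tiny train/corpus DataFrames in the two-file layout.

    Assumptions (not confirmed in the thread): ids are strings and
    images are stored as PNG-encoded bytes.
    """
    train_rows, corpus_rows = [], []
    for i in range(n_images):
        # Solid-color 32x32 placeholder image.
        img = Image.new("RGB", (32, 32), color=((40 * i) % 256, 0, 0))
        buf = io.BytesIO()
        img.save(buf, format="PNG")

        docid = f"img-{i}"  # hypothetical id scheme
        corpus_rows.append({"docid": docid, "image": buf.getvalue()})
        # pos points at the docid of the image that answers the query.
        train_rows.append({"q": f"placeholder query for image {i}", "pos": docid})

    train_df = pd.DataFrame(train_rows, columns=["q", "pos"])
    corpus_df = pd.DataFrame(corpus_rows, columns=["docid", "image"])
    return train_df, corpus_df


if __name__ == "__main__":
    train_df, corpus_df = make_dummy_frames()
    # Writing parquet requires a parquet engine such as pyarrow.
    train_df.to_parquet("train.parquet", index=False)
    corpus_df.to_parquet("corpus.parquet", index=False)
```

Every pos value in train.parquet should match a docid in corpus.parquet, which is the only link between the two files that the reply describes.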
Thanks a lot for open sourcing the repo, including the training code. Do you plan to publish the training datasets? If not, would it be possible to release a small sample/dummy dataset that is in the format used for training?