KevinMusgrave / pytorch-metric-learning

The easiest way to use deep metric learning in your application. Modular, flexible, and extensible. Written in PyTorch.
https://kevinmusgrave.github.io/pytorch-metric-learning/
MIT License
6k stars 658 forks

Addition of popular benchmark datasets #722

Open ir2718 opened 1 week ago

ir2718 commented 1 week ago

Hi,

I find that it's nice to have a few benchmark datasets integrated into libraries for easier research. My feature request boils down to implementing a few image retrieval datasets, namely CUB, Cars196, Stanford Online Products, and iNaturalist. Most image retrieval papers use these datasets to benchmark new methods and models. @KevinMusgrave, if you agree with this request, I can create a PR.

Additionally, some kind of integration with Hugging Face datasets might be nice for text retrieval/text similarity, but I'm not sure how useful it would be, since sentence-transformers is probably the most commonly used library for such things. It would also introduce an external dependency, so I'd like to hear your opinion on this.

KevinMusgrave commented 5 days ago

Thanks for the suggestions!

> I find that it's nice to have a few benchmark datasets integrated into libraries for easier research. My feature request boils down to implementing a few image retrieval datasets, namely CUB, Cars196, Stanford Online Products, and iNaturalist. Most image retrieval papers use these datasets to benchmark new methods and models. @KevinMusgrave, if you agree with this request, I can create a PR.

Would the dataset classes download the datasets? Are those datasets readily available for download these days?

> Additionally, some kind of integration with Hugging Face datasets might be nice for text retrieval/text similarity, but I'm not sure how useful it would be, since sentence-transformers is probably the most commonly used library for such things. It would also introduce an external dependency, so I'd like to hear your opinion on this.

Could you give an example of how this might work?

ir2718 commented 5 days ago

> Would the dataset classes download the datasets? Are those datasets readily available for download these days?

Ideally, yes. I would like the classes to mimic PyTorch's dataset interface, since it is familiar. This would mean you can specify the root, split, and download arguments (and possibly something else, in case I missed it). I've already implemented Cars196 and CUB on my fork, so you can have a look at what I had in mind: https://github.com/ir2718/pytorch-metric-learning/tree/dataset. If you think this is a step in the right direction, do say so.
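For illustration only, a dataset class with that kind of constructor could look roughly like the sketch below. It is not the code on the fork: the class name `FolderRetrievalDataset`, the `root/<split>/<class_name>/<file>` directory layout, and the stubbed `_download` method are all assumptions made for the example.

```python
import os

try:
    from torch.utils.data import Dataset
except ImportError:  # lets the sketch run where PyTorch is not installed
    Dataset = object


class FolderRetrievalDataset(Dataset):
    """Hypothetical dataset taking root, split, and download arguments.

    Assumes files are laid out as root/<split>/<class_name>/<file>.
    """

    SPLITS = ("train", "test")

    def __init__(self, root, split="train", transform=None, download=False):
        if split not in self.SPLITS:
            raise ValueError(f"split must be one of {self.SPLITS}, got {split!r}")
        self.root = root
        self.split = split
        self.transform = transform
        if download:
            self._download()

        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise RuntimeError(f"{split_dir} not found; the data must be placed there first")

        # One subfolder per class; sorted folder order defines the integer labels.
        self.classes = sorted(
            d for d in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, d))
        )
        self.samples = [
            (os.path.join(split_dir, name, fname), label)
            for label, name in enumerate(self.classes)
            for fname in sorted(os.listdir(os.path.join(split_dir, name)))
        ]

    def _download(self):
        # Placeholder: fetching and extracting the archive is dataset-specific.
        raise NotImplementedError("download is omitted from this sketch")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        # A real class would load the image here; the sketch returns the path.
        item = path if self.transform is None else self.transform(path)
        return item, label
```

Because it implements `__len__` and `__getitem__`, an instance such as `FolderRetrievalDataset("data", split="train")` can be passed to a `DataLoader` like any other map-style dataset.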

> Could you give an example of how this might work?

I haven't given it that much thought, but for the sake of example, maybe a function that generates a PyTorch dataset from a given Hugging Face dataset name, input column, and output column.
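As a rough sketch of that idea, assuming the Hugging Face `datasets` package is installed: the names `ColumnPairDataset` and `dataset_from_hub` are hypothetical and do not exist in either library. The only external call used is `datasets.load_dataset`, which is imported lazily so the dependency stays optional.

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # lets the sketch run where PyTorch is not installed
    Dataset = object


class ColumnPairDataset(Dataset):
    """Hypothetical wrapper exposing (input, output) pairs from row dicts."""

    def __init__(self, rows, input_column, output_column):
        # rows: any indexable collection whose items behave like dicts,
        # e.g. a Hugging Face dataset split or a plain list of dicts.
        self.rows = rows
        self.input_column = input_column
        self.output_column = output_column

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        return row[self.input_column], row[self.output_column]


def dataset_from_hub(name, input_column, output_column, split="train"):
    """Hypothetical helper: load a Hub dataset split and wrap two columns."""
    from datasets import load_dataset  # optional dependency, imported lazily

    hf_split = load_dataset(name, split=split)
    return ColumnPairDataset(hf_split, input_column, output_column)
```

The wrapper does not depend on `datasets` itself, so it can be tried on a plain list such as `ColumnPairDataset([{"text": "a", "label": 0}], "text", "label")` before any Hub dataset is involved.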

KevinMusgrave commented 4 days ago

> Ideally, yes. I would like the classes to mimic PyTorch's dataset interface, since it is familiar. This would mean you can specify the root, split, and download arguments (and possibly something else, in case I missed it). I've already implemented Cars196 and CUB on my fork, so you can have a look at what I had in mind: https://github.com/ir2718/pytorch-metric-learning/tree/dataset. If you think this is a step in the right direction, do say so.

Looks good! I don't think there's any harm in adding them, and I think some people will find it convenient. Feel free to open a PR for those dataset classes.

> I haven't given it that much thought, but for the sake of example, maybe a function that generates a PyTorch dataset from a given Hugging Face dataset name, input column, and output column.

Hmm, I don't have any thoughts on this now. We can keep this discussion open though.