question about the training data

breezedeus / Coin-CLIP

Coin-CLIP is a CLIP model fine-tuned on a large collection of coin images using contrastive learning. It improves feature extraction for coins, which boosts image search accuracy. The model combines a Vision Transformer (ViT) with CLIP's multimodal learning and is optimized for numismatic applications.

https://huggingface.co/spaces/breezedeus/USA-Coin-Retrieval

Apache License 2.0


question about the training data #2

Open TianjinTeda opened 1 month ago

TianjinTeda commented 1 month ago

Hi,

Thanks for this amazing work, really appreciate it!

I am wondering what your training data looks like. My understanding is that you have a coin dataset containing only coin images, without text; that for each type of coin you have images that differ in perspective, lighting conditions, etc.; and that you used this dataset to conduct contrastive learning on the visual encoder of the pretrained CLIP model. Is that correct?

breezedeus commented 1 month ago

Thanks. The training uses image-text pairs, and the text is the description of the contents of the coin.
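
For readers following the thread, here is a minimal sketch of how a CLIP-style contrastive objective uses image-text pairs. It is not the Coin-CLIP training code, which is not shown in this issue. The function names, the toy 2-D embeddings, and the temperature value are assumptions for illustration only; in real training the embeddings would come from the image and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    Pair i consists of image_embs[i] and text_embs[i]; every other
    combination in the batch is treated as a negative.
    The temperature of 0.07 is an assumed value, not taken from Coin-CLIP.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch of two pairs (hypothetical embeddings, not real model outputs).
image_embs = [[1.0, 0.1], [0.1, 1.0]]
text_embs_matched = [[0.9, 0.2], [0.2, 0.9]]   # text i describes image i
text_embs_swapped = [[0.2, 0.9], [0.9, 0.2]]   # descriptions assigned to the wrong images

loss_matched = clip_contrastive_loss(image_embs, text_embs_matched)
loss_swapped = clip_contrastive_loss(image_embs, text_embs_swapped)
print(loss_matched, loss_swapped)
```

The loss is lower when each image embedding is closest to the embedding of its own description, which is the signal that pulls matching image-text pairs together during fine-tuning.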

youthhou1992 commented 1 week ago

> Thanks. The training uses image-text pairs, and the text is the description of the contents of the coin.

Hello, your work is excellent. I would like to do some fine-tuning based on it. Could you please share how you obtained the text for the image-text pairs?