LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

Did anyone train on their own dataset and get good performance? #76

Open MisakaMikoto96 opened 1 year ago

MisakaMikoto96 commented 1 year ago

Hi, I want to train on my own data, but train.py seems to depend on dataset-specific keys such as batch['__url__']: https://github.com/LAION-AI/CLAP/blob/main/src/laion_clap/training/train.py#L326

How can '__url__' be set for my own data? What does it stand for (a class label?)? Could you please give me an example?

Thanks for your time!

RetroCirce commented 1 year ago

@tianyu-z Please show how to construct the local data by using our webdataset loader.

tianyu-z commented 1 year ago

Hi @MisakaMikoto96, thanks for your interest in our work. To use your own dataset, please do NOT include remotedata in your training command arguments, and make sure that args.datasetpath points to the correct path. Your local data also needs to be packed as webdataset tars. Please pack each text and its corresponding audio into the tar as adjacent pairs; if you shuffle the order, data loading may become inconsistent. You may refer to these scripts (1, 2) to pack webdataset tars. Please let me know if you run into any further issues. Thank you!
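As a rough illustration of the pairwise packing described above (not the project's own packing scripts), here is a minimal sketch that uses only Python's standard tarfile module. The function names, the .flac/.json extensions, and the {"text": [...]} metadata layout are assumptions made for this example; check them against what your dataloader actually reads.

```python
import io
import json
import tarfile


def add_bytes(tar, name, payload):
    """Add an in-memory payload to an open tar under the given member name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def pack_pairs(pairs, tar_path):
    """Write (key, audio_bytes, caption) triples so each pair is adjacent in the tar."""
    with tarfile.open(tar_path, "w") as tar:
        for key, audio_bytes, caption in pairs:
            # Same basename, different extension: webdataset groups
            # consecutive members that share a basename into one sample.
            add_bytes(tar, f"{key}.flac", audio_bytes)
            # Assumed metadata layout; adapt it to your loader.
            meta = json.dumps({"text": [caption]}).encode("utf-8")
            add_bytes(tar, f"{key}.json", meta)


def list_members(tar_path):
    """Return member names in the order they were written."""
    with tarfile.open(tar_path) as tar:
        return tar.getnames()
```

Calling list_members on the resulting tar is a quick way to confirm that every audio file is immediately followed by its own text file.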

lukewys commented 1 year ago

Hi @MisakaMikoto96, thanks for your message. '__url__' is the path to the tar that the current sample comes from. The key comes from the webdataset format, so if you pack your data in webdataset format, each sample will also carry a '__url__' pointing to the path of its tar. If you want to use another dataset format, unfortunately you will need to write your own dataloader and either replace those keys in the training scripts or create the same key in your dataloader.
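For the last option, creating the same key in your own dataloader, a minimal sketch could look like the following. The function name collate_with_url and the sample keys in the usage note are hypothetical; the only point is that every batch ends up with a '__url__' list holding one source path per sample.

```python
def collate_with_url(samples, source_path):
    """Merge per-sample dicts into one batch dict and attach a '__url__' list.

    samples: list of dicts that all share the same keys.
    source_path: path recorded as the origin of every sample in this batch.
    """
    # Turn a list of dicts into a dict of lists, one list per key.
    batch = {key: [sample[key] for sample in samples] for key in samples[0]}
    # Mimic the webdataset convention: one source path per sample.
    batch["__url__"] = [source_path] * len(samples)
    return batch
```

For example, collate_with_url([{"text": "a dog barks"}, {"text": "rain falls"}], "/data/my_tts/train/0.tar") returns a batch whose '__url__' entry has two identical paths.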

As for reproducing the model's performance, we admit it is a bit hard to reproduce the exact results, since we cannot provide the raw data (because of copyright issues). That said, we provide the Clotho dataset in webdataset format, the training script for the Clotho dataset alone, and a single-GPU training log example for reference. For details, please see https://github.com/LAION-AI/CLAP#reproducibility.

MisakaMikoto96 commented 1 year ago

Hi @lukewys, thanks for your reply. I rewrote the dataset and dataloader, so I only want to know: does 'all_names' here just mean the basename of each utterance?

all_names = list(set(["-".join(b.split("/")[-3:-1]) for b in batch['__url__']]))

For example, all_names = ['tts_0001', 'tts_0002', ...]

lukewys commented 1 year ago

Well, for us, all_names holds the names of all datasets that the current batch comes from. During evaluation the dataloader merges different datasets together, so one batch can contain samples from two or more datasets. We need to calculate metrics within each dataset, so we need to know which datasets the current batch contains.
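To make this concrete, here is a small sketch of how the expression quoted earlier in the thread turns tar paths into dataset names, and how a batch could then be split per dataset. The paths and the dataset_name helper are made up for illustration; the name is simply the two directory levels above the tar file, joined with a hyphen.

```python
from collections import defaultdict


def dataset_name(url):
    """Join the two directory levels above the tar file, e.g. 'Clotho-test'."""
    return "-".join(url.split("/")[-3:-1])


# Hypothetical batch that mixes samples from two evaluation datasets.
batch = {
    "__url__": [
        "/data/Clotho/test/0.tar",
        "/data/Clotho/test/1.tar",
        "/data/audiocaps/test/0.tar",
    ]
}

# Unique dataset names in this batch (sorted only to make the order stable).
all_names = sorted(set(dataset_name(url) for url in batch["__url__"]))

# Sample indices grouped by dataset, so metrics can be computed per dataset.
indices_by_dataset = defaultdict(list)
for index, url in enumerate(batch["__url__"]):
    indices_by_dataset[dataset_name(url)].append(index)
```

Here all_names is ['Clotho-test', 'audiocaps-test'], and indices_by_dataset maps 'Clotho-test' to [0, 1] and 'audiocaps-test' to [2].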

MisakaMikoto96 commented 1 year ago

Okay. If I only have one dataset, that means len(all_names) == 1, right?

lukewys commented 1 year ago

Yes, that is correct.

MisakaMikoto96 commented 1 year ago

@lukewys Hi, could you please take a look at this issue about a problem I ran into: https://github.com/LAION-AI/CLAP/issues/98