AmritaBh / ConDA-gen-text-detection

Code for the paper: ConDA: Contrastive Domain Adaptation for AI-generated Text Detection
MIT License

Different Batch Sizes for src_loader and tgt_loader #2

Open gymbeijing opened 9 months ago

gymbeijing commented 9 months ago

Hi Amrita, thank you for the great work!

I was trying to apply the model in this repo to my own dataset, but while running the training code I encountered an issue:

The source training data contains 29080 items, while the target training data contains 3832 items, and I set batch_size to 256. Since neither dataset size is a multiple of 256, the final batch of each loader is smaller than 256, and the two final batches have different sizes (for tgt_loader, the remainder is 248). The negatives_mask in SimCLRContrastiveLoss is built for a fixed batch size, so it is incompatible with batches of different sizes (e.g. 256 in a source batch vs. 248 in a target batch), and this causes an error in denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature). Can I ask how you tackled this issue? Thanks!
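For reference, the mismatch follows directly from the dataset sizes quoted above. A quick sketch of the arithmetic (pure Python, no repo code involved):

```python
import math

batch_size = 256
n_src, n_tgt = 29080, 3832  # dataset sizes from this issue

# Batches per epoch for each loader, and the size of each final partial batch.
src_batches = math.ceil(n_src / batch_size)
tgt_batches = math.ceil(n_tgt / batch_size)
src_last = n_src % batch_size or batch_size
tgt_last = n_tgt % batch_size or batch_size

print(src_batches, tgt_batches)  # 114 15
print(src_last, tgt_last)        # 152 248
```

So the loaders disagree both on the number of batches per epoch (114 vs. 15) and on the size of the final partial batch (152 vs. 248), which is why a negatives_mask pre-built for a fixed batch size of 256 cannot be applied uniformly.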

AmritaBh commented 9 months ago

Hi, thank you for using our work! Lines 263 to 269 in contrast_training_with_da.py handle this issue: https://github.com/AmritaBh/ConDA-gen-text-detection/blob/676aaf313ea9aec756a2474f6c33a22f4f1f2c1f/contrast_training_with_da.py#L263

Your case would be handled by this line: https://github.com/AmritaBh/ConDA-gen-text-detection/blob/676aaf313ea9aec756a2474f6c33a22f4f1f2c1f/contrast_training_with_da.py#L269

Basically, we re-iterate over the dataset with the smaller size. Let me know if this helps or if you have any other issues.
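In case it helps other readers: the re-iteration pattern described above can be sketched as a small generator that restarts the smaller loader whenever it runs out. This is an illustrative helper (the name `paired_batches` is hypothetical, and plain iterables stand in for DataLoaders); the actual implementation is at the linked lines in contrast_training_with_da.py.

```python
def paired_batches(src_loader, tgt_loader):
    """Yield (src_batch, tgt_batch) pairs for every source batch,
    re-iterating over the (smaller) target loader when it is exhausted.
    Hypothetical sketch; see contrast_training_with_da.py for the real logic."""
    tgt_iter = iter(tgt_loader)
    for src_batch in src_loader:
        try:
            tgt_batch = next(tgt_iter)
        except StopIteration:
            # Target loader exhausted: start a fresh pass over it.
            tgt_iter = iter(tgt_loader)
            tgt_batch = next(tgt_iter)
        yield src_batch, tgt_batch

# Toy usage: 5 source batches, 2 target batches.
pairs = list(paired_batches([0, 1, 2, 3, 4], ["a", "b"]))
print(pairs)  # [(0, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (4, 'a')]
```

Note that re-iterating keeps the two streams in lockstep per step, but on its own it does not force the two batches at a given step to have equal sizes; with drop_last=True on the DataLoaders (or by skipping/truncating mismatched final batches) every step sees matching batch sizes, which is what the fixed-size negatives_mask requires.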