ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

sampling in-batch negatives #30

Closed raghavlite closed 2 months ago

raghavlite commented 2 months ago

In the paper, you mention that in-batch negatives are sampled from examples of the same dataset. In training/run.py, you concatenate all datasets here.

Is there any other place in the code where you specify that in-batch negatives must come from the same dataset?

Muennighoff commented 2 months ago

Note that the lengths of each dataset are saved right above.

They are then used to do the sampling in this class: https://github.com/ContextualAI/gritlm/blob/a122855d6578a4f0980ea20340d5c9e1dd59d8c4/gritlm/training/data.py#L284
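For intuition, here is a minimal, hypothetical sketch of the idea (not the actual gritlm code): given the per-dataset lengths recorded for the concatenated dataset, a batch sampler can restrict every batch to indices from a single dataset, so in-batch negatives always come from the same source. The class and variable names (`SameDatasetBatchSampler`, `dataset_lengths`, `batch_size`) are illustrative assumptions.

```python
import random
from typing import Iterator, List

class SameDatasetBatchSampler:
    """Yields batches of indices where every index belongs to the same
    underlying dataset, so in-batch negatives share a data source."""

    def __init__(self, dataset_lengths: List[int], batch_size: int, seed: int = 0):
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        # Recover each dataset's contiguous index range inside the concatenated dataset.
        self.index_groups = []
        start = 0
        for length in dataset_lengths:
            self.index_groups.append(list(range(start, start + length)))
            start += length

    def __iter__(self) -> Iterator[List[int]]:
        batches = []
        for group in self.index_groups:
            indices = group[:]
            self.rng.shuffle(indices)
            # Drop the trailing partial batch so every batch is full and homogeneous.
            for i in range(0, len(indices) - self.batch_size + 1, self.batch_size):
                batches.append(indices[i : i + self.batch_size])
        # Shuffle the batch order so datasets are interleaved across training steps.
        self.rng.shuffle(batches)
        return iter(batches)

    def __len__(self) -> int:
        return sum(len(g) // self.batch_size for g in self.index_groups)
```

Such a sampler would be passed as `batch_sampler=` to a `torch.utils.data.DataLoader` over the concatenated dataset; the linked class in gritlm/training/data.py is where the real implementation lives.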

raghavlite commented 2 months ago

thanks