🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
Hi, it seems that the `pop` and `delete_tensors_from_dict` operations cause some samples to go missing. What are these operations for?
The operations include:
I hit an empty `net_input` at the tail of an epoch during training. But the logic in collate_fn suggests that it should contain at least `input_ids` and `attention_masks`. I have not found why the `net_input` is empty. However, when I remove delete_tensors_from_dict in my debugging script (which skips the forward/backward pass and keeps only data loading, so the error is raised sooner), the bug disappears.
It is strange.

Additionally, the error that actually surfaces is an NCCL timeout, `...[Rank x] Watchdog caught collective operation timeout: WorkNCCL...`. I found the true error by adding this:

Maybe cycle(dataloader) makes some samples appear again, but their values were already deleted by the function?
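To make the suspicion about cycle(dataloader) concrete, here is a minimal sketch of the effect. It assumes `cycle` is `itertools.cycle`, which keeps references to the items from the first pass and yields those same objects again on later passes. The names `make_batches`, `delete_keys_in_place`, `restart_each_epoch`, and `run` are hypothetical stand-ins, not the repository's code, and plain lists replace tensors so it runs without PyTorch.

```python
from itertools import cycle


def make_batches():
    # Hypothetical stand-in for one pass over a dataloader: a finite list of dict batches.
    return [
        {"input_ids": [1, 2], "attention_mask": [1, 1]},
        {"input_ids": [3], "attention_mask": [1]},
    ]


def delete_keys_in_place(batch):
    # Hypothetical stand-in for delete_tensors_from_dict: empties the dict in place.
    for key in list(batch.keys()):
        del batch[key]


def restart_each_epoch(make_iterable):
    # Alternative to itertools.cycle: build a fresh iterable for every pass,
    # so no previously mutated batch object is ever yielded again.
    while True:
        yield from make_iterable()


def run(stream, steps):
    # Record which keys each batch has when it arrives, then clean it up in place.
    seen = []
    for _, batch in zip(range(steps), stream):
        seen.append(sorted(batch.keys()))
        delete_keys_in_place(batch)
    return seen


cached = run(cycle(make_batches()), steps=4)
fresh = run(restart_each_epoch(make_batches), steps=4)
print(cached)  # the second pass replays the emptied dicts
print(fresh)   # every batch arrives with its keys
```

If this is what happens in training, the first empty batch would show up exactly when the second pass over the data begins, and a rank failing there could leave the other ranks waiting in a collective until the NCCL watchdog times out. Restarting the iterator each epoch, or cleaning up a copy instead of the yielded dict, would avoid it under these assumptions.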