torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary:

AkariAsai / self-rag

This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.

https://selfrag.github.io/

MIT License

1.59k stars, 140 forks

torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary: #83

Open zhongruizhe123 opened 1 week ago

zhongruizhe123 commented 1 week ago

I encountered the following error while training on a single GPU:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 14447) of binary:

I tried setting the training parameter --nproc_per_node=1, but the same error came back and only local_rank changed:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary:

zhongruizhe123 commented 1 week ago

I found the problem: there was not enough memory.
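For anyone reading the error above: a negative exit code reported for a worker usually means the process was terminated by a signal, and 9 is SIGKILL on Linux. A kill like that is consistent with the operating system stopping the process when memory ran out. The sketch below only decodes the number. It is not part of self-rag or PyTorch, and the helper name `describe_exit_code` is made up for illustration.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a worker exit code into a short human-readable description.

    Hypothetical helper for illustration. It assumes the POSIX convention
    that a negative exit code -N means "terminated by signal N".
    """
    if exitcode < 0:
        try:
            # Map the signal number back to its symbolic name.
            name = signal.Signals(-exitcode).name
        except ValueError:
            # The number is not a signal known on this platform.
            name = f"unknown signal {-exitcode}"
        return f"killed by signal {name}"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The exit code from the error message in this issue.
    print(describe_exit_code(-9))
```

If the result is SIGKILL, the kernel log (for example via `dmesg`) is a reasonable place to look for an out-of-memory kill record before changing any training flags.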
