Closed JasonLo closed 3 months ago
Running Location to Mineral extraction pipeline on full dataset on CHTC, ETA: June 13.
Some jobs failed with an NCCL error. Maybe it only works on A100-80GB GPUs? (The previous batch worked with the same setup, so I'm unsure why this issue occurred.) Resubmitted with an additional restriction to A100-80GB GPUs.
Hmm when was the previous batch run? And what is the error? We rolled out a small config change to how docker starts up ~2 weeks ago. I'd be shocked if that was the culprit but it's a bit suspicious...
The previous run was roughly 6 weeks ago?
The error involves the NCCL driver. It seems more related to vllm than CHTC, or possibly their interaction. Using NCCL with a single GPU job doesn't make sense.
Yeah some unexpected configuration or interaction in the CHTC setting is my main concern. It doesn’t really make much sense but these kinds of things have surprised me in the past.
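One way to narrow this down might be to log the GPU-related environment at the start of each job and compare a failing job against a working one. The snippet below is only a rough sketch: the helper names `gpu_env_snapshot` and `visible_gpu_count` are made up for illustration, and the choice of prefixes (`NCCL_`, `CUDA_`, `NVIDIA_`) is an assumption about which variables matter here, not something confirmed from the failing jobs.

```python
import os

# Assumed set of prefixes worth logging; adjust as needed.
PREFIXES = ("NCCL_", "CUDA_", "NVIDIA_")


def gpu_env_snapshot(environ=None):
    """Return the GPU-related environment variables, sorted by name."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in sorted(env) if name.startswith(PREFIXES)}


def visible_gpu_count(environ=None):
    """Count the entries in CUDA_VISIBLE_DEVICES, or return None if it is unset."""
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return len([dev for dev in raw.split(",") if dev.strip()])


if __name__ == "__main__":
    # Print a snapshot to the job's stdout so runs can be compared side by side.
    for name, value in gpu_env_snapshot().items():
        print(f"{name}={value}")
    print("visible GPUs:", visible_gpu_count())
```

If a failing job reports more than one visible GPU, or has NCCL variables set that the working jobs lack, that would point toward the execution environment rather than the pipeline code.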
Preprocess completed.
Run extraction pipeline on CHTC