Closed erjui closed 4 years ago
A 'segment related error' is usually a memory-related error. When you type nsml run, allocate more memory with the --memory option, or change the code so that it uses less memory (for example, by reducing the batch size).
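To make the batch-size suggestion concrete, here is a rough sketch of how the memory for one input batch scales with batch size. The 3x224x224 float32 image shape and the helper name batch_input_bytes are assumptions for illustration only, not taken from the baseline code, and the figure covers inputs only (activations and gradients usually need far more).

```python
def batch_input_bytes(batch_size, channels, height, width, bytes_per_value=4):
    """Bytes needed to hold one batch of images as float32 (inputs only)."""
    return batch_size * channels * height * width * bytes_per_value


# Assumed image shape: 3 x 224 x 224 (hypothetical, for illustration).
for bs in (128, 64, 32):
    mib = batch_input_bytes(bs, 3, 224, 224) / 2**20
    print(f"batch_size={bs}: {mib:.1f} MiB for the input tensor")
```

Halving the batch size halves this number, which is why reducing it is the usual first thing to try when memory is the suspect.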
Hmm, I am not sure, but I don't think the problem is running out of memory. To check this, I slightly modified the source code so that it runs only 1 epoch with batch size 128, and I added a print message after the main function ends, as follows.
But the result was a segmentation fault again, as follows.
Since the "the program ends" log appeared in the terminal, I cautiously guess that my own code ran successfully.
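One way to test this guess is Python's standard faulthandler module, which prints a Python traceback when the process receives a fatal signal such as SIGSEGV. The sketch below is only an illustration: main() is a hypothetical stand-in for the real training entry point, and run() is a name introduced here. If the crash happens during interpreter shutdown, after the final print, that would support the idea that the training code itself finished.

```python
import faulthandler
import sys


def main():
    """Hypothetical stand-in for the real training entry point."""
    return "trained"


def run():
    # Dump a traceback for every thread if the process hits a fatal
    # signal (for example SIGSEGV), written to the original stderr.
    faulthandler.enable(file=sys.__stderr__, all_threads=True)
    result = main()
    # flush=True so the message is not lost if the crash comes right after.
    print("the program ends", flush=True)
    return result


if __name__ == "__main__":
    run()
```

The same effect is available without code changes by starting the interpreter with python -X faulthandler.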
In the end, the model was saved successfully, as the message shows, even though the segmentation fault happened. So I also tried to submit the model to the leaderboard, and a segmentation fault happened there too, as follows.
Even testing with the successfully saved model failed with a segmentation fault...
Thanks in advance!
It looks like the problem comes from using an older version of Python or Docker. Please refer to here and replace the default Docker image with nvcr.io/nvidia/pytorch:20.03-py3, then try again.
Hi, sorry to ask such a simple question, but I'm having a hard time even running the baseline code. I ran the baseline code to train the model as follows.
nsml -d fashion_eval -e main.py
And I got the following terminal log as a result.
First, I want to ask why the segmentation fault happened even when running the baseline code without any modification. (I tried to fix it on my own but couldn't find the cause, and I think it is better to ask here because many teams have already succeeded in running the baseline code!)
Despite the segmentation fault, the model itself was saved successfully, so I also tried to submit it with the following command.
nsml submit -t kaist_9/fashion_eval/2 Res18baseMM_best
But I also ran into the following error.
So I want to ask how to solve this problem.
Thanks in advance, and sorry for asking such a simple question.