Closed. nkwook closed this issue 4 years ago.
There is a 1-hour timeout for inference on the test dataset.
If inference does not finish within an hour, the above error occurs.
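If you are not sure whether a checkpoint fits in that budget, one rough check is to time inference on a small sample locally and extrapolate (an illustrative sketch only; predict_fn and samples below are placeholders, not part of the evaluation code):

import time

def estimate_total_inference_time(predict_fn, samples, total_examples):
    # Time predict_fn on a small sample and extrapolate to the full test set.
    start = time.time()
    for example in samples:
        predict_fn(example)
    per_example = (time.time() - start) / len(samples)
    return per_example * total_examples  # seconds; should stay well under 3600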
Yeah, we realized that inference didn't finish.
By the way, when submitting the same model,
some teammates fail while others can submit it successfully.
So we're confused that the submission ends before inference even starts.
Have there been any similar cases before this one?
Thank you.
I have also met this problem for two days. It occurs frequently, and I could not submit any checkpoints yesterday morning or this morning.
Please tell us the session names or commands of the submissions that succeeded, and of those that failed.
I just tried "nsml submit kaist_2/korquad-open-ldbd/92 MIP_v5_best_2", and it has not given any response for 15 minutes. The log is:
nsml submit kaist_2/korquad-open-ldbd/92 MIP_v5_best_2
.......
Building docker image. It may take a while
.........
nsml submit kaist_2/korquad-open-ldbd/53 MIP_v1_gs51000_e4
.......
Building docker image. It may take a while
.........
Could you please solve this kind of problem?
nsml submit kaist_2/korquad-open-ldbd/157 MIP
was killed because of an OOM problem.
nsml submit kaist_2/korquad-open-ldbd/92 MIP_v5_best_0
was killed because the session was interrupted during evaluation (maybe you sent an interrupt signal, Ctrl+C).
I also found some sessions that seemed to stall during submit. I suspect it is a device or server problem; I don't know the exact cause yet, but I am looking into it. Thank you.
For some sessions I had to send the interrupt signal because they did not give any response for a long time and would have been killed after the timeout anyway.
Thank you!
My teammate and I have had the same problem in recent days, including very slow NSML homepage access. We hope this will be fixed soon. Thank you in advance.
Hello, we're also facing the same issue: our submit session cannot load the model for a long time. We used the same nsml load/infer functions as in the baseline (roughly the pattern shown below), and the model is saved successfully during training, but the submitted session fails to load it. What could be the possible problems?
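For reference, this is roughly the pattern we follow (a simplified sketch assuming the usual nsml.bind(save/load/infer) interface; the bodies and names are illustrative, not our exact code):

import os
import torch
import nsml

def bind_model(model):
    def save(dir_name, *args, **kwargs):
        # called by nsml.save() during training
        torch.save(model.state_dict(), os.path.join(dir_name, 'model.pt'))

    def load(dir_name, *args, **kwargs):
        # called when the submitted session restores the checkpoint
        state = torch.load(os.path.join(dir_name, 'model.pt'), map_location='cpu')
        model.load_state_dict(state)

    def infer(raw_data, **kwargs):
        # called during evaluation; should return predictions for raw_data
        model.eval()
        with torch.no_grad():
            return model(raw_data)  # real code builds features from raw_data first

    nsml.bind(save=save, load=load, infer=infer)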
nsml submit kaist_15/korquad-open-ldbd/310 electra_best
.......
Building docker image. It may take a while
.........
We also get this message, and loading the model never completes.
The slow website access was fixed this morning. We apologize for any inconvenience.
nsml submit kaist_1/korquad-open-ldbd/285 best
.......
Building docker image. It may take a while
.........
We also have the same problem while submitting.
I ran into a new problem. When I tried to submit a session that I had already submitted successfully before, I got this error:
Building docker image. It may take a while
.......Error: Session error: "An error occurred somewhere in your code. You can check the error with 'nsml submit --test'."
FATA[2020/06/13 22:33:40.887] Internal server error
Then I tried to re-submit with the --test option, and I saw this error:
AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'
I also tried to submit another session, and the problem still occurs. It only started happening today.
The problem occurs even when I try with the baseline code:
nsml submit kaist_2/korquad-open-ldbd/2 bert_best
Now I get the same error in training too.
Traceback (most recent call last):
File "run_squad_electra.py", line 85, in <module>
(),
File "run_squad_electra.py", line 84, in <genexpr>
(tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, RobertaConfig, XLNetConfig, XLMConfig)),
AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'
User session exited
The errors related to submit were fixed yesterday.
As part of that fix, all newly started sessions install the latest versions of the packages rather than reusing the existing ones. This in turn caused a new bug, because the code is not compatible with the newer packages, especially transformers.
You can fix this bug by pinning the package versions in setup.py (or requirements.txt), for example:
# nsml: nsml/ml:cuda10.1-cudnn7-pytorch1.3keras2.3
from distutils.core import setup

setup(
    name='kaist-korquad-test',
    version='1.0',
    install_requires=[
        'boto3', 'regex', 'sacremoses', 'filelock', 'tokenizers',
        'tqdm', 'konlpy', 'sentencepiece', 'dataclasses',
        # pin transformers so new sessions match the version the code was written for
        'transformers==2.10.0'
    ]
)
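If you cannot pin the versions for some reason, a code-level guard should also avoid this particular crash, since pretrained_config_archive_map appears to have been removed from the config classes in newer transformers releases (a sketch only, not tested against the baseline; the variable name ALL_MODELS is the usual one in run_squad-style scripts and may differ in your code):

from transformers import BertConfig, RobertaConfig, XLNetConfig, XLMConfig

# Falls back to an empty dict for config classes that no longer expose
# pretrained_config_archive_map, instead of raising AttributeError.
ALL_MODELS = sum(
    (tuple(getattr(conf, "pretrained_config_archive_map", {}).keys())
     for conf in (BertConfig, RobertaConfig, XLNetConfig, XLMConfig)),
    (),
)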
Sorry for the inconvenience. Thank you.
Hello, we are trying to submit our model for the NLP task.
Our teammates keep running into the model-loading problem in NSML for all models.
Here is the error screen.
Thank you for your kind response!