Model load error while submission

nkwook commented 4 years ago

Hello, we are trying to submit our model for the NLP task.

Our teammates keep running into model loading problems in nsml, for all models.

Here is the error screen.

Thank you for your kind response!

nsml-admin commented 4 years ago

There is a 1 hour timeout for inference of the test dataset.

If the inference does not finish within an hour, the above error will occur.
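To judge ahead of time whether a model is likely to hit this limit, one option is to time inference on a few batches locally and extrapolate. The following is only a rough sketch: `infer_fn`, `sample_batches`, `total_batches`, and the safety factor are hypothetical names and values, not part of nsml, and the real test set size is not known from this thread.

```python
import time

TIMEOUT_SECONDS = 60 * 60  # the 1 hour limit stated above


def estimate_total_seconds(infer_fn, sample_batches, total_batches):
    """Time `infer_fn` on a few batches and extrapolate to the full test set."""
    if not sample_batches:
        raise ValueError("need at least one sample batch")
    start = time.perf_counter()
    for batch in sample_batches:
        infer_fn(batch)
    elapsed = time.perf_counter() - start
    # Average seconds per batch, scaled up to the assumed number of batches.
    return elapsed / len(sample_batches) * total_batches


def fits_in_timeout(estimated_seconds, safety_factor=1.5):
    """Return True if the padded estimate stays under the 1 hour limit."""
    return estimated_seconds * safety_factor < TIMEOUT_SECONDS
```

The safety factor is there because the submission environment may be slower than the machine used for timing.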

nkwook commented 4 years ago

Yeah, we realized that inference didn't finish.

By the way, when submitting the same model, some teammates fail while others can submit it.

So we're confused that the submission ends before inference even starts.

Have there been any similar cases before this one?

Thank you.

trtrung commented 4 years ago

I have also had this problem for two days. It occurs often, and I could not submit any checkpoints this morning or yesterday morning.

nsml-admin commented 4 years ago

Please tell us the session names or commands for the submissions that succeeded and for those that failed.

trtrung commented 4 years ago

nsml submit kaist_2/korquad-open-ldbd/157 MIP: this command failed yesterday evening. I then re-submitted it and it succeeded.
nsml submit kaist_2/korquad-open-ldbd/92 MIP_v5_best_0: this command failed this morning.
nsml submit kaist_2/korquad-open-ldbd/68 MIP_v5_last: this command also failed at first, but it succeeded one hour ago.

trtrung commented 4 years ago

I just tried "nsml submit kaist_2/korquad-open-ldbd/92 MIP_v5_best_2", and it has not given any response for 15 minutes. The log is:

nsml submit kaist_2/korquad-open-ldbd/92 MIP_v5_best_2
.......
Building docker image. It may take a while
.........

trtrung commented 4 years ago

nsml submit kaist_2/korquad-open-ldbd/53 MIP_v1_gs51000_e4
.......
Building docker image. It may take a while
.........

Could you please solve this kind of problem?

nsml-admin commented 4 years ago

nsml submit kaist_2/korquad-open-ldbd/157 MIP was killed because of an OOM (out of memory) problem.
nsml submit kaist_2/korquad-open-ldbd/92 MIP_v5_best_0 was killed because the session was interrupted while evaluating (maybe you sent an interrupt signal with Ctrl+C).

I also found some sessions that seem to have stopped during submit. I guess this is due to a device or server problem. I don't know the exact cause yet, but I am looking into it. Thank you.

trtrung commented 4 years ago

There are some sessions where I have to send the interrupt signal because they do not respond for a long time and will be killed after the timeout.

Thank you!

tmddus49 commented 4 years ago

My teammate and I have had the same problem in recent days, along with very slow access to the nsml homepage. We hope this will be fixed soon. Thank you in advance.

komfkore commented 4 years ago

Hello, we're also facing the same issue: our submit session cannot load the model for a long time. We used the same nsml load/infer functions as in the baseline, and the model is saved successfully during training, but the submitted session fails to load it for a long time. What could be the possible causes?

komfkore commented 4 years ago

nsml submit kaist_15/korquad-open-ldbd/310 electra_best
.......
Building docker image. It may take a while
.........

We also get this message, and model loading never completes.

bluebrush commented 4 years ago

> My teammate and I have had the same problem in recent days, along with very slow access to the nsml homepage. We hope this will be fixed soon. Thank you in advance.

The website slowness was fixed this morning. We apologize for any inconvenience.

TaeryungLee commented 4 years ago

nsml submit kaist_1/korquad-open-ldbd/285 best
.......
Building docker image. It may take a while
.........

We also have the same problem while submitting.

trtrung commented 4 years ago

I ran into a new problem. When I tried to submit a session that I had already submitted successfully before, I got this error:

Building docker image. It may take a while
.......Error: Session error: "An error occurred somewhere in your code. You can check the error with 'nsml submit --test'."
FATA[2020/06/13 22:33:40.887] Internal server error

Then I re-submitted with the --test option and saw this error:

AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'

I also tried submitting another session and the problem still occurs. It only started today, and it occurs even with the baseline code: nsml submit kaist_2/korquad-open-ldbd/2 bert_best

tmddus49 commented 4 years ago

Now I get the same error in training too.

Traceback (most recent call last):
  File "run_squad_electra.py", line 85, in <module>
    (),
  File "run_squad_electra.py", line 84, in <genexpr>
    (tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, RobertaConfig, XLNetConfig, XLMConfig)),
AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'
User session exited
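The traceback above shows the script reading `pretrained_config_archive_map` directly from each config class, which fails as soon as the installed library no longer provides that attribute. A possible defensive workaround, sketched here with stand-in classes rather than the real transformers configs (`OldStyleConfig`, `NewStyleConfig`, and `collect_model_names` are hypothetical names), is to fall back to an empty mapping when the attribute is missing:

```python
def collect_model_names(config_classes):
    """Gather pretrained model names from config classes that still expose
    `pretrained_config_archive_map`; classes without it contribute nothing."""
    names = []
    for conf in config_classes:
        # getattr with a default avoids the AttributeError seen in the traceback.
        archive_map = getattr(conf, "pretrained_config_archive_map", {})
        names.extend(archive_map.keys())
    return tuple(names)


# Stand-in classes so the sketch runs without transformers installed.
class OldStyleConfig:
    pretrained_config_archive_map = {"bert-base-uncased": "placeholder"}


class NewStyleConfig:
    pass
```

This only keeps the script from crashing at import time; whether the rest of the code works with a newer library version is a separate question.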

nsml-admin commented 4 years ago

The submit-related errors were fixed yesterday. As a side effect of that fix, all newly started sessions now install packages fresh and therefore download the latest versions instead of reusing the existing ones. This caused a new bug, because the code is not compatible with the newer packages, especially transformers.

You can fix this bug by pinning the package version in setup.py (or requirements.txt).

For example:

# nsml: nsml/ml:cuda10.1-cudnn7-pytorch1.3keras2.3

from distutils.core import setup

setup(
    name='kaist-korquad-test',
    version='1.0',
    install_requires=[
        'boto3', 'regex', 'sacremoses', 'filelock', 'tokenizers',
        'tqdm', 'konlpy', 'sentencepiece', 'dataclasses', 'transformers==2.10.0'
    ]
)
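After pinning, it can be useful to confirm at the start of a session that the installed version really matches the pin. The snippet below is a minimal sketch of such a check, not something nsml provides: `PINNED`, `installed_version`, and `find_mismatches` are hypothetical names, and `importlib.metadata` assumes Python 3.8 or newer (older interpreters would need the `importlib_metadata` backport).

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # older Pythons: assumes the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

PINNED = {"transformers": "2.10.0"}  # the pin from the setup.py above


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `find_mismatches(PINNED)` before training or inference and printing the result makes a silent package upgrade visible in the session log.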

Sorry for the inconvenience. Thank you.

Naver-AI-Hackathon / cs492I

Model load error while submission #34