chenzhengdeeplearning commented 3 years ago

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown. docker: Error response from daemon: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown.

VisualJoyce commented 3 years ago

This happens because your machine's NVIDIA driver is too old for the CUDA 11 image.

Please try this Docker image tag instead: visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel.

chenzhengdeeplearning commented 3 years ago

Sorry, how do I use visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel?

VisualJoyce commented 3 years ago

diff --git a/docker_train.sh b/docker_train.sh
index 3788e15..b21df38 100644
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@ docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \
   --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
   --mount src="$TXT_DB",dst=/txt,type=bind \
   -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-  -w /src visualjoyce/chengyubert:latest \
+  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \
   bash -c " PYTHONPATH=/src ${MODEL_PARA} ${HOROVOD_PARA} \\
     python train_${SUB_PROJECT}.py --config=$CONFIG_DIR/$CONFIG_FILE"

chenzhengdeeplearning commented 3 years ago

This part is in docker_train.sh:

docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \
   --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
   --mount src="$TXT_DB",dst=/txt,type=bind \
   -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-  -w /src visualjoyce/chengyubert:latest \
+  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \
   bash -c " PYTHONPATH=/src ${MODEL_PARA} ${HOROVOD_PARA} \\
     python train_${SUB_PROJECT}.py --config=$CONFIG_DIR/$CONFIG_FILE"

But what does this part mean?

diff --git a/docker_train.sh b/docker_train.sh
index 3788e15..b21df38 100644
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@

chenzhengdeeplearning commented 3 years ago

Before, it was:

  --mount src="${WORK_DIR}",dst=/src,type=bind \
  --mount src="$OUTPUT",dst=/storage,type=bind \
  --mount src="$PRETRAIN_DIR",dst=/pretrain,type=bind,readonly \
  --mount src=$ANNOTATION_DIR,dst=/annotations,type=bind,readonly \
  --mount src="$TXT_DB",dst=/txt,type=bind \

Now it is:

  --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
  --mount src="$TXT_DB",dst=/txt,type=bind \

Is my understanding right that three lines were deleted?

VisualJoyce commented 3 years ago

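A diff shows the edits made to a file: lines starting with - were removed, and lines starting with + were added. The other lines are unchanged context, so nothing else in the file was deleted.

My last post means that you need to change visualjoyce/chengyubert:latest to visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel in docker_train.sh.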
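To make the - and + markers concrete, here is a minimal sketch that uses Python's standard difflib module to generate a unified diff for that single line. The `old` and `new` lists are illustrative stand-ins for the file contents, not the full docker_train.sh.

```python
import difflib

# Illustrative one-line "files": only the image tag differs.
old = ["  -w /src visualjoyce/chengyubert:latest \\\n"]
new = ["  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \\\n"]

diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/docker_train.sh", tofile="b/docker_train.sh")
)

# Lines starting with "-" were removed and lines starting with "+" were added.
# The "---" and "+++" lines are file headers, not edits, so they are skipped.
removed = [l[1:] for l in diff_lines if l.startswith("-") and not l.startswith("---")]
added = [l[1:] for l in diff_lines if l.startswith("+") and not l.startswith("+++")]

print("".join(diff_lines), end="")
```

Here `removed` holds the line with the old tag and `added` holds the line with the new tag, which is exactly the one-line change the diff asks for.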

chenzhengdeeplearning commented 3 years ago

Thanks for your patience.

parser.add_argument("--model", default='paired',
                        choices=['snlive'],
                        help="choose from 2 model architecture")

What do paired and snlive mean?

Here is another error I got:

 File "train_official.py", line 304, in main
    raise ValueError(f"No such model [{opts.model}] supported!")
ValueError: No such model [paired] supported!

VisualJoyce commented 3 years ago

This is left over from copy-pasting earlier code. For now you can ignore the parameter, since it is overwritten by MODEL=chengyubert-dual on the command line.

I will fix that in the next version.
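As background on why the parser itself did not complain: argparse checks `choices` only for values given on the command line, not for the default. A minimal sketch that reuses the argument definition quoted above (the `build_parser` and `rejects` names are hypothetical, not from the repo):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Same shape as the quoted snippet: the default is not one of the choices.
    parser.add_argument("--model", default="paired",
                        choices=["snlive"],
                        help="choose from 2 model architecture")
    return parser


def rejects(argv) -> bool:
    """Return True if argparse refuses to parse argv."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:  # argparse reports the error and exits
        return True
    return False


# With no --model flag, the default passes through without a choices check.
opts = build_parser().parse_args([])
print(opts.model)
```

So `opts.model` silently ends up as the default, and the mismatch only surfaces later when the training script raises its own ValueError.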

chenzhengdeeplearning commented 3 years ago

Sorry that I have run into so many errors... Here is another one. I don't even know why it occurs, because there is only one process on my computer.

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted. mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.

VisualJoyce commented 3 years ago

That's OK. Thank you for pointing these out!

The code supports multiple GPUs. If you only have one GPU, use CUDA_VISIBLE_DEVICES=0.
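A quick way to see how many devices a given CUDA_VISIBLE_DEVICES value requests is to split it on commas. This is a small hypothetical helper for illustration, not part of the repo:

```python
import os


def visible_device_ids(value: str) -> list:
    """Split a CUDA_VISIBLE_DEVICES-style string into device ids."""
    return [part.strip() for part in value.split(",") if part.strip()]


# Read the variable from the current environment (empty if it is unset).
ids = visible_device_ids(os.environ.get("CUDA_VISIBLE_DEVICES", ""))
print(f"{len(ids)} GPU(s) requested: {ids}")
```

With CUDA_VISIBLE_DEVICES=0,1 this reports two devices; with CUDA_VISIBLE_DEVICES=0 it reports one.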

chenzhengdeeplearning commented 3 years ago

OK! And what is 'len_idiom_vocab'?

VisualJoyce commented 3 years ago

len_idiom_vocab is 3848 for the ChID dataset.

Our next paper, which supports more than 30k idioms, is under review.

So it's a parameter for future compatibility.

chenzhengdeeplearning commented 3 years ago

The mode I chose is 'train'.

Your code in train_official.py is this:

if opts.mode == 'train':
    # data loaders
    splits, dataloaders = create_dataloaders(LOGGER, DatasetCls, EvalDatasetCls,
                                             collate_fn, eval_collate_fn, opts, splits=['train', 'val'])
    best_ckpt = train(model, dataloaders, opts)
else:
    splits = []
    for k in dir(opts):
        if k.endswith('_txt_db'):

The error is:

AttributeError: 'Namespace' object has no attribute 'train_txt_db'

VisualJoyce commented 3 years ago

I can run the code on my machine with

CUDA_VISIBLE_DEVICES=0 CONFIG_FILE="train-official-bert-base-1gpu.json" \
bash docker_train.sh official \
"MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"

Can you post the command line?

chenzhengdeeplearning commented 3 years ago

This is mine:

CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE="train-official-bert-base-1gpu.json" \
bash docker_train.sh official \
"MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"

I also ran your command and got the same error... Is my data directory missing something? I don't have train_txt_db.

VisualJoyce commented 3 years ago

Have you done the preprocessing without error?
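One way to narrow this down is to list which `*_txt_db` options actually reached the parsed arguments and to fail with a clearer message when one is missing. A minimal sketch, assuming the options live on an argparse Namespace as the error message suggests; the helper names and the example path are hypothetical:

```python
from argparse import Namespace


def txt_db_options(opts: Namespace) -> dict:
    """Collect every option whose name ends with '_txt_db'."""
    return {k: v for k, v in vars(opts).items() if k.endswith("_txt_db")}


def require_option(opts: Namespace, name: str):
    """Return opts.<name>, or raise an error that lists what is present."""
    if not hasattr(opts, name):
        found = sorted(txt_db_options(opts)) or "none"
        raise ValueError(f"missing option {name!r}; *_txt_db options found: {found}")
    return getattr(opts, name)


# Hypothetical arguments that lack train_txt_db, as in the reported error.
example = Namespace(mode="train", val_txt_db="/txt/val.db")
print(txt_db_options(example))
```

Calling `require_option(example, "train_txt_db")` then raises a ValueError naming the options that were found, which is easier to act on than the bare AttributeError.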

chenzhengdeeplearning commented 3 years ago

Oh, I am so sorry for getting that wrong before. Thanks for the reminder!

VisualJoyce commented 3 years ago

Also, your data/pretrained directory structure does not match the documentation.

chenzhengdeeplearning commented 3 years ago

Copy that

chenzhengdeeplearning commented 3 years ago

After preprocessing, I still get the error...

chenzhengdeeplearning commented 3 years ago

My directories are locked. Is that expected?

VisualJoyce commented 3 years ago

They are locked because they were generated from Docker as root. This should still work.

chenzhengdeeplearning commented 3 years ago

OK. And..

[1,1]<stderr>:    raise EnvironmentError(msg)
[1,1]<stderr>:OSError: Can't load config for './pretrain/wwm_ext'. Make sure that:
[1,1]<stderr>:
[1,1]<stderr>:- './pretrain/wwm_ext' is a correct model identifier listed on 'https://huggingface.co/models'
[1,1]<stderr>:
[1,1]<stderr>:- or './pretrain/wwm_ext' is the correct path to a directory containing a config.json file

VisualJoyce commented 3 years ago

Do you have data/pretrained/wwm_ext? If not, you need to download BERT-wwm-ext from Chinese-BERT-wwm.

VisualJoyce commented 3 years ago

Or change the value of pretrained_model_name_or_path in the config file to hfl/chinese-bert-wwm-ext.
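Both options can be combined in a small pre-flight check: use the local directory if it contains a config.json file (which is what the error message asks for), and otherwise fall back to the hub identifier. This is only a sketch; `resolve_pretrained` is a hypothetical helper, not a function from the repo or from transformers.

```python
from pathlib import Path

# Hub identifier mentioned above; used when no local copy is found.
HUB_FALLBACK = "hfl/chinese-bert-wwm-ext"


def resolve_pretrained(path_or_id: str, fallback: str = HUB_FALLBACK) -> str:
    """Return path_or_id if it is a directory holding config.json, else fallback."""
    if (Path(path_or_id) / "config.json").is_file():
        return path_or_id
    return fallback


print(resolve_pretrained("./pretrain/wwm_ext"))
```

The returned value is what would go into pretrained_model_name_or_path.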

chenzhengdeeplearning commented 3 years ago

It worked! My GPU doesn't have enough memory, so I changed the train batch size and val batch size to 2048. And now I get: subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.

What does that mean?

VisualJoyce commented 3 years ago

You need to paste the full log; otherwise it's hard to tell where the problem might be. When you post the log, please put it in a code block so it is easier to read.

chenzhengdeeplearning commented 3 years ago

[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Waiting on git info....
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Git branch: 
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Git SHA: 
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train_official.py", line 468, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "train_official.py", line 317, in main
[1,0]<stderr>:    best_ckpt = train(model, dataloaders, opts)
[1,0]<stderr>:  File "train_official.py", line 49, in train
[1,0]<stderr>:    save_training_meta(opts)
[1,0]<stderr>:  File "/src/chengyubert/utils/save.py", line 45, in save_training_meta
[1,0]<stderr>:    cwd=git_dir, universal_newlines=True).strip()
[1,0]<stderr>:  File "/opt/conda/lib/python3.7/subprocess.py", line 411, in check_output
[1,0]<stderr>:    **kwargs).stdout
[1,0]<stderr>:  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
[1,0]<stderr>:    output=stdout, stderr=stderr)
[1,0]<stderr>:subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.

Thank you very much!

VisualJoyce commented 3 years ago

Are you using a cloned repo, or did you download the zip from master?

chenzhengdeeplearning commented 3 years ago

Yes, I downloaded the zip before

VisualJoyce commented 3 years ago

The code tries to query the git info and fails, because a downloaded zip is not a git repository. Either clone the repo or comment out the line that queries git status.
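A third option is to make the git query tolerant of a missing repository. The sketch below is not the code in chengyubert/utils/save.py; it only shows the general pattern with a hypothetical `git_output_or_none` helper that returns None instead of raising.

```python
import subprocess
import tempfile


def git_output_or_none(args, cwd):
    """Run `git <args>` in cwd; return its output, or None if git fails or is unavailable."""
    try:
        return subprocess.check_output(
            ["git", *args], cwd=cwd,
            universal_newlines=True, stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, OSError):
        # CalledProcessError: git exited non-zero (e.g. status 128, not a repository).
        # OSError: git is not installed, or cwd does not exist.
        return None


# A fresh temporary directory is normally not inside a git repository.
with tempfile.TemporaryDirectory() as scratch:
    status = git_output_or_none(["status", "--short"], cwd=scratch)
    print("git status:", status if status is not None else "unavailable")
```

The caller can then record the git info when it exists and skip it otherwise, so training does not abort for users who downloaded the zip.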

chenzhengdeeplearning commented 3 years ago

It finally works! Thank you sooooo much!!!

VisualJoyce commented 3 years ago

Glad that works! I will add more details in the next release.

VisualJoyce / ChengyuBERT

Problems met when trying the code #2
