Closed chenzhengdeeplearning closed 3 years ago
This is because your machine is not using the latest driver.
Please try this Docker image tag instead: visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel.
Sorry, how do I use visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel?
diff --git a/docker_train.sh b/docker_train.sh
index 3788e15..b21df38 100644
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@ docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \
--mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
--mount src="$TXT_DB",dst=/txt,type=bind \
-e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
- -w /src visualjoyce/chengyubert:latest \
+ -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \
bash -c " PYTHONPATH=/src ${MODEL_PARA} ${HOROVOD_PARA} \\
python train_${SUB_PROJECT}.py --config=$CONFIG_DIR/$CONFIG_FILE"
docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \ --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \ --mount src="$TXT_DB",dst=/txt,type=bind \ -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
But what does this mean? diff --git a/docker_train.sh b/docker_train.sh index 3788e15..b21df38 100644 --- a/docker_train.sh +++ b/docker_train.sh @@ -45,6 +45,6 @@
Before, it was this:
--mount src="${WORK_DIR}",dst=/src,type=bind \
--mount src="$OUTPUT",dst=/storage,type=bind \
--mount src="$PRETRAIN_DIR",dst=/pretrain,type=bind,readonly \
--mount src=$ANNOTATION_DIR,dst=/annotations,type=bind,readonly \
--mount src="$TXT_DB",dst=/txt,type=bind \
Now it is this:
--mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
--mount src="$TXT_DB",dst=/txt,type=bind \
Is my understanding right that 3 lines were deleted?
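A diff shows the edits that were made: a line starting with - was removed, and a line starting with + was added. Lines without either prefix are unchanged context, so no --mount lines were deleted.
My last post means you need to change visualjoyce/chengyubert:latest to visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel in docker_train.sh.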
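To make the diff notation concrete, here is a minimal Python sketch that sorts the lines of a unified diff hunk into removed, added, and unchanged context lines. The function name `classify_diff_lines` and the shortened sample hunk (trailing backslashes omitted) are illustrative assumptions, not part of the repository.

```python
def classify_diff_lines(diff_text):
    """Split the hunk lines of a unified diff into removed, added, and context."""
    removed, added, context = [], [], []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff "):
            # A new file section starts; its header lines are not hunk content.
            in_hunk = False
            continue
        if line.startswith("@@"):
            # Hunk header: everything after it is classified by its first character.
            in_hunk = True
            continue
        if not in_hunk:
            continue
        if line.startswith("-"):
            removed.append(line[1:].strip())
        elif line.startswith("+"):
            added.append(line[1:].strip())
        else:
            context.append(line.strip())
    return removed, added, context


# Shortened version of the hunk posted above (illustrative only).
SAMPLE = """\
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@
  --mount src="$TXT_DB",dst=/txt,type=bind
  -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
- -w /src visualjoyce/chengyubert:latest
+ -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel
"""

removed, added, context = classify_diff_lines(SAMPLE)
print("removed:", removed)
print("added:  ", added)
print("context:", context)
```

Only the image tag line appears in both the removed and added lists; the mount and environment lines land in the context list, meaning they stay in the script untouched.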
Thanks for your patience.
parser.add_argument("--model", default='paired',
                    choices=['snlive'],
                    help="choose from 2 model architecture")
What do paired and snlive mean?
Here is another error I got.
File "train_official.py", line 304, in main
raise ValueError(f"No such model [{opts.model}] supported!")
ValueError: No such model [paired] supported!
This is due to a copy-paste from earlier code. For now you may ignore the parameter, as it is overwritten by MODEL=chengyubert-dual on the command line.
I will fix that in my next version.
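As a rough illustration of why the parser accepted `paired` even though it is not in `choices`: `argparse` applies `choices` to values passed on the command line but, as far as I know, does not check the default against it. The sketch below reuses the argument definition quoted above; `SUPPORTED_MODELS` and `check_model` are hypothetical stand-ins for the check in `train_official.py`, whose actual list of supported models is not shown in this thread.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="paired",
                    choices=["snlive"],
                    help="choose from 2 model architecture")

# No --model flag is passed, so argparse hands back the default as-is,
# even though 'paired' is not listed in choices.
opts = parser.parse_args([])
print(opts.model)

# Hypothetical stand-in for the later check that raised the ValueError.
SUPPORTED_MODELS = ["snlive"]


def check_model(name):
    if name not in SUPPORTED_MODELS:
        raise ValueError(f"No such model [{name}] supported!")
    return name


try:
    check_model(opts.model)
except ValueError as err:
    print(err)
```

Keeping the default inside `choices` would let the parser and the later check agree.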
Sorry that I have run into so many errors... Here is another one. I don't even know why it occurs, because there is only one process on my computer.
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted. mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.
That's OK. Thank you for pointing these out!
The code supports multiple GPUs. If you only have one GPU, then use CUDA_VISIBLE_DEVICES=0.
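A small sketch for checking how many GPUs a given `CUDA_VISIBLE_DEVICES` value exposes before launching a run. The helper `visible_gpu_ids` is hypothetical and not part of the repository; it only parses the variable and does not query the driver.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES as strings."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # An unset or empty variable yields an empty list here; this helper does
    # not try to enumerate the GPUs that are physically installed.
    return [part.strip() for part in raw.split(",") if part.strip()]


print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0"}))    # single-GPU setting
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))  # two-GPU setting
```

On a single-GPU machine the list should contain exactly one ID.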
ok! And what is 'len_idiom_vocab'?
len_idiom_vocab is 3848 for the ChID dataset.
Our next paper, which supports more than 30k idioms, is under review, so it's a parameter for future compatibility.
The mode I chose is 'train'.
Your code in train_official.py is this:
if opts.mode == 'train':
    splits, dataloaders = create_dataloaders(LOGGER, DatasetCls, EvalDatasetCls,
                                             collate_fn, eval_collate_fn, opts, splits=['train', 'val'])
    best_ckpt = train(model, dataloaders, opts)
else:
    splits = []
    for k in dir(opts):
        if k.endswith('_txt_db'):
The error is:
AttributeError: 'Namespace' object has no attribute 'train_txt_db'
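The message says that the parsed options object has no `train_txt_db` entry at all. One way to inspect what is actually present is sketched below; the helper `txt_db_options` and the example values are assumptions for illustration, and where the repository populates these options is not shown in this thread.

```python
from argparse import Namespace


def txt_db_options(opts):
    """Collect every option whose name ends with '_txt_db'."""
    return {k: v for k, v in vars(opts).items() if k.endswith("_txt_db")}


# Illustrative options object without any *_txt_db entries.
opts = Namespace(mode="train", model="chengyubert-dual")

print(txt_db_options(opts))                 # shows which *_txt_db options exist
print(getattr(opts, "train_txt_db", None))  # safe lookup instead of opts.train_txt_db
```

An empty result suggests the config or data setup did not provide the expected entries.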
I can run the code on my machine with
CUDA_VISIBLE_DEVICES=0 CONFIG_FILE="train-official-bert-base-1gpu.json" \
bash docker_train.sh official \
"MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"
Can you post the command line?
This is mine:
CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE="train-official-bert-base-1gpu.json" bash docker_train.sh official "MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"
I also ran your command, and it gives the same error... Is my data directory missing something? I don't have train_txt_db.
Have you done the preprocessing without error?
Oh, I am so sorry, I got that wrong before. Thanks for the reminder!
Also, your data/pretrained structure is not the same as in the documentation.
Copy that
After preprocessing, I still get the error.
My directories are locked; is that right?
They are locked because they were generated from Docker as root. This should work.
ok, And..
[1,1]<stderr>: raise EnvironmentError(msg)
[1,1]<stderr>:OSError: Can't load config for './pretrain/wwm_ext'. Make sure that:
[1,1]<stderr>:
[1,1]<stderr>:- './pretrain/wwm_ext' is a correct model identifier listed on 'https://huggingface.co/models'
[1,1]<stderr>:
[1,1]<stderr>:- or './pretrain/wwm_ext' is the correct path to a directory containing a config.json file
Do you have data/pretrained/wwm_ext? If not, you need to download BERT-wwm-ext from Chinese-BERT-wwm.
Or change the value of pretrained_model_name_or_path in the config file to hfl/chinese-bert-wwm-ext.
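The two options above can be sketched as a small check: use the local directory if it contains a `config.json` file (which is what the error message asks for), otherwise fall back to the Hub identifier. The helper `resolve_pretrained` is hypothetical and not part of the repository; it only inspects the filesystem and does not load a model.

```python
import tempfile
from pathlib import Path


def resolve_pretrained(local_dir, hub_id="hfl/chinese-bert-wwm-ext"):
    """Return local_dir if it holds a config.json, else the Hub identifier."""
    if (Path(local_dir) / "config.json").is_file():
        return str(local_dir)
    return hub_id


# Demonstration with a temporary directory standing in for the model folder.
with tempfile.TemporaryDirectory() as tmp:
    print(resolve_pretrained(tmp))  # no config.json yet, so the Hub ID is returned
    (Path(tmp) / "config.json").write_text("{}")
    print(resolve_pretrained(tmp) == tmp)  # the local directory is used now
```

The returned value is what would go into `pretrained_model_name_or_path` in the config file.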
It worked! My computer doesn't have enough CUDA memory, so I changed the train batch size and val batch size to 2048. And now I get:
subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.
What does that mean?
You need to paste the full log; otherwise it's hard to tell where the problem might be. When you post the log, try using a code block to show it in a readable format.
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ - Waiting on git info....
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ - Git branch:
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ - Git SHA:
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "train_official.py", line 468, in <module>
[1,0]<stderr>: main(args)
[1,0]<stderr>: File "train_official.py", line 317, in main
[1,0]<stderr>: best_ckpt = train(model, dataloaders, opts)
[1,0]<stderr>: File "train_official.py", line 49, in train
[1,0]<stderr>: save_training_meta(opts)
[1,0]<stderr>: File "/src/chengyubert/utils/save.py", line 45, in save_training_meta
[1,0]<stderr>: cwd=git_dir, universal_newlines=True).strip()
[1,0]<stderr>: File "/opt/conda/lib/python3.7/subprocess.py", line 411, in check_output
[1,0]<stderr>: **kwargs).stdout
[1,0]<stderr>: File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
[1,0]<stderr>: output=stdout, stderr=stderr)
[1,0]<stderr>:subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.
Thank you very much!
Are you using a cloned repo, or did you download the zip from master?
Yes, I downloaded the zip.
The code tries to query the git info and fails because a downloaded zip is not a git repository. Either clone the repo or comment out the lines that query the git status.
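As an alternative to commenting the lines out, the git queries could be made tolerant of a missing repository. This is a minimal sketch, not the code in `chengyubert/utils/save.py`: the helper `git_output_or_default` and its `default` value are assumptions, while the `subprocess.check_output` arguments mirror those visible in the traceback.

```python
import subprocess


def git_output_or_default(args, cwd, default="unknown"):
    """Run a git command in cwd; return its output, or default if it fails."""
    try:
        return subprocess.check_output(
            ["git", *args], cwd=cwd,
            stderr=subprocess.DEVNULL,  # hide "fatal: not a git repository"
            universal_newlines=True).strip()
    except (subprocess.CalledProcessError, OSError):
        # CalledProcessError: git exited non-zero (e.g. status 128 outside a repo).
        # OSError: git is not installed or cwd does not exist.
        return default


print(git_output_or_default(["status", "--short"], cwd="."))
```

With this kind of wrapper, training started from an unpacked zip would record a placeholder value instead of aborting.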
It finally works! Thank you sooooo much!!!
Glad that works! I will add more details in the next release.