dptech-corp / Uni-Core

an efficient distributed PyTorch framework
MIT License

Issues encountered when using Uni-Core in Uni-Mol #34

Open jerermyyoung opened 1 year ago

jerermyyoung commented 1 year ago

I tried to run the fine-tuning script provided in Uni-Mol (pasted here for easy reference).

data_path="./molecular_property_prediction"  # replace with your data path
save_dir="./save_finetune"  # replace with your save path
n_gpu=4
MASTER_PORT=10086
dict_name="dict.txt"
weight_path="./weights/checkpoint.pt"  # replace with your ckpt path
task_name="qm9dft"  # molecular property prediction task name
task_num=3
loss_func="finetune_smooth_mae"
lr=1e-4
batch_size=32
epoch=40
dropout=0
warmup=0.06
local_batch_size=32
only_polar=0
conf_size=11
seed=0

if [ "$task_name" == "qm7dft" ] || [ "$task_name" == "qm8dft" ] || [ "$task_name" == "qm9dft" ]; then
    metric="valid_agg_mae"
elif [ "$task_name" == "esol" ] || [ "$task_name" == "freesolv" ] || [ "$task_name" == "lipo" ]; then
    metric="valid_agg_rmse"
else 
    metric="valid_agg_auc"
fi

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1
update_freq=`expr $batch_size / $local_batch_size`
python -m torch.distributed.launch --nproc_per_node=$n_gpu --master_port=$MASTER_PORT $(which unicore-train) $data_path --task-name $task_name --user-dir ./unimol --train-subset train --valid-subset valid \
       --conf-size $conf_size \
       --num-workers 8 --ddp-backend=c10d \
       --dict-name $dict_name \
       --task mol_finetune --loss $loss_func --arch unimol_base  \
       --classification-head-name $task_name --num-classes $task_num \
       --optimizer adam --adam-betas "(0.9, 0.99)" --adam-eps 1e-6 --clip-norm 1.0 \
       --lr-scheduler polynomial_decay --lr $lr --warmup-ratio $warmup --max-epoch $epoch --batch-size $local_batch_size --pooler-dropout $dropout\
       --update-freq $update_freq --seed $seed \
       --fp16 --fp16-init-scale 4 --fp16-scale-window 256 \
       --log-interval 100 --log-format simple \
       --validate-interval 1 \
       --finetune-from-model $weight_path \
       --best-checkpoint-metric $metric --patience 20 \
       --save-dir $save_dir --only-polar $only_polar \
       --reg

# --reg, for regression task
# --maximize-best-checkpoint-metric, for classification task

However, I encountered the following error:

unicore-train: error: unrecognized arguments: --local-rank=0

The argument --local-rank does not appear anywhere in Uni-Core. I am using PyTorch 2.0, and the log also prints this warning:

If your script expects `--local-rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions 

I am not sure whether this means that Uni-Core does not support PyTorch 2.0 (which seems unlikely) or whether there is another problem.
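
For anyone hitting the same message, the failure can be reproduced with plain argparse, with no PyTorch or Uni-Core involved. As far as I know, the launcher in PyTorch 2.0 passes the flag spelled with a hyphen (--local-rank), while older versions used an underscore (--local_rank), and argparse treats the two spellings as different options. The sketch below is not Uni-Core code; build_parser, accepts, and the demo-train program name are hypothetical and exist only for illustration.

```python
import argparse


def build_parser(option_strings):
    """Build a tiny parser exposing one integer option under the given spellings."""
    parser = argparse.ArgumentParser(prog="demo-train")  # hypothetical program name
    parser.add_argument(*option_strings, dest="device_id", default=0, type=int)
    return parser


def accepts(parser, argv):
    """Return True if argv parses cleanly, False if argparse rejects it."""
    try:
        parser.parse_args(argv)
        return True
    except SystemExit:
        # argparse prints "unrecognized arguments: ..." to stderr and exits.
        return False


if __name__ == "__main__":
    # A parser that only knows the underscore spelling.
    old_style = build_parser(["--device-id", "--local_rank"])
    print(accepts(old_style, ["--local_rank=0"]))  # underscore spelling
    print(accepts(old_style, ["--local-rank=0"]))  # hyphen spelling
```

The second call is rejected because argparse only expands unambiguous prefixes of registered options; it does not map hyphens to underscores inside an option name.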

guolinke commented 1 year ago

If you use PyTorch 2.0, please make sure your Uni-Core version is not earlier than https://github.com/dptech-corp/Uni-Core/tree/91ebaa0a73ac7ef52b57e9e8f6ddf22e32eb3c2e

wayyzt commented 1 day ago

Open the file options.py (probably located at ~/miniconda3/envs/unicore/lib/python3.10/site-packages/unicore/options.py).

Replace '--local_rank' with '--local-rank', so that the line reads: group.add_argument('--device-id', '--local-rank', default=0, type=int, help='which GPU to use (usually configured automatically)')
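
If you would rather not hard-code a single spelling, a variant of this edit is to register both spellings as aliases and fall back to the LOCAL_RANK environment variable mentioned in the PyTorch warning. This is only a minimal sketch and not the actual Uni-Core implementation: parse_device_id and its env parameter are hypothetical names, and only the --device-id option and its help text come from the line quoted above.

```python
import argparse
import os


def parse_device_id(argv=None, env=None):
    """Resolve the local GPU index from CLI flags, then from LOCAL_RANK.

    Hypothetical helper for illustration; not part of Uni-Core.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--device-id", "--local_rank", "--local-rank",  # accept both spellings
        dest="device_id", default=None, type=int,
        help="which GPU to use (usually configured automatically)",
    )
    # parse_known_args ignores any other training flags in argv.
    args, _unknown = parser.parse_known_args(argv)
    if args.device_id is not None:
        return args.device_id
    # The launcher is documented to export LOCAL_RANK for each worker process.
    return int(env.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    print(parse_device_id(["--local-rank=2"], env={}))
    print(parse_device_id([], env={"LOCAL_RANK": "3"}))
```

Keeping the underscore alias means the same file still works when launched from an older PyTorch that passes --local_rank.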