Closed zjc664656505 closed 2 years ago
@DeviRule Please help
Can you send me a link to the whole log file?
Yes, I can send you the link to the whole log file.
1880878 2022-03-09,20:40:55.803 - {fedavg_main_tc.py (54)} - wandb login --relogin
to force relogin)
wandb: wandb version 0.12.11 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.12
wandb: Syncing run feasible-microwave-27
wandb: ⭐️ View project at https://wandb.ai/zjc664656505/fednlp
wandb: 🚀 View run at https://wandb.ai/zjc664656505/fednlp/runs/3fqv989l
wandb: Run data is saved locally in /home/junchen/FedNLP-master/experiments/distributed/transformer_exps/run_tc_exps/wandb/run-20220309_204056-3fqv989l
wandb: Run `wandb offline` to turn off syncing.
1880878 2022-03-09,20:40:56.945 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [1, 0]}
1880878 2022-03-09,20:40:56.945 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
1880878 2022-03-09,20:40:56.945 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 1, worker_number = 1
1880878 2022-03-09,20:40:56.976 - {gpu_mapping.py (37)} - mapping_processes_to_gpu_device_from_yaml_file(): process_id = 0, GPU device = cuda:0
1880878 2022-03-09,20:40:56.976 - {fedavg_main_tc.py (84)} -
The last line is where the training process freezes, right when client sampling starts.
This is the gpu_mapping.yaml setting I followed. So far I have only modified mapping_default:
mapping_lambda-server2:
    lambda-server2: [0, 0, 0, 0, 3, 2, 3, 3]
# this is used for 10 clients and 1 server training within a single machine which has 4 GPUs
mapping_default:
    ChaoyangHe-GPU-RTX2080Tix4: [1,0]
# this is used for 4 clients and 1 server training within a single machine which has 4 GPUs
mapping_config1_5:
    host1: [2, 1, 1, 1]
# this is used for 10 clients and 1 server training within a single machine which has 4 GPUs
mapping_config2_11:
    host1: [3, 3, 3, 2]
# this is used for 10 clients and 1 server training within a single machine which has 8 GPUs
mapping_config3_11:
    host1: [2, 2, 2, 1, 1, 1, 1, 1]
# this is used for 4 clients and 1 server training within a single machine which has 8 GPUs, but you hope to skip the GPU device ID.
mapping_config4_5:
    host1: [1, 0, 0, 1, 1, 0, 1, 1]
# this is used for 4 clients and 1 server training using 6 machines, each machine has 2 GPUs inside, but you hope to use the second GPU.
mapping_ink-ron:
    ink-ron: [1, 2, 2, 2, 2, 2]
mapping_ink-lucy:
    ink-lucy: [2, 3, 3, 3]
mapping_a100:
    g1lmd2: [2, 2, 2, 1, 1, 1, 1, 1]
mapping_a100_2:
    g1lmd2: [2, 2, 2, 1, 1, 1, 1, 1]
I am confused here: which gpu_mapping are you using? Also, I cannot access the wandb link https://wandb.ai/zjc664656505/fednlp/runs/3fqv989l. It seems the run is set to private.
Hello, as I mentioned before, the gpu_mapping I used was the default gpu mapping:
mapping_default: ChaoyangHe-GPU-RTX2080Tix4: [1,0].
I have made the wandb report public so that you can view it.
Thanks.
You did not assign any compute power to the clients. I think this is a typo in the documentation; I will fix it later.
mapping_default: ChaoyangHe-GPU-RTX2080Tix4: [1,0] means that you assign one process (the server) to GPU 0 and nothing to GPU 1, so no client ever runs. To start training with two GPUs, change the GPU mapping to ChaoyangHe-GPU-RTX2080Tix4: [1,2], which runs the server on GPU 0 and two clients on GPU 1. If you only have one GPU, change the GPU mapping to ChaoyangHe-GPU-RTX2080Tix4: [3].
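To make the meaning of the list concrete, here is a minimal Python sketch of how such a mapping could be expanded into process slots. It is an illustration only and not the code in gpu_mapping.py; the function name expand_gpu_mapping is hypothetical, and it assumes, as described above, that the first slot (process 0) is the server.

```python
from typing import Dict, List, Tuple


def expand_gpu_mapping(gpu_util: Dict[str, List[int]]) -> List[Tuple[int, str, int]]:
    """Return one (process_id, host, gpu_id) entry per process slot.

    Hypothetical helper: entry j of each host's list is read as the
    number of processes to place on GPU j of that host.
    """
    slots = []
    process_id = 0
    for host, per_gpu_counts in gpu_util.items():
        for gpu_id, count in enumerate(per_gpu_counts):
            for _ in range(count):
                slots.append((process_id, host, gpu_id))
                process_id += 1
    return slots


HOST = "ChaoyangHe-GPU-RTX2080Tix4"
for mapping in ([1, 0], [1, 2], [3]):
    slots = expand_gpu_mapping({HOST: mapping})
    # Print (process_id, gpu_id) pairs; process 0 is assumed to be the server.
    print(mapping, "->", [(pid, gpu) for pid, _, gpu in slots])
```

Under this reading, [1,0] yields a single slot (the server) and no slot for any client, while [1,2] and [3] both yield three slots: one server and two clients.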
I tried your suggestion, but now there is a worker number mismatch:
2868130 2022-03-10,14:13:49.208 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [3]}
2868130 2022-03-10,14:13:49.208 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
2868130 2022-03-10,14:13:49.209 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 3, worker_number = 1
Traceback (most recent call last):
File "fedavg_main_tc.py", line 82, in
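The log above reports i = 3 next to worker_number = 1, which suggests that the number of slots in the mapping is compared with the number of launched processes. The following Python sketch shows that kind of consistency check under this assumption; the function names are hypothetical and the code is not taken from gpu_mapping.py.

```python
from typing import Dict, List


def total_slots(gpu_util: Dict[str, List[int]]) -> int:
    """Count all process slots declared in a mapping (hypothetical helper)."""
    return sum(sum(counts) for counts in gpu_util.values())


def mapping_matches_process_count(gpu_util: Dict[str, List[int]], process_count: int) -> bool:
    """Assumed rule: the mapping must declare exactly one slot per launched process."""
    return total_slots(gpu_util) == process_count


mapping = {"ChaoyangHe-GPU-RTX2080Tix4": [3]}

# Starting the script directly with `python fedavg_main_tc.py` gives one process.
print(mapping_matches_process_count(mapping, 1))

# One server plus two clients per round, started through mpirun, gives three.
print(mapping_matches_process_count(mapping, 2 + 1))
```

If this reading is right, a mapping of [3] only fits a launch with three processes, for example one server and two clients per round started through mpirun.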
Can you send me the command you use to launch the script?
We first modify the GPU mapping file and then launch with a command like "sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10".
Yes. I used the python fedavg_main_tc.py command to run the experiment. I have double-checked the GPU mapping; it should be correct, since I'm only using a single GPU for training.
The problem with the command you provided is:
junchen@lacrymosa:~/FedNLP-master/experiments/distributed/transformer_exps/run_tc_exps$ sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10 1
usage: fedavg_main_tc.py [-h] [--run_id RUN_ID] [--is_debug_mode IS_DEBUG_MODE] [--dataset N] [--data_file_path DATA_FILE_PATH] [--partition_file_path PARTITION_FILE_PATH] [--partition_method PARTITION_METHOD] [--model_type N] [--model_name N] [--do_lower_case N] [--train_batch_size N] [--eval_batch_size N] [--max_seq_length N] [--n_gpu EP] [--fp16] [--manual_seed N] [--output_dir N] [--fl_algorithm FL_ALGORITHM] [--backend BACKEND] [--comm_round COMM_ROUND] [--is_mobile IS_MOBILE] [--client_num_in_total NN] [--client_num_per_round NN] [--epochs EP] [--gradient_accumulation_steps EP] [--client_optimizer CLIENT_OPTIMIZER] [--lr LR] [--weight_decay N] [--server_optimizer SERVER_OPTIMIZER] [--server_lr SERVER_LR] [--server_momentum SERVER_MOMENTUM] [--fedprox_mu FEDPROX_MU] [--evaluate_during_training_steps EP] [--frequency_of_the_test FREQUENCY_OF_THE_TEST] [--gpu_mapping_file GPU_MAPPING_FILE] [--gpu_mapping_key GPU_MAPPING_KEY] [--ci CI] [--reprocess_input_data] [--freeze_layers N]
fedavg_main_tc.py: error: argument --client_num_per_round: expected one argument
This error appears regardless of what value I pass to --client_num_per_round.
Also, I have already defined all of the required arguments in the initializer.py file. I'm not sure why this is happening.
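For context, argparse prints "expected one argument" when an option that takes a value is immediately followed by another option or by nothing, which is what happens if the shell variable supplying the value expands to an empty string. The Python sketch below reproduces that behaviour in isolation. It is only one possible explanation of the error above; the two-option parser and its default values are stand-ins, not the real definitions from initializer.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser with two of the options from the usage message;
    # the defaults here are hypothetical.
    parser = argparse.ArgumentParser(prog="fedavg_main_tc.py")
    parser.add_argument("--client_num_per_round", type=int, default=4)
    parser.add_argument("--comm_round", type=int, default=10)
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse has already printed its usage and error message to stderr.
        return None


# The value is missing, as when an empty $WORKER_NUM is substituted unquoted.
print(try_parse(["--client_num_per_round", "--comm_round", "10"]))

# The value is present.
print(try_parse(["--client_num_per_round", "2", "--comm_round", "10"]))
```

In that situation the arguments defined in the Python code are not the problem; the value never reaches the parser.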
Just run "sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10"; there is no need to add anything more, and the client number will adjust automatically. Make sure to modify --gpu_mapping_key and the dataset-related arguments in the bash script.
I tried this command several times and modified the GPU mapping key and the dataset-related arguments in the sh file, but it still does not work. The error looks like this:
junchen@lacrymosa:~/FedNLP-master/experiments/distributed/transformer_exps/run_tc_exps$ sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10
expr: non-integer argument
[proxy:0:0@lacrymosa.ics.uci.edu] HYDU_create_process (utils/launch/launch.c:74): execvp error on file mpi_host_file (No such file or directory)
Here is the run_text_classification.sh file content:
FL_ALG=$1
PARTITION_METHOD=$2
C_LR=$3
S_LR=$4
ROUND=$5
WORKER_NUM=$3
LOG_FILE="fedavg_transformer_tc.log"
CI=0
DATA_DIR=~/fednlp_data/
DATA_NAME=20news
PROCESS_NUM=`expr $WORKER_NUM + 1`
echo $PROCESS_NUM
hostname > mpi_host_file
mpirun -np $PROCESS_NUM -hostfile mpi_host_file \
python -m fedavg_main_tc \
  --gpu_mapping_file "gpu_mapping.yaml" \
  --gpu_mapping_key mapping_default \
  --client_num_per_round $WORKER_NUM \
  --comm_round $ROUND \
  --ci $CI \
  --dataset "${DATA_NAME}" \
  --data_file "${DATA_DIR}/data_files/${DATA_NAME}_data.h5" \
  --partition_file "${DATA_DIR}/partition_files/${DATA_NAME}_partition.h5" \
  --partition_method $PARTITION_METHOD \
  --fl_algorithm $FL_ALG \
  --model_type distilbert \
  --model_name distilbert-base-uncased \
  --do_lower_case True \
  --train_batch_size 32 \
  --eval_batch_size 8 \
  --max_seq_length 256 \
  --lr $C_LR \
  --server_lr $S_LR \
  --epochs 1 \
  --output_dir "/tmp/fedavg${DATA_NAME}_output/"
I double checked my directory and the mpi_host_file is in my directory, I'm confused why this is happening.
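One possible reading of the two error messages, given the script as posted: WORKER_NUM takes the third positional argument, which is the learning rate 5e-5, so expr rejects it and PROCESS_NUM ends up empty. The Python sketch below imitates that chain of events. It is a simulation under these assumptions, not the shell script itself, and expr_plus_one is a hypothetical stand-in for the expr call.

```python
import shlex


def expr_plus_one(value: str) -> str:
    """Imitate `expr $value + 1`: integers only, empty output on failure."""
    try:
        return str(int(value) + 1)
    except ValueError:
        # expr reports "non-integer argument" on stderr and prints nothing.
        return ""


# The five arguments from the failing command line.
args = ["FedOPT", "uniform", "5e-5", "0.1", "10"]

worker_num = args[2]  # WORKER_NUM=$3 (shell positions are 1-based)
process_num = expr_plus_one(worker_num)

# An empty, unquoted variable simply disappears from the command line.
command = f"mpirun -np {process_num} -hostfile mpi_host_file python -m fedavg_main_tc"
tokens = shlex.split(command)
print(tokens)
```

In the resulting token list, -np is followed directly by -hostfile. If mpirun consumes -hostfile as the value of -np, it may then try to execute mpi_host_file as the program, which would match the execvp error even though the file exists. Under this reading, WORKER_NUM needs to receive an integer, for instance from its own positional argument.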
@chaoyanghe Chaoyang, can you look into this? It seems related to FedML.
After reconfiguring the environment, the code finally runs. Thank you so much for your help!
Hello authors,
I'm currently reproducing your text classification work on the 20news dataset. I'm using a single Nvidia A6000 for this task with the FedOPT algorithm, 50 clients in total and 2 clients per round.
After the data is loaded, the training process freezes once it reaches the client sampling step:
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [1, 0]}
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 1, worker_number = 1
1133831 2022-03-08,21:49:04.511 - {gpu_mapping.py (37)} - mapping_processes_to_gpu_device_from_yaml_file(): process_id = 0, GPU device = cuda:0
1133831 2022-03-08,21:49:04.511 - {fedavg_main_tc.py (84)} -(): process_id = 0, size = 1, device=cuda:0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (85)} - (): torch.cuda.current_device()=0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (86)} - (): torch.cuda.device_count()=2
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
I don't know why this is happening. Could you help me with this issue?