FedML-AI / FedNLP

FedNLP: An Industry and Research Integrated Platform for Federated Learning in Natural Language Processing, Backed by FedML, Inc. The Previous Research Version is Accepted to NAACL 2022

[IMPORTANT] Client Sampling Frozen #28

Closed zjc664656505 closed 2 years ago

zjc664656505 commented 2 years ago

Hello authors,

I'm currently running your text classification experiment on the 20news dataset. I'm using a single Nvidia A6000 for this task with the FedOPT algorithm, 50 clients in total and 2 clients per round.

After the data are loaded, once the training process reaches the client sampling step, it freezes like this:

```
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [1, 0]}
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 1, worker_number = 1
1133831 2022-03-08,21:49:04.511 - {gpu_mapping.py (37)} - mapping_processes_to_gpu_device_from_yaml_file(): process_id = 0, GPU device = cuda:0
1133831 2022-03-08,21:49:04.511 - {fedavg_main_tc.py (84)} - (): process_id = 0, size = 1, device=cuda:0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (85)} - (): torch.cuda.current_device()=0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (86)} - (): torch.cuda.device_count()=2
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
```

I don't know why this is happening. Could you help me with this issue?

chaoyanghe commented 2 years ago

@DeviRule Please help

DeviRule commented 2 years ago

Can you send me a link to the whole log file?

zjc664656505 commented 2 years ago

Yes, I can. Here is the whole log:

```
1880878 2022-03-09,20:40:55.803 - {fedavg_main_tc.py (54)} - (): Namespace(backend='MPI', ci=0, client_num_in_total=50, client_num_per_round=2, client_optimizer='adam', comm_round=10, data_file_path='/home/junchen/fednlp_data/data_files/20news_data.h5', dataset='20news', do_lower_case=True, epochs=3, eval_batch_size=8, evaluate_during_training_steps=100, fedprox_mu=1, fl_algorithm='FedOPT', fp16=False, freeze_layers='', frequency_of_the_test=1, gpu_mapping_file='gpu_mapping.yaml', gpu_mapping_key='mapping_default', gradient_accumulation_steps=1, is_debug_mode=0, is_mobile=1, lr=0.1, manual_seed=42, max_seq_length=128, model_name='distilbert-base-uncased', model_type='distilbert', n_gpu=1, output_dir='/tmp/', partition_file_path='/home/junchen/fednlp_data/partition_files/20news_partition.h5', partition_method='uniform', reprocess_input_data=False, run_id=0, server_lr=0.1, server_momentum=0, server_optimizer='sgd', train_batch_size=8, weight_decay=0)
1880878 2022-03-09,20:40:55.804 - {fedavg_main_tc.py (69)} - (): #############process ID = 0, host name = lacrymosa.ics.uci.edu########, process ID = 1880878, process Name = psutil.Process(pid=1880878, name='FedNLP-20news:0', status='running', started='20:40:52')
wandb: Currently logged in as: zjc664656505 (use wandb login --relogin to force relogin)
wandb: wandb version 0.12.11 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.12
wandb: Syncing run feasible-microwave-27
wandb: ⭐️ View project at https://wandb.ai/zjc664656505/fednlp
wandb: 🚀 View run at https://wandb.ai/zjc664656505/fednlp/runs/3fqv989l
wandb: Run data is saved locally in /home/junchen/FedNLP-master/experiments/distributed/transformer_exps/run_tc_exps/wandb/run-20220309_204056-3fqv989l
wandb: Run wandb offline to turn off syncing.
```

```
1880878 2022-03-09,20:40:56.945 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [1, 0]}
1880878 2022-03-09,20:40:56.945 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
1880878 2022-03-09,20:40:56.945 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 1, worker_number = 1
1880878 2022-03-09,20:40:56.976 - {gpu_mapping.py (37)} - mapping_processes_to_gpu_device_from_yaml_file(): process_id = 0, GPU device = cuda:0
1880878 2022-03-09,20:40:56.976 - {fedavg_main_tc.py (84)} - (): process_id = 0, size = 1, device=cuda:0
1880878 2022-03-09,20:40:56.976 - {fedavg_main_tc.py (85)} - (): torch.cuda.current_device()=0
1880878 2022-03-09,20:40:56.976 - {fedavg_main_tc.py (86)} - (): torch.cuda.device_count()=2
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
```

zjc664656505 commented 2 years ago

The last line is where the training process freezes once client sampling starts.

zjc664656505 commented 2 years ago

This is the gpu_mapping.yaml I followed. I have only modified mapping_default so far:

```yaml
mapping_lambda-server2:
  lambda-server2: [0, 0, 0, 0, 3, 2, 3, 3]

# this is used for 10 clients and 1 server training within a single machine which has 4 GPUs
mapping_default:
  ChaoyangHe-GPU-RTX2080Tix4: [1, 0]

# this is used for 4 clients and 1 server training within a single machine which has 4 GPUs
mapping_config1_5:
  host1: [2, 1, 1, 1]

# this is used for 10 clients and 1 server training within a single machine which has 4 GPUs
mapping_config2_11:
  host1: [3, 3, 3, 2]

# this is used for 10 clients and 1 server training within a single machine which has 8 GPUs
mapping_config3_11:
  host1: [2, 2, 2, 1, 1, 1, 1, 1]

# this is used for 4 clients and 1 server training within a single machine which has 8 GPUs, but you hope to skip the GPU device ID.
mapping_config4_5:
  host1: [1, 0, 0, 1, 1, 0, 1, 1]

# this is used for 4 clients and 1 server training using 6 machines, each machine has 2 GPUs inside, but you hope to use the second GPU.
mapping_ink-ron:
  ink-ron: [1, 2, 2, 2, 2, 2]

mapping_ink-lucy:
  ink-lucy: [2, 3, 3, 3]

mapping_a100:
  g1lmd2: [2, 2, 2, 1, 1, 1, 1, 1]

mapping_a100_2:
  g1lmd2: [2, 2, 2, 1, 1, 1, 1, 1]
```

DeviRule commented 2 years ago

I am confused here: which gpu_mapping key are you using? Also, I cannot access the wandb link (https://wandb.ai/zjc664656505/fednlp/runs/3fqv989l); it seems you have made it private.

zjc664656505 commented 2 years ago

Hello, as I mentioned before, the gpu mapping I used is the default one:

`mapping_default: ChaoyangHe-GPU-RTX2080Tix4: [1, 0]`

I have made the wandb report public so that you can view it.

Thanks.

DeviRule commented 2 years ago

You did not assign any compute resources to the clients. I think it is a typo in the documentation; I will fix it later.

`mapping_default: ChaoyangHe-GPU-RTX2080Tix4: [1, 0]` means that you assign one server process to GPU 0 and nothing to GPU 1, so the sampled clients never get a process to run on. To start training with two GPUs, change the mapping to `ChaoyangHe-GPU-RTX2080Tix4: [1, 2]`, meaning the server runs on GPU 0 and the two clients run on GPU 1. If you only have one GPU, change the mapping to `ChaoyangHe-GPU-RTX2080Tix4: [3]`.
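To make the rule concrete, here is a minimal sketch (a hypothetical helper, not part of FedNLP) of the capacity check implied above: each entry in the mapping is the number of processes placed on that GPU, and the total must cover one server plus `client_num_per_round` clients.

```python
# Hypothetical helper (not part of FedNLP) illustrating the mapping rule above:
# each list entry is the number of processes placed on that GPU index, and the
# total must equal 1 server process + client_num_per_round client processes.

def mapping_can_host(gpu_counts, client_num_per_round):
    """Return True if the mapping declares enough process slots for this run."""
    return sum(gpu_counts) == client_num_per_round + 1

# The original mapping_default only declares the server, so the 2 sampled clients
# have nowhere to run and the job appears to hang at client sampling:
print(mapping_can_host([1, 0], client_num_per_round=2))  # False
# Both suggested fixes declare 3 processes (1 server + 2 clients):
print(mapping_can_host([1, 2], client_num_per_round=2))  # True
print(mapping_can_host([3], client_num_per_round=2))     # True
```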

zjc664656505 commented 2 years ago

I tried your suggestion. However, there is now a worker number mismatch issue:

```
2868130 2022-03-10,14:13:49.208 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [3]}
2868130 2022-03-10,14:13:49.208 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
2868130 2022-03-10,14:13:49.209 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 3, worker_number = 1
Traceback (most recent call last):
  File "fedavg_main_tc.py", line 82, in <module>
    process_id, worker_number, args.gpu_mapping_file, args.gpu_mapping_key)
  File "/home/junchen/FedNLP-master/FedML/fedml_api/distributed/utils/gpu_mapping.py", line 33, in mapping_processes_to_gpu_device_from_yaml_file
    assert i == worker_number
AssertionError
```
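For context, here is a simplified paraphrase of the failing check (reconstructed only from the log and traceback above, not the exact FedML source): `mapping_processes_to_gpu_device_from_yaml_file` enumerates every process slot declared under the chosen mapping key and asserts that the total equals `worker_number`, i.e., the number of processes actually launched.

```python
# Simplified paraphrase of the check in gpu_mapping.py, reconstructed from the
# log and traceback above; the real FedML implementation differs in details.

def map_process_to_gpu(process_id, worker_number, gpu_util):
    i = 0
    for host, gpu_counts in gpu_util.items():
        for gpu_id, num_procs in enumerate(gpu_counts):
            for _ in range(num_procs):
                if i == process_id:
                    print(f"process_id = {process_id}, GPU device = cuda:{gpu_id}")
                i += 1
    # The mapping [3] declares 3 process slots, but worker_number in the log is 1,
    # so i (3) != worker_number (1) and the assertion fails.
    assert i == worker_number

map_process_to_gpu(0, 3, {"ChaoyangHe-GPU-RTX2080Tix4": [3]})      # passes: 3 slots, 3 workers
try:
    map_process_to_gpu(0, 1, {"ChaoyangHe-GPU-RTX2080Tix4": [3]})  # mismatch, as in the log
except AssertionError:
    print("AssertionError: mapping declares more slots than processes launched")
```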

DeviRule commented 2 years ago

Can you send me the command you used to launch the script?

DeviRule commented 2 years ago

You would first modify the gpu mapping file and then launch with a command like `sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10`.

zjc664656505 commented 2 years ago

Yes. I used the `python fedavg_main_tc.py` command to run the experiment. I have double-checked the gpu mapping; it should be correct, since I'm only using a single GPU for training.

zjc664656505 commented 2 years ago

The problem with using the command you provided is:

```
junchen@lacrymosa:~/FedNLP-master/experiments/distributed/transformer_exps/run_tc_exps$ sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10 1
usage: fedavg_main_tc.py [-h] [--run_id RUN_ID] [--is_debug_mode IS_DEBUG_MODE] [--dataset N] [--data_file_path DATA_FILE_PATH] [--partition_file_path PARTITION_FILE_PATH] [--partition_method PARTITION_METHOD] [--model_type N] [--model_name N] [--do_lower_case N] [--train_batch_size N] [--eval_batch_size N] [--max_seq_length N] [--n_gpu EP] [--fp16] [--manual_seed N] [--output_dir N] [--fl_algorithm FL_ALGORITHM] [--backend BACKEND] [--comm_round COMM_ROUND] [--is_mobile IS_MOBILE] [--client_num_in_total NN] [--client_num_per_round NN] [--epochs EP] [--gradient_accumulation_steps EP] [--client_optimizer CLIENT_OPTIMIZER] [--lr LR] [--weight_decay N] [--server_optimizer SERVER_OPTIMIZER] [--server_lr SERVER_LR] [--server_momentum SERVER_MOMENTUM] [--fedprox_mu FEDPROX_MU] [--evaluate_during_training_steps EP] [--frequency_of_the_test FREQUENCY_OF_THE_TEST] [--gpu_mapping_file GPU_MAPPING_FILE] [--gpu_mapping_key GPU_MAPPING_KEY] [--ci CI] [--reprocess_input_data] [--freeze_layers N]
fedavg_main_tc.py: error: argument --client_num_per_round: expected one argument
```

This error appears regardless of what argument I give to --client_num_per_round.
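As an aside, here is a minimal standalone reproduction of this exact argparse message (hypothetical; this is not the FedNLP parser): if the shell variable substituted after `--client_num_per_round` expands to an empty string, argparse immediately sees the next flag and reports "expected one argument".

```python
# Hypothetical minimal reproduction of the argparse error above; this is not the
# FedNLP argument parser, just a demonstration of when that message appears.

import argparse

parser = argparse.ArgumentParser(prog="fedavg_main_tc.py")
parser.add_argument("--client_num_per_round", type=int, metavar="NN")
parser.add_argument("--comm_round", type=int)

# Equivalent to `--client_num_per_round $WORKER_NUM --comm_round 10` when
# $WORKER_NUM expands to an empty string in the launch script:
parser.parse_args(["--client_num_per_round", "--comm_round", "10"])
# -> fedavg_main_tc.py: error: argument --client_num_per_round: expected one argument
```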

zjc664656505 commented 2 years ago

Also, I have already defined all of the required arguments in the initializer.py file, so I'm not sure why this is happening.

DeviRule commented 2 years ago

Just run `sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10`; there is no need to add anything more, and the client number will adjust automatically. Make sure to modify --gpu_mapping_key and the dataset-related arguments in the bash script.

zjc664656505 commented 2 years ago

I tried this command multiple times and modified the sh file's gpu mapping key and dataset-related arguments, but it still does not work. The error looks like this:

```
junchen@lacrymosa:~/FedNLP-master/experiments/distributed/transformer_exps/run_tc_exps$ sh run_text_classification.sh FedOPT "uniform" 5e-5 0.1 10
expr: non-integer argument

[proxy:0:0@lacrymosa.ics.uci.edu] HYDU_create_process (utils/launch/launch.c:74): execvp error on file mpi_host_file (No such file or directory)
```

zjc664656505 commented 2 years ago

Here is the run_text_classification.sh file content:

```bash
FL_ALG=$1
PARTITION_METHOD=$2
C_LR=$3
S_LR=$4
ROUND=$5
WORKER_NUM=$3

LOG_FILE="fedavg_transformer_tc.log"
CI=0

DATA_DIR=~/fednlp_data/
DATA_NAME=20news
PROCESS_NUM=`expr $WORKER_NUM + 1`
echo $PROCESS_NUM

hostname > mpi_host_file

mpirun -np $PROCESS_NUM -hostfile mpi_host_file \
  python -m fedavg_main_tc \
  --gpu_mapping_file "gpu_mapping.yaml" \
  --gpu_mapping_key mapping_default \
  --client_num_per_round $WORKER_NUM \
  --comm_round $ROUND \
  --ci $CI \
  --dataset "${DATA_NAME}" \
  --data_file "${DATA_DIR}/data_files/${DATA_NAME}_data.h5" \
  --partition_file "${DATA_DIR}/partition_files/${DATA_NAME}_partition.h5" \
  --partition_method $PARTITION_METHOD \
  --fl_algorithm $FL_ALG \
  --model_type distilbert \
  --model_name distilbert-base-uncased \
  --do_lower_case True \
  --train_batch_size 32 \
  --eval_batch_size 8 \
  --max_seq_length 256 \
  --lr $C_LR \
  --server_lr $S_LR \
  --epochs 1 \
  --output_dir "/tmp/fedavg_${DATA_NAME}_output/"
```

zjc664656505 commented 2 years ago

I double-checked, and mpi_host_file is in my directory, so I'm confused about why this is happening.

DeviRule commented 2 years ago

@chaoyanghe Chaoyang, can you check on this? It seems related to FedML.

zjc664656505 commented 2 years ago

After reconfiguring the environment, the code finally runs. Thank you so much for your help!!