FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

One server and one client example in FedCV object detection #399

Open Adeelbek opened 2 years ago

Adeelbek commented 2 years ago

Hi,

I built the FedML platform using the Docker container provided by the authors. To check the performance of a simple one-server, one-client example, I ran the run_server.sh and run_client.sh scripts inside the object_detection/ directory. Then I started seeing the following output in the server and client terminals:

Server side:

mqtt_s3.send_message: msg topic = fedml_yolov5_0_2
mqtt_s3.send_message: msg topic = fedml_yolov5_0_2
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:249:send_message] mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [fedml_server_manager.py:101:handle_messag_connection_ready] Connection ready for client2
Connection ready for client2
Connection ready for client2
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:140:on_connected] mqtt_s3.on_connect: server subscribes
mqtt_s3.on_connect: server subscribes
mqtt_s3.on_connect: server subscribes
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:198:_on_message_impl] mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:172:_notify] mqtt_s3.notify: msg type = 5
mqtt_s3.notify: msg type = 5
mqtt_s3.notify: msg type = 5
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [server_manager.py:155:receive_message] receive_message. rank_id = 0, msg_type = 5.
receive_message. rank_id = 0, msg_type = 5.
receive_message. rank_id = 0, msg_type = 5.
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [fedml_server_manager.py:111:handle_message_client_status_update] self.client_online_mapping = {'1': True}
self.client_online_mapping = {'1': True}
self.client_online_mapping = {'1': True}
[FedML-Server(0) @device-id-0] [Thu, 21 Jul 2022 02:36:06] [INFO] [fedml_server_manager.py:126:handle_message_client_status_update] sender_id = 1, all_client_is_online = False
sender_id = 1, all_client_is_online = False
sender_id = 1, all_client_is_online = False

Client side:

[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:33:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:148:on_connected] mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 17, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 17, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 17, result = 0
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:33:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:198:_on_message_impl] mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:33:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:172:_notify] mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:33:03] [INFO] [fedml_client_master_manager.py:162:send_client_status] send_client_status
send_client_status
send_client_status
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:33:03] [INFO] [client_manager.py:157:send_message] Sending message (type 5) to server
Sending message (type 5) to server
Sending message (type 5) to server
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:33:03] [INFO] [mqtt_s3_multi_clients_comm_manager.py:277:send_message] mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:34:04] [INFO] [fedml_client_master_manager.py:62:handle_message_connection_ready] Connection is ready!
Connection is ready!
Connection is ready!
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:34:04] [INFO] [mqtt_s3_multi_clients_comm_manager.py:148:on_connected] mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 19, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 19, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 19, result = 0
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:34:04] [INFO] [mqtt_s3_multi_clients_comm_manager.py:198:_on_message_impl] mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:34:04] [INFO] [mqtt_s3_multi_clients_comm_manager.py:172:_notify] mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:34:04] [INFO] [fedml_client_master_manager.py:162:send_client_status] send_client_status
send_client_status
send_client_status
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:34:04] [INFO] [client_manager.py:157:send_message] Sending message (type 5) to server
Sending message (type 5) to server
Sending message (type 5) to server
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:34:04] [INFO] [mqtt_s3_multi_clients_comm_manager.py:277:send_message] mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:35:05] [INFO] [fedml_client_master_manager.py:62:handle_message_connection_ready] Connection is ready!
Connection is ready!
Connection is ready!
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:35:05] [INFO] [mqtt_s3_multi_clients_comm_manager.py:148:on_connected] mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 21, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 21, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 21, result = 0
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:35:05] [INFO] [mqtt_s3_multi_clients_comm_manager.py:198:_on_message_impl] mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:35:05] [INFO] [mqtt_s3_multi_clients_comm_manager.py:172:_notify] mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:35:05] [INFO] [fedml_client_master_manager.py:162:send_client_status] send_client_status
send_client_status
send_client_status
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:35:05] [INFO] [client_manager.py:157:send_message] Sending message (type 5) to server
Sending message (type 5) to server
Sending message (type 5) to server
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:35:05] [INFO] [mqtt_s3_multi_clients_comm_manager.py:277:send_message] mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:36:06] [INFO] [fedml_client_master_manager.py:62:handle_message_connection_ready] Connection is ready!
Connection is ready!
Connection is ready!
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:148:on_connected] mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 23, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 23, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 23, result = 0
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:198:_on_message_impl] mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:172:_notify] mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:36:06] [INFO] [fedml_client_master_manager.py:162:send_client_status] send_client_status
send_client_status
send_client_status
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:36:06] [INFO] [client_manager.py:157:send_message] Sending message (type 5) to server
Sending message (type 5) to server
Sending message (type 5) to server
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:36:06] [INFO] [mqtt_s3_multi_clients_comm_manager.py:277:send_message] mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:37:07] [INFO] [fedml_client_master_manager.py:62:handle_message_connection_ready] Connection is ready!
Connection is ready!
Connection is ready!
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:37:07] [INFO] [mqtt_s3_multi_clients_comm_manager.py:148:on_connected] mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 25, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 25, result = 0
mqtt_s3.on_connect: client subscribes real_topic = fedml_yolov5_0_1, mid = 25, result = 0
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:37:07] [INFO] [mqtt_s3_multi_clients_comm_manager.py:198:_on_message_impl] mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
mqtt_s3.on_message: not use s3 pack
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:37:07] [INFO] [mqtt_s3_multi_clients_comm_manager.py:172:_notify] mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
mqtt_s3.notify: msg type = 6
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:37:07] [INFO] [fedml_client_master_manager.py:162:send_client_status] send_client_status
send_client_status
send_client_status
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:37:07] [INFO] [client_manager.py:157:send_message] Sending message (type 5) to server
Sending message (type 5) to server
Sending message (type 5) to server
[FedML-Client(1) @device-id-1] [Thu, 21 Jul 2022 02:37:07] [INFO] [mqtt_s3_multi_clients_comm_manager.py:277:send_message] mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent
mqtt_s3.send_message: MQTT msg sent

I suspect that there is some problem with the connection between the client and the server. I am using the same PC for both the client and the server. Any hints or suggestions are highly appreciated!

beiyuouo commented 2 years ago

It seems that the connection is successful, but in config/fedml_config.yaml the client_num_per_round is 2. You can either change it to 1 or launch two clients with bash run_client.sh 1 and bash run_client.sh 2.
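For reference, the relevant part of config/fedml_config.yaml for a single-client run would look roughly like the sketch below (the values are only illustrative, so adjust them to your own setup):

```yaml
# config/fedml_config.yaml (excerpt) -- illustrative values for a one-server, one-client run
train_args:
  federated_optimizer: "FedAvg"
  client_num_in_total: 1    # total number of clients participating in the experiment
  client_num_per_round: 1   # was 2; must not exceed the number of clients you actually launch
  comm_round: 10            # number of federated aggregation rounds
```

The server waits until client_num_per_round clients report online, which is presumably why the log above keeps showing all_client_is_online = False when only one client is running.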

By the way, you can find more advanced usage here.

Adeelbek commented 2 years ago

@beiyuouo Thanks for your quick and kind response. It's been a while since I started the training between the client and the server, but so far I haven't seen any details about the training status, neither the number of training epochs nor any intermediate accuracy calculation. Does FedML automatically calculate accuracy metrics, or should they be added to the code?

beiyuouo commented 2 years ago

@Adeelbek We already write a lot of metric log information during the training process, so users do not need to add it. Maybe something went wrong before the training process started. Can you provide more information? Also, did you run bootstrap.sh in config before starting the server? If not, you should run it first.

Adeelbek commented 2 years ago

Hi @beiyuouo, in my previous trial I did not run bootstrap.sh before starting the training. I stopped the training and then ran bash bootstrap.sh in the config directory. After that, I ran training with a one-server, two-client scenario, but I still cannot get any training metric information (mAP, AP, or recall). I am only getting the communication messages between the clients and the server, as shown above. Do I need to install additional libraries? Currently I have a Docker environment with OpenCV, seaborn, pandas, etc. preinstalled. The following are my environment details:


Package                 Version             
----------------------- --------------------
absl-py                 1.1.0               
addict                  2.4.0               
aliyun-log-python-sdk   0.7.9               
asttokens               2.0.5               
backcall                0.2.0               
backports.zoneinfo      0.2.1               
blis                    0.7.8               
boto3                   1.22.11             
botocore                1.25.11             
cachetools              5.2.0               
catalogue               2.0.7               
certifi                 2019.11.28          
cffi                    1.15.0              
chardet                 3.0.4               
charset-normalizer      2.0.12              
click                   8.1.3               
cmake                   3.22.4              
commonmark              0.9.1               
cycler                  0.11.0              
cymem                   2.0.6               
dataclasses             0.6                 
dateparser              1.1.1               
dbus-python             1.2.16              
decorator               5.1.1               
dill                    0.3.5.1             
docker-pycreds          0.4.0               
elastic-transport       8.1.2               
elasticsearch           8.2.0               
executing               0.8.3               
fedml                   0.7.210             
flatbuffers             2.0                 
fonttools               4.34.4              
future                  0.18.2              
gensim                  4.2.0               
gitdb                   4.0.9               
GitPython               3.1.27              
google-auth             2.9.1               
google-auth-oauthlib    0.4.6               
grpcio                  1.46.0              
h5py                    3.6.0               
idna                    2.8                 
importlib-metadata      4.12.0              
intel-openmp            2022.1.0            
iotop                   0.6                 
ipython                 8.4.0               
jedi                    0.18.1              
Jinja2                  3.1.2               
jmespath                1.0.0               
joblib                  1.1.0               
kiwisolver              1.4.4               
langcodes               3.3.0               
Markdown                3.4.1               
MarkupSafe              2.1.1               
matplotlib              3.5.2               
matplotlib-inline       0.1.3               
mkl                     2022.1.0            
mkl-include             2022.1.0            
MNN                     1.1.6               
mpi4py                  3.0.3               
multiprocess            0.70.13             
murmurhash              1.0.7               
nano                    0.10.0              
networkx                2.8                 
ninja                   1.10.2.3            
numpy                   1.22.3              
oauthlib                3.2.0               
onnx                    1.7.0               
onnx-simplifier         0.4.0               
onnxruntime             1.11.1              
onnxsim-no-ort          0.4.0               
opencv-python           4.6.0.66            
opencv-python-headless  4.6.0.66            
packaging               21.3                
paho-mqtt               1.6.1               
pandas                  1.4.3               
parso                   0.8.3               
pathtools               0.1.2               
pathy                   0.6.2               
pexpect                 4.8.0               
pickleshare             0.7.5               
Pillow                  9.1.0               
pip                     20.0.2              
preshed                 3.0.6               
promise                 2.3                 
prompt-toolkit          3.0.30              
protobuf                3.19.4              
psutil                  5.9.0               
ptyprocess              0.7.0               
pure-eval               0.2.2
pyasn1                  0.4.8               
pyasn1-modules          0.2.8               
pycocotools             2.0.4               
pycparser               2.21                
pydantic                1.9.1               
Pygments                2.12.0              
PyGObject               3.36.0              
pynvml                  11.4.1              
pyparsing               3.0.8               
python-apt              2.0.0+ubuntu0.20.4.7
python-dateutil         2.8.2               
pytz                    2022.1              
pytz-deprecation-shim   0.1.0.post0         
PyYAML                  5.3.1               
regex                   2022.3.2            
requests                2.27.1              
requests-oauthlib       1.3.1               
requests-unixsocket     0.2.0               
rich                    12.5.1              
rsa                     4.8                 
s3transfer              0.5.2               
scikit-learn            1.1.0rc1            
scipy                   1.8.0               
seaborn                 0.11.2              
sentry-sdk              1.5.12              
setproctitle            1.2.3               
setuptools              45.2.0              
shortuuid               1.0.9               
six                     1.14.0              
sklearn                 0.0                 
smart-open              6.0.0               
smmap                   5.0.0               
spacy                   3.4.0               
spacy-legacy            3.0.9               
spacy-loggers           1.0.3               
srsly                   2.4.3               
stack-data              0.3.0               
supervisor              4.2.4               
tbb                     2021.6.0            
tensorboard             2.9.1               
tensorboard-data-server 0.6.1               
tensorboard-plugin-wit  1.8.1               
thinc                   8.1.0               
thop                    0.1.1.post2207130030
threadpoolctl           3.1.0               
torch                   1.11.0              
torch-geometric         2.0.5               
torchvision             0.12.0              
tqdm                    4.64.0              
traitlets               5.3.0               
typer                   0.4.2               
typing-extensions       4.2.0               
tzdata                  2022.1              
tzlocal                 4.2                 
urllib3                 1.26.9              
wandb                   0.12.16             
wasabi                  0.9.1               
wcwidth                 0.2.5               
Werkzeug                2.1.2               
wget                    3.2                 
wheel                   0.34.2              
zipp                    3.8.1

beiyuouo commented 2 years ago

@Adeelbek Hi, could you run fedml env to provide more context information?

Adeelbek commented 2 years ago

Hi @beiyuouo, thanks for your support. The problem was actually solved after upgrading the torch version from 1.11.0 to 1.12.0+cu116. Anyone who uses the Docker image directly should probably double-check their CUDA driver version and torch compatibility. Currently I have 8 GPUs (RTX 3090) on my server PC, but when I run the fedml env command it shows a "No GPU devices" message in the terminal, as follows:

======== FedML (https://fedml.ai) ========
FedML version: 0.7.210
Execution path:/usr/local/lib/python3.8/dist-packages/fedml/__init__.py

======== Running Environment ========
OS: Linux-5.4.0-117-generic-x86_64-with-glibc2.29
Hardware: x86_64
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08) 
[GCC 9.4.0]
PyTorch version: 1.12.0+cu116
MPI4py is installed

======== CPU Configuration ========
The CPU usage is : 26%
Available CPU Memory: 205.3 G / 376.5395622253418G

======== GPU Configuration ========
No GPU devices
fedml@gpusystem:/home/gpuadmin/OPD/FedML$ nvidia-smi 
Tue Jul 26 00:45:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 65%   68C    P2   229W / 350W |  23697MiB / 24576MiB |     40%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 30%   37C    P8    23W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:1F:00.0 Off |                  N/A |
| 30%   38C    P8    21W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:20:00.0 Off |                  N/A |
| 30%   35C    P8    24W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 68%   70C    P2   303W / 350W |  15389MiB / 24576MiB |     68%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:22:00.0 Off |                  N/A |
| 76%   72C    P2   312W / 350W |  15163MiB / 24576MiB |     90%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:23:00.0 Off |                  N/A |
| 88%   74C    P2   304W / 350W |  15165MiB / 24576MiB |     88%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:24:00.0 Off |                  N/A |
| 71%   70C    P2   312W / 350W |  15163MiB / 24576MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+----------------------------------------------------

Do you have any idea why it can't recognize the GPUs? While running training, I tried to use predefined GPUs by creating gpu_mapping.yaml, and training does use all the GPUs in the way I pre-assigned them. However, GPU usage is very low, around 1~3% of GPU memory. Is this normal?
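For reference, the gpu_mapping.yaml I used follows the pattern below; the mapping name, host name, and per-GPU process counts are only illustrative, and the mapping key has to match the one referenced from fedml_config.yaml:

```yaml
# gpu_mapping.yaml (illustrative) -- each entry maps a host to a list with one slot per GPU,
# and the value in each slot is the number of worker processes placed on that GPU.
mapping_config_8gpu:
  gpusystem: [1, 1, 0, 0, 0, 0, 0, 0]   # e.g. server process on GPU 0, one client process on GPU 1

# config/fedml_config.yaml (excerpt, illustrative) -- point the device settings at the mapping above
# device_args:
#   using_gpu: true
#   gpu_mapping_file: config/gpu_mapping.yaml
#   gpu_mapping_key: mapping_config_8gpu
```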

Adeelbek commented 2 years ago

Hi there, I trained YOLOv5 on the server and the client for 120 epochs. However, I did not get any stored weights for the server or the client in the predefined directory, which is ~/object_detection/runs/. What could be the problem? One more thing: in the ./config/fedml_config.yaml file, I see the weights are initialized as weights='none'. Why not just use pretrained weights that are publicly available in the model (YOLOv5, v6, v7) GitHub repos (e.g. weights='yolov5s.pt')?

Adeelbek commented 2 years ago

(screenshot: no_weights)

beiyuouo commented 2 years ago

Yes, you're right. You can use a pretrained model by setting the weights in the config file. If you are using MLOps, you can download the final model directly. But if you are using the simulation scheme, it may not save checkpoints currently.
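For example, pointing the object detection config at a public YOLOv5 checkpoint could look like the sketch below; the key layout is only illustrative, and the checkpoint file must be reachable from the working directory or given as a full path:

```yaml
# config/fedml_config.yaml (excerpt) -- illustrative
model_args:
  model: "yolov5"
  weights: "yolov5s.pt"   # pretrained checkpoint from the YOLOv5 releases, instead of weights: 'none'
```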

xierongpytorch commented 2 years ago

Yes, you're right. You can use a pretrained model by setting the weights in the config file. If you are using MLOps, you can download the final model directly. But if you are using the simulation scheme, it may not save checkpoints currently.

Thanks for your suggestion. I tried changing weights: "yolov5s.py", but runs/train/exp10/weights is still empty. How can I check the effect of training? Thanks!

Adeelbek commented 2 years ago

@xierongpytorch Actually, you can enable wandb in your configuration file to see the training details during distributed training. At least, I was able to see the effect of training in wandb when I used their old platform, which has been deleted from their previous GitHub repo. In the current object detection task, they included a wandb enable/disable option in the config/fedml_config.yaml file, defined as enable_wandb: false; you can simply change false to true. However, when you enable the wandb option, you will run into many runtime errors that may not be solvable, so currently the effect of training cannot be seen without fixing bugs or adding some code.
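For example, the tracking section of config/fedml_config.yaml might look like the sketch below once wandb is turned on; the key and project names are placeholders, not the exact values from the repo:

```yaml
# config/fedml_config.yaml (excerpt) -- illustrative wandb settings
tracking_args:
  enable_wandb: true                    # was false
  wandb_key: "<your-wandb-api-key>"     # placeholder; use your own API key
  wandb_project: "fedml"
  wandb_name: "fedml_yolov5_object_detection"
```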

czstudio commented 2 years ago

I also cannot get any training metric information (mAP, AP, or recall). wandb only shows "BusyTime, ...", and the command line only shows mloss.

xierongpytorch commented 2 years ago

@xierongpytorch Actually, you can enable wandb in your configuration file to see the training details during distributed training. At least, I was able to see the effect of training in wandb when I used their old platform, which has been deleted from their previous GitHub repo. In the current object detection task, they included a wandb enable/disable option in the config/fedml_config.yaml file, defined as enable_wandb: false; you can simply change false to true. However, when you enable the wandb option, you will run into many runtime errors that may not be solvable, so currently the effect of training cannot be seen without fixing bugs or adding some code.

Thanks for the instructive advice! I followed your suggestion and got wandb working successfully, but I only get a lot of time-related metrics. I think I don't understand the meaning of the FedCV parameters; how can I learn more about the parameter details? Also, I would sincerely like to ask how to view the training weights. Many thanks!