Adeelbek opened this issue 2 years ago
It seems that the connection is successful. But in `config/fedml_config.yaml`, `client_num_per_round` is 2. You can change it to 1, or you can launch two clients with `bash run_client.sh 1` and `bash run_client.sh 2`.
By the way, you can find more advanced usage here.
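For reference, the relevant setting lives in the training section of the config. This excerpt is illustrative: the key names follow common FedML example configs, and your version's layout may differ.

```yaml
# Hypothetical excerpt of config/fedml_config.yaml
train_args:
  client_num_in_total: 2
  client_num_per_round: 2   # set to 1 if you only launch one client
```

The server waits until `client_num_per_round` clients have connected, which is why a single client appears to hang when this value is 2.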
@beiyuouo Thanks for your quick and kind response. It's been a while since I started the training between the client and server, but so far I haven't seen any details or information about training status: neither the number of training epochs nor any intermediate accuracy calculation. Does FedML automatically calculate accuracy metrics, or should that be added inside the code?
@Adeelbek We write a lot of metric log information during the training process, which users do not need to add themselves. Maybe you had some problem before the training process started. Can you provide more information? Also, did you run the `bootstrap.sh` in `config` before starting the server? Maybe you should run it first.
Hi @beiyuouo,
In my previous trial, I did not run the `bootstrap.sh` file before running the training. I stopped the training and then ran `bash bootstrap.sh` from the `config` directory. After that, I ran training with the one-server, two-client scenario, but I still cannot get any training metric information (mAP, AP, or Recall). I am getting only the communication information between clients and server, as I indicated above. Do I need to install additional libraries? Currently, I have a Docker environment with preinstalled OpenCV, seaborn, pandas, etc. The following are my environment details:
Package Version
----------------------- --------------------
absl-py 1.1.0
addict 2.4.0
aliyun-log-python-sdk 0.7.9
asttokens 2.0.5
backcall 0.2.0
backports.zoneinfo 0.2.1
blis 0.7.8
boto3 1.22.11
botocore 1.25.11
cachetools 5.2.0
catalogue 2.0.7
certifi 2019.11.28
cffi 1.15.0
chardet 3.0.4
charset-normalizer 2.0.12
click 8.1.3
cmake 3.22.4
commonmark 0.9.1
cycler 0.11.0
cymem 2.0.6
dataclasses 0.6
dateparser 1.1.1
dbus-python 1.2.16
decorator 5.1.1
dill 0.3.5.1
docker-pycreds 0.4.0
elastic-transport 8.1.2
elasticsearch 8.2.0
executing 0.8.3
fedml 0.7.210
flatbuffers 2.0
fonttools 4.34.4
future 0.18.2
gensim 4.2.0
gitdb 4.0.9
GitPython 3.1.27
google-auth 2.9.1
google-auth-oauthlib 0.4.6
grpcio 1.46.0
h5py 3.6.0
idna 2.8
importlib-metadata 4.12.0
intel-openmp 2022.1.0
iotop 0.6
ipython 8.4.0
jedi 0.18.1
Jinja2 3.1.2
jmespath 1.0.0
joblib 1.1.0
kiwisolver 1.4.4
langcodes 3.3.0
Markdown 3.4.1
MarkupSafe 2.1.1
matplotlib 3.5.2
matplotlib-inline 0.1.3
mkl 2022.1.0
mkl-include 2022.1.0
MNN 1.1.6
mpi4py 3.0.3
multiprocess 0.70.13
murmurhash 1.0.7
nano 0.10.0
networkx 2.8
ninja 1.10.2.3
numpy 1.22.3
oauthlib 3.2.0
onnx 1.7.0
onnx-simplifier 0.4.0
onnxruntime 1.11.1
onnxsim-no-ort 0.4.0
opencv-python 4.6.0.66
opencv-python-headless 4.6.0.66
packaging 21.3
paho-mqtt 1.6.1
pandas 1.4.3
parso 0.8.3
pathtools 0.1.2
pathy 0.6.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.1.0
pip 20.0.2
preshed 3.0.6
promise 2.3
prompt-toolkit 3.0.30
protobuf 3.19.4
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycocotools 2.0.4
pycparser 2.21
pydantic 1.9.1
Pygments 2.12.0
PyGObject 3.36.0
pynvml 11.4.1
pyparsing 3.0.8
python-apt 2.0.0+ubuntu0.20.4.7
python-dateutil 2.8.2
pytz 2022.1
pytz-deprecation-shim 0.1.0.post0
PyYAML 5.3.1
regex 2022.3.2
requests 2.27.1
requests-oauthlib 1.3.1
requests-unixsocket 0.2.0
rich 12.5.1
rsa 4.8
s3transfer 0.5.2
scikit-learn 1.1.0rc1
scipy 1.8.0
seaborn 0.11.2
sentry-sdk 1.5.12
setproctitle 1.2.3
setuptools 45.2.0
shortuuid 1.0.9
six 1.14.0
sklearn 0.0
smart-open 6.0.0
smmap 5.0.0
spacy 3.4.0
spacy-legacy 3.0.9
spacy-loggers 1.0.3
srsly 2.4.3
stack-data 0.3.0
supervisor 4.2.4
tbb 2021.6.0
tensorboard 2.9.1
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
thinc 8.1.0
thop 0.1.1.post2207130030
threadpoolctl 3.1.0
torch 1.11.0
torch-geometric 2.0.5
torchvision 0.12.0
tqdm 4.64.0
traitlets 5.3.0
typer 0.4.2
typing-extensions 4.2.0
tzdata 2022.1
tzlocal 4.2
urllib3 1.26.9
wandb 0.12.16
wasabi 0.9.1
wcwidth 0.2.5
Werkzeug 2.1.2
wget 3.2
wheel 0.34.2
zipp 3.8.1
@Adeelbek Hi, could you run `fedml env` to provide more context information?
Hi @beiyuouo,
Thanks for your support. Actually, the problem was solved after upgrading the torch version from 1.11.0 to 1.12.0+cu116. Anyone who uses the Docker image directly should probably double-check their CUDA driver version against their torch build. Currently, I have 8 GPUs (3090) on my server PC, but when I run the `fedml env` command, it shows a "no GPU" message in the terminal, as follows:
======== FedML (https://fedml.ai) ========
FedML version: 0.7.210
Execution path:/usr/local/lib/python3.8/dist-packages/fedml/__init__.py
======== Running Environment ========
OS: Linux-5.4.0-117-generic-x86_64-with-glibc2.29
Hardware: x86_64
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0]
PyTorch version: 1.12.0+cu116
MPI4py is installed
======== CPU Configuration ========
The CPU usage is : 26%
Available CPU Memory: 205.3 G / 376.5395622253418G
======== GPU Configuration ========
No GPU devices
fedml@gpusystem:/home/gpuadmin/OPD/FedML$ nvidia-smi
Tue Jul 26 00:45:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:1D:00.0 Off | N/A |
| 65% 68C P2 229W / 350W | 23697MiB / 24576MiB | 40% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:1E:00.0 Off | N/A |
| 30% 37C P8 23W / 350W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:1F:00.0 Off | N/A |
| 30% 38C P8 21W / 350W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:20:00.0 Off | N/A |
| 30% 35C P8 24W / 350W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... Off | 00000000:21:00.0 Off | N/A |
| 68% 70C P2 303W / 350W | 15389MiB / 24576MiB | 68% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... Off | 00000000:22:00.0 Off | N/A |
| 76% 72C P2 312W / 350W | 15163MiB / 24576MiB | 90% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... Off | 00000000:23:00.0 Off | N/A |
| 88% 74C P2 304W / 350W | 15165MiB / 24576MiB | 88% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... Off | 00000000:24:00.0 Off | N/A |
| 71% 70C P2 312W / 350W | 15163MiB / 24576MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Do you have any idea why it can't recognize the GPUs?
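A common pattern behind "nvidia-smi works but the framework reports no GPUs" is the process environment, not the driver. The sketch below gathers the usual suspects; the idea that `fedml env` relies on an NVML binding is an assumption, and `pynvml` is only probed, not required.

```python
import os
import shutil


def gpu_visibility_checks():
    """Collect common reasons why a tool reports 'No GPU devices'
    even though nvidia-smi sees the cards. Returns a list of findings."""
    findings = []

    # 1. CUDA_VISIBLE_DEVICES="" hides every GPU from CUDA applications.
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
    if cvd == "":
        findings.append("CUDA_VISIBLE_DEVICES is set but empty: all GPUs hidden")

    # 2. Inside a container, GPU access must be granted at launch
    #    (e.g. `docker run --gpus all`); otherwise the driver libraries
    #    and nvidia-smi are simply absent from the container.
    if shutil.which("nvidia-smi") is None:
        findings.append("nvidia-smi not on PATH: container may lack --gpus all")

    # 3. If an NVML binding cannot load the driver library, tools built
    #    on it report zero GPUs. (Assumption: fedml env uses such a binding.)
    try:
        import pynvml
        pynvml.nvmlInit()
        findings.append(f"NVML sees {pynvml.nvmlDeviceGetCount()} device(s)")
        pynvml.nvmlShutdown()
    except Exception as exc:
        findings.append(f"NVML unavailable: {exc}")

    return findings


if __name__ == "__main__":
    for f in gpu_visibility_checks():
        print(f)
```

Running this inside the same container where `fedml env` fails should narrow the cause down to environment, container launch flags, or the NVML binding.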
While running training, I tried to use predefined GPUs by creating `gpu_mapping.yaml`. It is actually using all the GPUs in the same fashion that I pre-assigned. However, the GPU usage is very low, like 1~3% of GPU memory. Is this normal?
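For context, a GPU mapping file for an 8-GPU host might look like the fragment below. This is only a sketch based on common FedML GPU-mapping examples; the mapping name, host name, and exact schema are assumptions and must match `gpu_mapping_key` in your `fedml_config.yaml`.

```yaml
# Hypothetical gpu_mapping.yaml — one worker process per GPU on one host.
mapping_config_8gpu:
  gpusystem: [1, 1, 1, 1, 1, 1, 1, 1]   # hostname: processes assigned per GPU
```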
Hi there,
I trained YOLOv5 on the server and clients for 120 epochs. However, I haven't got any stored weights for the server or clients in the predefined directory, which is `~/object_detection/runs/`. What could be the problem?
One more thing: in the `./config/fedml_config.yaml` file, I see the weights are initialized as `weights='none'`. Why don't we just use pretrained weights that are publicly available in the model (YOLOv5, v6, v7) GitHub repos (like `weights='yolov5s.pt'`)?
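Concretely, that would mean a config change along these lines. The section name `model_args` is an assumption based on typical FedML example configs; adjust it to wherever `weights` lives in your file.

```yaml
# Hypothetical excerpt — start from a public checkpoint instead of scratch.
model_args:
  weights: "yolov5s.pt"   # e.g. from the ultralytics/yolov5 release assets
```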
Yes, you're right. You can use a pretrained model by setting the `weights` in the config file. If you are using MLOps, you can download the final model directly. But if you are in the simulation scheme, it may not save checkpoints currently.
Thanks for your suggestion. I tried changing `weights: "yolov5s.py"`, but `runs/train/exp10/weights` is still empty. How can I know the effect of training? Thanks!
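Since the simulation scheme may not write checkpoints on its own, one workaround is to add a save step to the aggregation loop yourself. This is a minimal, dependency-free sketch: the function name and hook point are hypothetical, and real YOLOv5 code would call `torch.save()` on the model's `state_dict` instead of `pickle`.

```python
import os
import pickle


def save_round_checkpoint(state_dict, run_dir, round_idx):
    """Persist the aggregated model after a federated round.

    state_dict: the (already CPU-resident) model parameters, any picklable form
    run_dir:    e.g. "runs/train/exp10" — a 'weights' subfolder is created
    round_idx:  federated round number, used in the filename
    """
    weights_dir = os.path.join(run_dir, "weights")
    os.makedirs(weights_dir, exist_ok=True)
    path = os.path.join(weights_dir, f"round_{round_idx}.pt")
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)
    return path
```

Calling this once per round from wherever the server finishes aggregation would at least guarantee the `weights` directory is no longer empty.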
@xierongpytorch Actually, you can enable wandb from your configuration file to see the training details while doing distributed training. At least, I was able to see the effect of training from wandb when I used their old platform, which has been deleted from their previous GitHub repo. In the current object detection task, they included a wandb enable/disable option in the `config/fedml_config.yaml` file, defined as `enable_wandb: false`. You can simply enable it by setting `false` to `true`. However, when you enable the wandb option, you will come across many runtime errors that might not be solvable. So currently, the effect of training cannot be seen without fixing bugs or adding some code.
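If wandb itself is too fragile, the per-round evaluation metrics can still be surfaced with plain logging. This is a sketch of a helper you would call from the aggregator after evaluation; the hook point and metric names are assumptions, not FedML API.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")


def log_round_metrics(round_idx, metrics):
    """Format and log evaluation metrics for one aggregation round.

    metrics: a plain dict such as {"mAP": 0.41, "recall": 0.55}.
    Returns the formatted line so it can also be written to a file.
    """
    line = f"round {round_idx}: " + ", ".join(
        f"{k}={v:.4f}" for k, v in sorted(metrics.items()))
    logging.info(line)
    return line
```

Even without wandb, grepping these lines out of the server log gives a rough training curve over rounds.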
I also cannot get any training metric information (mAP, AP, or Recall); wandb only shows "BusyTime, ...", and the command line only shows mloss.
Thanks for the instructive advice! I followed your suggestion and got wandb running, but I only get a lot of "Time" charts; I think I don't understand the meaning of the FedCV parameters. How can I learn more parameter details? Also, may I sincerely ask how to view the training weights? Many thanks!
Hi,
I built the FedML platform using the Docker container provided by the authors. To see the performance of a simple one-server, one-client example, I ran the `run_server.sh` and `run_client.sh` scripts inside the `object_detection/` directory. Then I started receiving the following state in the server and client terminals.
Server side:
Client side:
I suspect that there is some problem in the connection between the client and server. I am using the same PC for both the client and the server. Any hint or suggestion is highly appreciated!!!