allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.43k stars 643 forks

Training takes 2x longer since 1.13.0 with FastAI #1234

Open mhtrinh opened 3 months ago

mhtrinh commented 3 months ago

Training our model, which is based on FastAI, takes 2x longer with ClearML 1.13.0 compared to 1.12.2.

There are no errors or warnings.

I cannot share our code. Here is the requirements.txt of the virtualenv:

absl-py==2.1.0
adal==1.2.7
adlfs==2023.8.0
aiobotocore==2.5.4
aiohttp==3.9.3
aioitertools==0.11.0
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.13.1
applicationinsights==0.11.10
argcomplete==3.1.6
async-timeout==4.0.3
attrs==23.2.0
azure-appconfiguration==1.1.1
azure-batch==14.1.0
azure-cli==2.58.0
azure-cli-core==2.58.0
azure-cli-telemetry==1.1.0
azure-common==1.1.28
azure-core==1.30.1
azure-cosmos==3.2.0
azure-data-tables==12.4.0
azure-datalake-store==0.0.53
azure-graphrbac==0.60.0
azure-identity==1.15.0
azure-keyvault-administration==4.4.0b2
azure-keyvault-certificates==4.7.0
azure-keyvault-keys==4.9.0b3
azure-keyvault-secrets==4.7.0
azure-mgmt-advisor==9.0.0
azure-mgmt-apimanagement==4.0.0
azure-mgmt-appconfiguration==3.0.0
azure-mgmt-appcontainers==2.0.0
azure-mgmt-applicationinsights==1.0.0
azure-mgmt-authorization==4.0.0
azure-mgmt-batch==17.2.0
azure-mgmt-batchai==7.0.0b1
azure-mgmt-billing==6.0.0
azure-mgmt-botservice==2.0.0
azure-mgmt-cdn==12.0.0
azure-mgmt-cognitiveservices==13.5.0
azure-mgmt-compute==30.4.0
azure-mgmt-containerinstance==10.1.0
azure-mgmt-containerregistry==10.3.0
azure-mgmt-containerservice==29.1.0
azure-mgmt-core==1.4.0
azure-mgmt-cosmosdb==9.4.0
azure-mgmt-databoxedge==1.0.0
azure-mgmt-datalake-nspkg==3.0.1
azure-mgmt-datalake-store==0.5.0
azure-mgmt-datamigration==10.0.0
azure-mgmt-devtestlabs==4.0.0
azure-mgmt-dns==8.0.0
azure-mgmt-eventgrid==10.2.0b2
azure-mgmt-eventhub==10.1.0
azure-mgmt-extendedlocation==1.0.0b2
azure-mgmt-hdinsight==9.0.0
azure-mgmt-imagebuilder==1.3.0
azure-mgmt-iotcentral==10.0.0b2
azure-mgmt-iothub==3.0.0
azure-mgmt-iothubprovisioningservices==1.1.0
azure-mgmt-keyvault==10.3.0
azure-mgmt-kusto==0.3.0
azure-mgmt-loganalytics==13.0.0b4
azure-mgmt-managedservices==1.0.0
azure-mgmt-managementgroups==1.0.0
azure-mgmt-maps==2.0.0
azure-mgmt-marketplaceordering==1.1.0
azure-mgmt-media==9.0.0
azure-mgmt-monitor==5.0.1
azure-mgmt-msi==7.0.0
azure-mgmt-netapp==10.1.0
azure-mgmt-nspkg==3.0.2
azure-mgmt-policyinsights==1.1.0b4
azure-mgmt-privatedns==1.0.0
azure-mgmt-rdbms==10.2.0b15
azure-mgmt-recoveryservices==2.5.0
azure-mgmt-recoveryservicesbackup==8.0.0
azure-mgmt-redhatopenshift==1.4.0
azure-mgmt-redis==14.3.0
azure-mgmt-resource==23.1.0b2
azure-mgmt-search==9.1.0
azure-mgmt-security==5.0.0
azure-mgmt-servicebus==8.2.0
azure-mgmt-servicefabric==1.0.0
azure-mgmt-servicefabricmanagedclusters==1.0.0
azure-mgmt-servicelinker==1.2.0b1
azure-mgmt-signalr==2.0.0b1
azure-mgmt-sql==4.0.0b15
azure-mgmt-sqlvirtualmachine==1.0.0b5
azure-mgmt-storage==21.1.0
azure-mgmt-synapse==2.1.0b5
azure-mgmt-trafficmanager==1.0.0
azure-mgmt-web==7.2.0
azure-monitor-query==1.2.0
azure-multiapi-storage==1.2.0
azure-nspkg==3.0.2
azure-storage-blob==12.19.1
azure-storage-common==1.4.2
azure-synapse-accesscontrol==0.5.0
azure-synapse-artifacts==0.18.0
azure-synapse-managedprivateendpoints==0.4.0
azure-synapse-spark==0.2.0
bcrypt==4.1.2
blis==0.7.11
botocore==1.31.17
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
clearml==1.14.2
click==8.1.7
cloudpathlib==0.16.0
colorama==0.4.6
confection==0.1.4
contextlib2==21.6.0
contourpy==1.2.0
cryptography==42.0.5
cycler==0.12.1
cymem==2.0.8
decorator==5.1.1
Deprecated==1.2.14
distro==1.9.0
fabric==3.2.2
fastai==2.7.14
fastcore==1.5.29
fastdownload==0.0.7
fastprogress==1.0.3
filelock==3.13.1
fonttools==4.49.0
frozenlist==1.4.1
fsspec==2023.6.0
furl==2.1.3
gitdb==4.0.11
GitPython==3.1.42
google-auth==2.28.1
google-auth-oauthlib==1.0.0
grpcio==1.62.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
iniconfig==2.0.0
invoke==2.2.0
isodate==0.6.1
javaproperties==0.5.2
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.3.2
jsondiff==2.0.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
knack==0.11.0
kornia==0.7.1
lakefs-client==0.107.0
langcodes==3.3.0
Markdown==3.5.2
MarkupSafe==2.1.5
matplotlib==3.8.3
ml-collections==0.1.1
mpmath==1.3.0
msal==1.26.0
msal-extensions==1.0.0
msrest==0.7.1
msrestazure==0.6.4
multidict==6.0.5
murmurhash==1.0.10
networkx==3.2.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
onnx==1.14.0
opencv-python-headless==4.8.1.78
orderedmultidict==1.0.1
packaging==23.2
pandas==2.2.1
paramiko==3.4.0
pathlib2==2.3.7.post1
Pillow==9.5.0
pipdeptree==2.13.0
pkginfo==1.10.0
portalocker==2.8.2
preshed==3.0.9
protobuf==4.25.3
psutil==5.9.8
pyarrow==13.0.0
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycomposefile==0.0.30
pycparser==2.21
pydantic==2.6.3
pydantic_core==2.16.3
PyGithub==1.59.1
Pygments==2.17.2
PyJWT==2.4.0
PyNaCl==1.5.0
pyOpenSSL==24.0.0
pyparsing==3.1.2
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
referencing==0.33.0
requests==2.31.0
requests-oauthlib==1.3.1
rpds-py==0.18.0
rsa==4.9
s3fs==2023.6.0
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
scp==0.13.6
seaborn==0.13.2
self-supervised==1.0.4
semver==2.13.0
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
spacy==3.7.4
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
sshtunnel==0.1.5
sympy==1.12
tabulate==0.9.0
tensorboard==2.14.0
tensorboard-data-server==0.7.2
thinc==8.2.3
threadpoolctl==3.3.0
timm==0.9.16
torch==2.2.1
torchvision==0.17.1
tqdm==4.66.2
triton==2.2.0
typer==0.9.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==1.26.18
wasabi==1.1.2
weasel==0.3.4
websocket-client==1.3.3
Werkzeug==3.0.1
wrapt==1.16.0
xmltodict==0.13.0
yarl==1.9.4

To reproduce: pip install clearml==1.12.2, run the training code, then pip install clearml==1.13.0 and re-run the same code.
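One way to put a number on the slowdown is to time the same training call under each installed version. The snippet below is only a minimal sketch: `dummy_training` is a hypothetical stand-in for the real (unshared) training entry point, and `installed_version` and `time_call` are illustrative helper names, not part of ClearML.

```python
import time
from importlib import metadata


def installed_version(package="clearml"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def dummy_training():
    # Hypothetical stand-in for the real training entry point
    # (for example, a fastai fit call); replace with your own.
    return sum(i * i for i in range(100_000))


_, seconds = time_call(dummy_training)
print(f"clearml={installed_version()} elapsed={seconds:.3f}s")
```

Running this once per virtualenv (one with 1.12.2, one with 1.13.0) gives two elapsed times that can be compared directly.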

OS: openSUSE Leap 15.4

OS  Linux-5.14.21-150400.24.60-default-x86_64-with-glibc2.31
cpu_cores 20
gpu_count 1
gpu_driver_cuda_version 12.4
gpu_driver_version 550.54.14
gpu_memory 48GB
gpu_type NVIDIA RTX A6000

AlexandruBurlacu commented 3 months ago

Hey @mhtrinh, have you observed the same slowdown with the newer versions of ClearML? The most recent one is 1.14.4

mhtrinh commented 3 months ago

Yes, this also happens with the current version, 1.14.4: training is still 2x slower.

Note: this may be specific to fastai, as we have another network based on yolov5 where this does not happen.

eugen-ajechiloae-clearml commented 3 months ago

Hi @mhtrinh! It looks like calculating the metrics that ClearML reports may take a long time. We will try to improve performance. In the meantime, you could disable the fastai bindings by passing auto_connect_frameworks={"fastai": False} to Task.init.
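A minimal sketch of that workaround, assuming a standard Task.init call. The project and task names are placeholders, and `task_init_kwargs` and `init_task` are illustrative helpers rather than ClearML API. As far as I know, frameworks not listed in the dictionary keep their default automatic logging.

```python
# Placeholder names; replace with your own project and task.
PROJECT_NAME = "fastai-experiments"
TASK_NAME = "train-without-fastai-binding"


def task_init_kwargs():
    """Build the keyword arguments for Task.init with fastai auto-logging off."""
    return {
        "project_name": PROJECT_NAME,
        "task_name": TASK_NAME,
        # Disable only the fastai binding, as suggested in this thread.
        "auto_connect_frameworks": {"fastai": False},
    }


def init_task():
    """Create the ClearML task; call this before building the fastai Learner."""
    # Imported lazily so this module loads even where clearml is not installed.
    from clearml import Task

    return Task.init(**task_init_kwargs())
```

Call `init_task()` at the top of the training script. Metrics that the fastai binding would have captured automatically are then no longer reported unless logged manually.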

eugen-ajechiloae-clearml commented 1 month ago

Hi @mhtrinh! We will release a fix for this issue in the next clearml release, clearml==1.16.0.

pollfly commented 1 month ago

Hey @mhtrinh! Just letting you know that this issue has been resolved in the recently released v1.16.0. Let us know if there are any issues :)