Training MDM - GitHub issues

rainofmine commented 1 year ago

I tried to train MDM on HumanML3D with the provided training script, but the loss is NaN and the predicted results are incorrect. Is anything wrong?

By the way, an error occurs when running training with --eval_during_training or --train_platform_type {ClearmlPlatform, TensorboardPlatform}.

sigal-raab commented 1 year ago

Hi @rainofmine,

Can you please provide the exact command line that you used?
Can you please provide the standard output of the training? Also: the evaluation (triggered by --eval_during_training) takes ~90 minutes. During those 90 minutes, no loss is printed. However, the loss at iteration 0 should be printed before the evaluation starts and should not be NaN.
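
To pin down the iteration at which the loss first turns NaN, each logged value can be checked as it is produced. The following is a minimal sketch in plain Python; `first_non_finite` and the sample loss values are hypothetical and are not part of the MDM code.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


if __name__ == "__main__":
    # Hypothetical loss values, for illustration only.
    sample = [0.91, 0.54, float("nan"), 0.40]
    print(first_non_finite(sample))
```

If the first non-finite value appears at iteration 0, the cause is more likely in the data or the environment than in the optimization itself.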

rainofmine commented 1 year ago

@sigal-raab I just ran the provided script:

python -m train.train_mdm --save_dir save/my_humanml_trans_enc_512 --dataset humanml

The training log

sigal-raab commented 1 year ago

@rainofmine, thank you for the information. According to the command line above, the NaN loss also occurs without --eval_during_training or --train_platform_type {ClearmlPlatform, TensorboardPlatform}. Is that correct? We cannot reproduce the problem in our environment, so I'd like to find out the differences:

What operating system are you using and which version?
What cuda version are you using?
Please send us the details regarding the conda environment in which you run the training: run "conda env export > environment.yml" and then attach the file environment.yml to your answer.

rainofmine commented 1 year ago

@sigal-raab

Ubuntu 18.04
cuda 10.1
Pytorch 1.7.1

The environment:

name: mdm
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - beautifulsoup4=4.11.1=pyha770c72_0
  - blas=1.0=mkl
  - brotlipy=0.7.0=py37h540881e_1004
  - ca-certificates=2022.9.24=ha878542_0
  - catalogue=2.0.8=py37h89c1867_0
  - certifi=2022.9.24=pyhd8ed1ab_0
  - cffi=1.15.1=py37h74dc2b5_0
  - charset-normalizer=2.1.1=pyhd8ed1ab_0
  - colorama=0.4.5=pyhd8ed1ab_0
  - cryptography=35.0.0=py37hf1a17b8_2
  - cudatoolkit=11.0.221=h6bb024c_0
  - cycler=0.11.0=pyhd3eb1b0_0
  - cymem=2.0.6=py37hd23a5d3_3
  - dataclasses=0.8=pyhc8e2a94_3
  - dbus=1.13.18=hb2f20db_0
  - expat=2.4.9=h6a678d5_0
  - fftw=3.3.9=h27cfd23_1
  - filelock=3.8.0=pyhd8ed1ab_0
  - fontconfig=2.13.1=h6c09931_0
  - freetype=2.11.0=h70c0345_0
  - gdown=4.5.1=pyhd8ed1ab_0
  - giflib=5.2.1=h7b6447c_0
  - glib=2.69.1=h4ff587b_1
  - gst-plugins-base=1.14.0=h8213a91_2
  - gstreamer=1.14.0=h28cd5cc_2
  - h5py=3.7.0=py37h737f45e_0
  - hdf5=1.10.6=h3ffc7dd_1
  - icu=58.2=he6710b0_3
  - idna=3.4=pyhd8ed1ab_0
  - intel-openmp=2021.4.0=h06a4308_3561
  - jinja2=3.1.2=pyhd8ed1ab_1
  - jpeg=9b=h024ee3a_2
  - kiwisolver=1.4.2=py37h295c915_0
  - langcodes=3.3.0=pyhd8ed1ab_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=11.2.0=h1234567_1
  - libgfortran-ng=11.2.0=h00389a5_1
  - libgfortran5=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libpng=1.6.37=hbc83047_0
  - libstdcxx-ng=11.2.0=h1234567_1
  - libtiff=4.1.0=h2733197_1
  - libuuid=1.0.3=h7f8727e_2
  - libuv=1.40.0=h7b6447c_0
  - libwebp=1.2.0=h89dd481_0
  - libxcb=1.15=h7f8727e_0
  - libxml2=2.9.14=h74e7548_0
  - lz4-c=1.9.3=h295c915_1
  - markupsafe=2.1.1=py37h540881e_1
  - matplotlib=3.1.3=py37_0
  - matplotlib-base=3.1.3=py37hef1b27d_0
  - mkl=2021.4.0=h06a4308_640
  - mkl-service=2.4.0=py37h7f8727e_0
  - mkl_fft=1.3.1=py37hd3c417c_0
  - mkl_random=1.2.2=py37h51133e4_0
  - ncurses=6.3=h5eee18b_3
  - ninja=1.10.2=h06a4308_5
  - ninja-base=1.10.2=hd09550d_5
  - numpy=1.21.5=py37h6c91a56_3
  - numpy-base=1.21.5=py37ha15fc14_3
  - openssl=1.1.1q=h7f8727e_0
  - packaging=21.3=pyhd8ed1ab_0
  - pathy=0.6.2=pyhd8ed1ab_0
  - pcre=8.45=h295c915_0
  - pillow=9.2.0=py37hace64e9_1
  - pip=22.2.2=py37h06a4308_0
  - pycparser=2.21=pyhd8ed1ab_0
  - pydantic=1.8.2=py37h5e8e339_2
  - pyopenssl=22.0.0=pyhd8ed1ab_1
  - pyparsing=3.0.9=py37h06a4308_0
  - pyqt=5.9.2=py37h05f1152_2
  - pysocks=1.7.1=py37h89c1867_5
  - python=3.7.13=h12debd9_0
  - python-dateutil=2.8.2=pyhd3eb1b0_0
  - python_abi=3.7=2_cp37m
  - pytorch=1.7.1=py3.7_cuda11.0.221_cudnn8.0.5_0
  - qt=5.9.7=h5867ecd_1
  - readline=8.1.2=h7f8727e_1
  - requests=2.28.1=pyhd8ed1ab_1
  - scipy=1.7.3=py37h6c91a56_2
  - setuptools=63.4.1=py37h06a4308_0
  - shellingham=1.5.0=pyhd8ed1ab_0
  - sip=4.19.8=py37hf484d3e_0
  - six=1.16.0=pyhd3eb1b0_1
  - smart_open=5.2.1=pyhd8ed1ab_0
  - soupsieve=2.3.2.post1=pyhd8ed1ab_0
  - spacy=3.3.1=py37h79cecc1_0
  - spacy-legacy=3.0.10=pyhd8ed1ab_0
  - spacy-loggers=1.0.3=pyhd8ed1ab_0
  - sqlite=3.39.3=h5082296_0
  - tk=8.6.12=h1ccaba5_0
  - torchaudio=0.7.2=py37
  - torchvision=0.8.2=py37_cu110
  - tornado=6.2=py37h5eee18b_0
  - tqdm=4.64.1=py37h06a4308_0
  - trimesh=3.15.3=pyh1a96a4e_0
  - typer=0.4.2=pyhd8ed1ab_0
  - wheel=0.37.1=pyhd3eb1b0_0
  - xz=5.2.6=h5eee18b_0
  - zipp=3.8.1=pyhd8ed1ab_0
  - zlib=1.2.12=h5eee18b_3
  - zstd=1.4.9=haebb681_0
  - pip:
    - attrs==22.1.0
    - blis==0.7.8
    - blobfile==2.0.0
    - chumpy==0.70
    - clearml==1.7.1
    - click==8.1.3
    - clip==1.0
    - confection==0.0.2
    - en-core-web-sm==3.3.0
    - ftfy==6.1.1
    - furl==2.1.3
    - future==0.18.2
    - importlib-metadata==5.0.0
    - importlib-resources==5.10.0
    - jsonschema==4.16.0
    - lxml==4.9.1
    - murmurhash==1.0.8
    - orderedmultidict==1.0.1
    - pathlib2==2.3.7.post1
    - pkgutil-resolve-name==1.3.10
    - preshed==3.0.7
    - psutil==5.9.2
    - pycryptodomex==3.15.0
    - pyjwt==2.4.0
    - pyrsistent==0.18.1
    - pyyaml==6.0
    - regex==2022.9.13
    - smplx==0.1.28
    - srsly==2.4.4
    - thinc==8.0.17
    - typing-extensions==4.1.1
    - urllib3==1.26.12
    - wasabi==0.10.1
    - wcwidth==0.2.5

sigal-raab commented 1 year ago

@rainofmine, Here are my environment details:

Ubuntu 18.04
Cuda 11.1
Pytorch 1.12.1+cu102
packages as in the environment.yml file in this repo.

Comparing this with your environment, I guess the greatest difference is the CUDA version; 10.1 is old. Can you upgrade to CUDA 11.1 (or newer) and let me know whether the problem is solved? If this does not help, try adjusting the PyTorch version. Also, please let me know which Python version you are using; our code has been tested on Python 3.7. Next in order of priority are the pip package versions, which are slightly different from the ones in our environment.yml file.
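
The details compared in this thread (OS, Python, PyTorch, and the CUDA version PyTorch was built with) can be collected in one go. This is a minimal sketch, assuming PyTorch may or may not be installed; `describe_environment` is a hypothetical helper and not part of the MDM repository.

```python
import importlib
import platform
import sys


def describe_environment():
    """Collect OS, Python, and (if installed) PyTorch/CUDA version strings."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        info["torch_cuda"] = None
    else:
        info["torch"] = torch.__version__
        # CUDA version PyTorch was built against (None for CPU-only builds).
        info["torch_cuda"] = torch.version.cuda
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Note that the CUDA version reported by PyTorch can differ from the system-wide CUDA installation, so it is worth reporting both.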

sigal-raab commented 1 year ago

@rainofmine, I noticed that you closed the issue. Is it because it was solved? If so, what was the solution? I am asking because maybe others will encounter the same problem and your answer will help them.

Ying156209 commented 1 year ago

Hi, I encountered the same problem as you. Did you find a solution? Thanks, @rainofmine

Kai-0515 commented 1 year ago

I have the same question. How did you solve it? Thanks!

Kai-0515 commented 1 year ago

Hi, I wonder how you solved this problem. Looking forward to your reply, thanks!

sigal-raab commented 1 year ago

@Kai-0515, my advice to @rainofmine was to upgrade his CUDA version to 11.1 or higher. He did not reply, so I don't know whether it worked for him. However, he closed the issue, which may indicate that the problem was solved. I re-opened the issue because of your question. If your CUDA version is relatively old, will you try installing a newer one and report the results?

Kai-0515 commented 1 year ago

Thanks for your reply. My CUDA version is 11.4, and the other settings are the same as the ones you provided. I also tried reducing the batch size to 8, but the loss is still NaN. I processed the HumanML3D dataset as its authors describe and evaluated it, so the dataset should be fine. Are there any other settings in the code that may influence the results?

sigal-raab commented 1 year ago

@Kai-0515, do you encounter the same problem when working with humanact12 or uestc? Even if those are not the datasets you want to work with, your answer may help us figure out the cause of this issue.

Kai-0515 commented 1 year ago

I will give it a try.

Kai-0515 commented 1 year ago

I found the problem: some of the data in HumanML3D is broken, which I discovered when evaluating it with the method HumanML3D provides. The broken data is between 3000 and 5000. After cleaning it, I get the right result.

sigal-raab commented 1 year ago

@Kai-0515, I am glad it works for you now. @GuyTevet, can you double-check the data?

GuyTevet commented 1 year ago

@Kai-0515, can you please open an issue in https://github.com/EricGuo5513/HumanML3D? It is possible that there is a bug in the data pre-processing.

ShungJhon commented 1 year ago

@Kai-0515 Hi, I encountered the same problem. How did you find the broken data?

ShungJhon commented 1 year ago

I found that the broken data in my case is 004355.npy and M004355.npy, using cal_mean_variance.ipynb provided by HumanML3D. After removing them, the loss is computed normally.
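
Broken motion files like these can also be located with a direct scan for non-finite values. This is a minimal sketch, assuming the motions are stored as numeric .npy arrays and that "broken" means a file that fails to load or contains NaN/inf; `find_broken_motions` is a hypothetical helper, not part of HumanML3D or MDM, and the directories to scan are passed on the command line.

```python
import sys
from pathlib import Path

import numpy as np


def find_broken_motions(directory):
    """Return sorted names of .npy files that fail to load or hold NaN/inf."""
    broken = []
    for path in sorted(Path(directory).glob("*.npy")):
        try:
            data = np.load(path)
        except (OSError, ValueError, EOFError):
            # Unreadable or truncated files are reported as broken too.
            broken.append(path.name)
            continue
        if not np.isfinite(data).all():
            broken.append(path.name)
    return broken


if __name__ == "__main__":
    # Usage: python find_broken.py <motion_dir> [<motion_dir> ...]
    for directory in sys.argv[1:]:
        print(directory, find_broken_motions(directory))
```

The scan only reports file names; removing or regenerating the affected files is left as a manual step so that nothing is deleted by accident.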

qiqiApink commented 1 year ago

@ShungJhon Did you remove the broken data from the train_set or the val_set list file?

ShungJhon commented 1 year ago

@qiqiApink I removed them from the new_joints and new_joint_vecs directories in dataset/HumanML3D/.

GuyTevet / motion-diffusion-model

Training MDM #18