IntelLabs / matsciml

Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery. It supports widely used materials science datasets and is built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.
MIT License

[Bug]: m3gnet_dgl example fails due to AttributeError: 'Tensor' object has no attribute 'system_embedding' #92

Closed: JonathanSchmidt1 closed this issue 10 months ago

JonathanSchmidt1 commented 10 months ago

### Expected behavior

The m3gnet_dgl example runs without error.

### Actual behavior

The example crashes during the first epoch:

```
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0)/charset_normalizer (3.3.2) doesn't match a supported version!
  warnings.warn(
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 10 batch(es). Logging and checkpointing is suppressed.
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py:411: UserWarning: A layer with UninitializedParameter was found. Thus, the total number of parameters detected may be inaccurate.
  warning_cache.warn(

  | Name         | Type       | Params
--------------------------------------------
0 | encoder      | M3GNet     | 273 K 
1 | loss_func    | MSELoss    | 0     
2 | output_heads | ModuleDict | 0     
--------------------------------------------
273 K     Trainable params
0         Non-trainable params
273 K     Total params
1.093     Total estimated model params size (MB)
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1595: PossibleUserWarning: The number of training batches (10) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 175.32it/s]
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/matgl/graph/data.py:286: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:261.)
  state_attrs = torch.tensor(state_attrs)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 364.18it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 501.59it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 590.58it/s]
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/dgl/backend/pytorch/tensor.py:352: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert input.numel() == input.storage().size(), "Cannot convert view " \
Traceback (most recent call last):
  File "/home/sjonathan/Downloads/matsciml_alexandria/examples/model_demos/m3gnet_dgl.py", line 23, in <module>
    trainer.fit(task, datamodule=dm)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 121, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/optim/adamw.py", line 161, in step
    loss = closure()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 107, in _wrap_closure
    closure_result = closure()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 933, in training_step
    loss_dict = self._compute_losses(batch)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 897, in _compute_losses
    predictions = self(batch)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 795, in forward
    outputs = self.process_embedding(embedding)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 816, in process_embedding
    output = head(embeddings.system_embedding)
AttributeError: 'Tensor' object has no attribute 'system_embedding'
Epoch 0:   0%|          | 0/20 [00:00<?, ?it/s]
```
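
As far as I can tell from the traceback, `process_embedding` expects an embedding container exposing a `system_embedding` attribute, while the M3GNet wrapper returns a bare `torch.Tensor`. A minimal sketch of the mismatch (the `Embeddings` container below is hypothetical, named after the attribute in the traceback, not the actual matsciml type):

```python
import torch

# Hypothetical container, named after the attribute in the traceback;
# the real matsciml embedding type may differ.
class Embeddings:
    def __init__(self, system_embedding: torch.Tensor) -> None:
        self.system_embedding = system_embedding

def process_embedding(embeddings):
    # Mirrors the failing line in matsciml/models/base.py:
    #     output = head(embeddings.system_embedding)
    return embeddings.system_embedding

# Works when the encoder returns a container with the expected attribute:
print(process_embedding(Embeddings(torch.randn(4, 64))).shape)

# Fails the same way as the example when a bare Tensor comes back:
try:
    process_embedding(torch.randn(4, 64))
except AttributeError as err:
    print(err)  # 'Tensor' object has no attribute 'system_embedding'
```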

### Steps to reproduce the problem

Run: `python m3gnet_dgl.py`

### Specifications

```
absl-py==1.4.0
aiohttp==3.8.5
aioitertools==0.11.0
aiosignal==1.3.1
alabaster==0.7.13
annotated-types==0.5.0
anyio==3.7.0
argon2-cffi @ file:///opt/conda/conda-bld/argon2-cffi_1645000214183/work
argon2-cffi-bindings @ file:///tmp/build/80754af9/argon2-cffi-bindings_1644553347904/work
arrow==1.2.3
ase==3.22.1
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.3
atomate==1.0.3
atomate2 @ file:///home/sjonathan/Downloads/atomate2_other_forcefields
attrs==23.1.0
Babel @ file:///croot/babel_1671781930836/work
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
bandit==1.7.6
bcrypt==4.0.1
beautifulsoup4==4.12.2
biopython==1.81
black==23.12.1
bleach==6.0.0
boto3==1.28.4
botocore==1.31.4
bracex==2.3.post1
brotlipy==0.7.0
cachelib==0.9.0
cachetools==5.3.0
castepxbin==0.2.0
cclib==1.8
CellConstructor==1.3.2
certifi @ file:///croot/certifi_1671487769961/work/certifi
cffi @ file:///croot/cffi_1670423208954/work
cfgv==3.4.0
chardet==5.2.0
charset-normalizer==3.3.2
chemview==0.6
chgnet==0.2.0
click==8.1.7
cloudpickle==3.0.0
colorama==0.4.6
colormath==3.0.0
comm==0.1.3
contextlib2==21.6.0
contourpy==1.0.7
cryptography @ file:///croot/cryptography_1677533068310/work
crystal-toolkit==2023.6.1
crystaltoolkit-extension==0.6.0
custodian==2023.7.22
cycler==0.11.0
Cython==3.0.2
dash==2.10.2
dash-core-components==2.0.0
dash-html-components==2.0.0
dash-mp-components==0.4.34
dash-table==5.0.0
debugpy==1.6.7
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
defusedxml @ file:///tmp/build/80754af9/defusedxml_1615228127516/work
dgl==0.9.1
dgllife==0.3.2
distlib==0.3.8
dnspython==2.3.0
docstring-parser==0.15
docutils==0.20.1
dpdata==0.2.15
dscribe==2.1.0
e3nn==0.5.1
einops==0.7.0
email-validator==2.1.0.post1
emmet-core==0.64.0
entrypoints @ file:///tmp/build/80754af9/entrypoints_1649908313000/work
exceptiongroup==1.1.1
executing==1.2.0
f90wrap==0.2.13
fabric==3.1.0
fastapi==0.100.0
fastcore==1.5.29
fasteners==0.18
fastjsonschema==2.17.1
fforces==0.1
filelock==3.12.2
FireWorks==2.0.3
flake8==7.0.0
flake8-bandit==4.1.1
flake8-black==0.3.6
Flake8-pyproject==1.2.3
Flask==2.2.5
Flask-Caching==2.0.2
flask-paginate==2022.1.8
flatbuffers==23.3.3
flit_core @ file:///opt/conda/conda-bld/flit-core_1644941570762/work/source/flit_core
fonttools==4.39.0
fqdn==1.5.1
frozenlist==1.4.0
fsspec==2023.9.2
future==0.18.3
gast==0.4.0
gdown==4.7.1
geometric-algebra-attention==0.5.1
gitdb==4.0.11
GitPython==3.1.41
google-auth==2.17.1
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
greenlet==3.0.3
GridDataFormats==1.0.1
grpcio==1.53.0
gunicorn==20.1.0
h11==0.14.0
h5py==3.8.0
hiphive==1.1
httpcore==1.0.2
httpx==0.26.0
hyperopt==0.2.7
identify==2.5.33
idna @ file:///croot/idna_1666125576474/work
imageio==2.31.0
imagesize==1.4.1
importlib-metadata==6.8.0
importlib-resources==6.1.1
inflect==6.0.4
iniconfig==2.0.0
invoke==2.1.3
ipykernel==6.23.2
ipython==8.14.0
ipython-genutils @ file:///tmp/build/80754af9/ipython_genutils_1606773439826/work
ipywidgets==7.7.5
isoduration==20.11.0
itsdangerous==2.1.2
jax==0.4.8
jedi==0.18.2
Jinja2 @ file:///croot/jinja2_1666908132255/work
jmespath==1.0.1
jobflow==0.1.13
joblib==1.2.0
json5 @ file:///tmp/build/80754af9/json5_1624432770122/work
jsonargparse==4.27.1
jsonpointer==2.3
jsonschema @ file:///croot/jsonschema_1676558650973/work
julia==0.6.1
jupyter @ file:///tmp/abs_33h4eoipez/croots/recipe/jupyter_1659349046347/work
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.2.0
jupyter_core==5.3.1
jupyter_server==2.6.0
jupyter_server_terminals==0.4.4
jupyterlab @ file:///croot/jupyterlab_1675354114448/work
jupyterlab-pygments==0.2.2
jupyterlab-widgets==1.1.4
jupyterlab_server @ file:///croot/jupyterlab_server_1677143054853/work
kaleido==0.2.1
keras==2.12.0
kiwisolver==1.4.4
lark==1.1.8
latexcodec==2.0.1
lazy_loader==0.2
libclang==16.0.0
lightning-utilities==0.9.0
llvmlite==0.39.1
lmdb==1.3.0
lobsterpy==0.3.0
lovely-numpy==0.2.8
lxml @ file:///opt/conda/conda-bld/lxml_1657545139709/work
mace @ file:///home/sjonathan/Downloads/mace
maggma==0.56.0
Markdown==3.4.3
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matgl==0.8.5
matminer==0.8.0
matplotlib==3.7.1
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
-e git+https://github.com/JonathanSchmidt1/matsciml_alexandria.git@9568e18f0d3546cbd565a87655a88bff9bf45d42#egg=matsciml
matscipy==0.8.0
mccabe==0.7.0
MDAnalysis==2.6.0
mdtraj==1.9.7
mdurl==0.1.2
mendeleev==0.14.0
mistune==2.0.5
ml-dtypes==0.0.4
mmtf-python==1.1.3
mongogrant==0.3.3
mongomock==4.1.2
monty==2023.9.5
mp-api==0.33.3
mpi4py==3.1.4
mpmath==1.3.0
mrcfile==1.4.3
msgpack==1.0.5
multidict==6.0.4
munch==2.5.0
mypy-extensions==1.0.0
nbclassic==1.0.0
nbclient==0.8.0
nbconvert==7.5.0
nbformat==5.9.0
nequip @ file:///home/sjonathan/fusessh/dgx3/nequip2/nequip3/nequip
nest-asyncio @ file:///croot/nest-asyncio_1672387112409/work
networkx==3.0
nglview==3.0.5
nodeenv==1.8.0
notebook==6.5.4
notebook_shim==0.2.3
numba==0.56.4
numpy==1.23.5
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py3==7.352.0
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
openai==0.28.1
opt-einsum==3.3.0
opt-einsum-fx==0.1.4
optimade==1.0.1
orjson==3.9.2
overrides==7.3.1
packaging==23.1
palettable==3.3.0
pandas==1.5.3
pandocfilters @ file:///opt/conda/conda-bld/pandocfilters_1643405455980/work
paramiko==3.2.0
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
pathspec==0.12.1
pbr==6.0.0
pdyna @ file:///home/sjonathan/Downloads/PDynA
periodictable==1.6.1
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
phonopy==2.20.0
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.4.0
platformdirs==4.1.0
plotly==5.13.1
pluggy==1.0.0
ply==3.11
pooch==1.7.0
pre-commit==3.6.0
prettytable==3.7.0
prometheus-client==0.17.0
prompt-toolkit==3.0.38
protobuf==4.22.1
psutil==5.9.5
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
PubChemPy==1.0.4
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
py4j==0.10.9.7
py4vasp==0.7.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.11.1
pybtex==0.24.0
pycodestyle==2.11.1
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==1.10.12
pydantic-settings==2.0.3
pydantic_core==2.14.6
pydash==7.0.5
pyfiglet==0.8.post1
pyflakes==3.2.0
Pygments==2.15.1
pymatgen==2023.7.20
pymatgen-analysis-defects==2023.8.22
pymatgen-analysis-diffusion==2022.7.21
pymatgen-db==2023.2.23
pymongo==4.3.3
PyNaCl==1.5.0
pynndescent==0.5.8
pyOpenSSL @ file:///croot/pyopenssl_1677607685877/work
pyparsing==3.0.9
PyProcar==6.0.0
PyQt5-sip==12.11.0
pyrsistent==0.19.3
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
pysr==0.12.0
pytest==7.3.2
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-dotenv==1.0.0
python-json-logger==2.0.7
python-sscha==1.3.2.1
pytorch-lightning==1.8.6
pytz==2022.7.1
pyvista==0.40.1
PyWavelets==1.4.1
PyYAML==6.0.1
pyzmq==24.0.1
qtconsole==5.4.3
QtPy==2.3.1
quippy-ase==0.9.14
rdkit==2023.3.1
requests==2.28.2
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.0
robocrys==0.2.8
rowan==1.3.0.post1
rsa==4.9
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.7
s3transfer==0.6.1
schema==0.7.5
scikit-image==0.21.0
scikit-learn==1.0
scipy==1.10.1
scooby==0.7.2
seaborn==0.12.2
seekpath==2.1.0
Send2Trash==1.8.2
sentinels==1.0.0
sentry-sdk==1.26.0
shakenbreak==3.0.0
shapely==2.0.1
sip @ file:///tmp/abs_44cd77b_pu/croots/recipe/sip_1659012365470/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sniffio==1.3.0
snowballstemmer==2.2.0
soupsieve==2.4.1
sparse==0.14.0
spglib==2.0.2
Sphinx==7.2.6
sphinx-argparse==0.4.0
sphinx-pdj-theme==0.2.1
sphinxcontrib-applehelp==1.0.7
sphinxcontrib-devhelp==1.0.5
sphinxcontrib-htmlhelp==2.0.4
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.6
sphinxcontrib-serializinghtml==1.1.9
SQLAlchemy==2.0.25
sshtunnel==0.4.0
stack-data==0.6.2
starlette==0.27.0
stevedore==5.1.0
sumo==2.3.5
sympy==1.11.1
tabulate==0.9.0
tdscha==1.0.1
tenacity==8.2.2
tensorboard==2.12.1
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6.2.2
tensorflow==2.12.0
tensorflow-estimator==2.12.0
tensorflow-io-gcs-filesystem==0.32.0
termcolor==2.2.0
terminado @ file:///croot/terminado_1671751832461/work
threadpoolctl==3.1.0
tifffile==2023.4.12
tinycss2 @ file:///croot/tinycss2_1668168815555/work
toml @ file:///tmp/build/80754af9/toml_1616166611790/work
tomli @ file:///opt/conda/conda-bld/tomli_1657175507142/work
torch==2.1.2
torch-cluster==1.6.3
torch-ema==0.3
torch-runstats==0.2.0
torch-scatter==2.1.2
torch-sparse==0.6.18
torchmetrics==1.2.0
tornado==6.3.2
tqdm==4.65.0
trainstation==1.0
traitlets==5.9.0
trimesh==3.22.3
triton==2.1.0
typeshed-client==2.4.0
typing==3.7.4.3
typing_extensions==4.8.0
umap-learn==0.5.3
uncertainties==3.1.7
uri-template==1.2.0
urllib3==2.1.0
uvicorn==0.23.2
Vapory==0.1.2
virtualenv==20.25.0
vtk==9.2.6
wcmatch==8.4.1
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.3
Werkzeug==2.2.3
widgetsnbextension==3.6.4
wrapt==1.14.1
yarl==1.9.2
zipp==3.16.2
```

melo-gonzo commented 10 months ago

Thanks for submitting this issue! We'll have this fixed in an upcoming PR.

JonathanSchmidt1 commented 10 months ago

I still have some problems getting matgl to run. To get it running on multiple GPUs I had to change some PyTorch code to force some generators onto the GPU (roughly the kind of change sketched below); however, I cannot get it to run with multiple workers for the dataloading this way. Do you manage to train matgl on multiple GPUs with a non-zero worker count? If so, which torch, dgl, and matgl versions are you using?
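
Roughly the kind of change I mean (illustrative only; the actual call sites in PyTorch/matgl differ):

```python
import torch

# Illustrative only; the actual call sites in torch/matgl differ. The failing
# code constructed a CPU torch.Generator, so forcing it onto the GPU looks
# roughly like this:
generator = (
    torch.Generator(device="cuda")
    if torch.cuda.is_available()
    else torch.Generator()
)
generator.manual_seed(0)

# A random draw then happens on the generator's device:
perm = torch.randperm(8, generator=generator, device=generator.device)
print(perm)
```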

laserkelvin commented 10 months ago

@JonathanSchmidt1 can you open a new issue and share the script/code changes you needed to make to get it running? In the issue, please share the error messages as well.

@melo-gonzo and I can help diagnose.

melo-gonzo commented 10 months ago

@JonathanSchmidt1 Thanks for bringing this up. This is an issue I have come across as well, and there are a few ways I've been able to get around it. For starters, the matgl team has recommended adding torch.set_default_device("cuda") at the top of m3gnet training scripts, so we're putting together a PR to include that in the example (a minimal sketch follows below). Unfortunately, I have not been able to resolve the num_workers issue using this method.
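
For reference, a minimal sketch of that recommendation (the `cuda` availability guard is my addition so the script still runs on CPU-only machines):

```python
import torch

# Per the recommendation above: set the default device before the model and
# dataloaders are built, so tensors m3gnet creates internally land on the GPU.
# Guarded so the script still runs on CPU-only machines.
if torch.cuda.is_available():
    torch.set_default_device("cuda")

# ... rest of the m3gnet training script (model, datamodule, trainer.fit(...))
```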

To get training running with multiple GPUs and num_workers > 0, I had to create a new personal branch and add some tensor placement calls where needed (roughly the kind of edit sketched below). Going this route lets GPU training run without extra calls (torch.set_default_device("cuda")) and lets you set num_workers.
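
Roughly the kind of placement edit I mean (illustrative, not the actual branch diff; the batch layout is assumed):

```python
import torch

# Illustrative only, not the actual branch diff: move every item of a batch
# dict onto a target device. `.to(device)` is duck-typed here, so it covers
# torch.Tensor and dgl.DGLGraph values alike.
def move_batch_to(batch: dict, device: torch.device) -> dict:
    return {
        key: value.to(device) if hasattr(value, "to") else value
        for key, value in batch.items()
    }

# Hypothetical usage inside a training step:
# batch = move_batch_to(batch, self.device)
```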

Happy to keep looking for more solutions. @laserkelvin may have some ideas to try out as well.