NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: aero_graph_net failed to load dataset #710

Open willyawan16 opened 1 week ago

willyawan16 commented 1 week ago

Version

0.8.0

On which installation method(s) does this occur?

No response

Describe the issue

The dataset fails to load when I try to train aero_graph_net. Is there any way to fix this?

It gets stuck during the Hydra instantiation, as shown in the error log.

Minimum reproducible example

Relevant log output

[18:55:46 - agnet - INFO] Loading the training dataset...
Error executing job with overrides: ['+experiment=ahmed/mgn', 'data.data_dir=./data/ahmed_body']
concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/process.py", line 392, in wait_result_broken_or_wakeup
    result_item = result_reader.recv()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 496, in rebuild_storage_fd
    fd = df.detach()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
    return _target_(*args, **kwargs)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/modulus/datapipes/gnn/ahmed_body_dataset.py", line 219, in __init__
    for (i, graph, coeff, normal, area) in executor.map(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 267, in <module>
    main()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 219, in main
    trainer = MGNTrainer(cfg)
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 54, in __init__
    self.dataset = instantiate(cfg.data.train)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 226, in instantiate
    return instantiate_node(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 347, in instantiate_node
    return _call_target(_target_, partial, args, kwargs, full_key)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 97, in _call_target
    raise InstantiationException(msg) from e
hydra.errors.InstantiationException: Error in call to target 'modulus.datapipes.gnn.ahmed_body_dataset.AhmedBodyDataset':
BrokenProcessPool('A process in the process pool was terminated abruptly while the future was running or pending.')
full_key: data.train

Environment details

Alexey-Kamenev commented 1 week ago

Can you please double-check that the path in data.data_dir=./data/ahmed_body is correct? Maybe try using an absolute path, just to check?

Also, try limiting the number of dataset pre-fetching workers: data.train.num_workers=1

Finally, see if the example works with a reduced dataset, for example, using only 2 train samples: data.train.num_samples=2
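
The three suggestions above can be combined into a single set of command-line overrides. The following is only a minimal sketch: the helper name `build_debug_overrides` is hypothetical and not part of Modulus or Hydra. It just resolves the data directory to an absolute path and formats the override strings already mentioned in this thread.

```python
from pathlib import Path


def build_debug_overrides(data_dir, num_workers=1, num_samples=2):
    """Return the override strings suggested above (hypothetical helper)."""
    # Resolve to an absolute path so the result does not depend on the
    # directory the training script is launched from.
    abs_dir = Path(data_dir).expanduser().resolve()
    return [
        "+experiment=ahmed/mgn",
        f"data.data_dir={abs_dir}",
        f"data.train.num_workers={num_workers}",
        f"data.train.num_samples={num_samples}",
    ]


overrides = build_debug_overrides("./data/ahmed_body")
# The list can then be appended to the usual command, for example:
#   subprocess.run([sys.executable, "train.py", *overrides], check=True)
print(" ".join(overrides))
```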

willyawan16 commented 1 week ago

HYDRA_FULL_ERROR=1 python train.py +experiment=ahmed/mgn data.data_dir=/home/willy/modulus/modulus/examples/cfd/aero_graph_net/data/ahmed_body data.train.num_workers=1 data.val.num_workers=1 data.test.num_workers=1 data.train.num_samples=10 data.val.num_samples=5 data.test.num_samples=5

I changed my command to the one above, and it got past the dataset loading problem. But why does it return the same error when I set num_samples higher than that?

Alexey-Kamenev commented 1 week ago

So any data.train.num_samples value greater than 10 causes that error to appear? Judging from the error itself, it looks like something goes wrong during dataset pre-loading in one of the graph-loading processes. Unfortunately, I could not reproduce the issue on my side.
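
One possible cause, which is not confirmed anywhere in this thread, is the limit on open file descriptors. The log shows `received 0 items of ancdata` raised from `rebuild_storage_fd`, meaning tensors are being passed between processes as file descriptors, and that message is commonly reported when the per-process descriptor limit is exhausted. The sketch below uses only the standard `resource` module (Unix only) to inspect and, if allowed, raise the soft limit. The function names are hypothetical, and whether this resolves the issue here is an assumption.

```python
import resource


def describe_fd_limit():
    """Return the current soft and hard limits on open file descriptors."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {"soft": soft, "hard": hard}


def raise_soft_fd_limit(target):
    """Raise the soft limit towards `target`, never above the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


print(describe_fd_limit())
```

Another commonly suggested workaround for this message is calling `torch.multiprocessing.set_sharing_strategy("file_system")` before the dataset is created. That has not been tested against this example either.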

You can try adding some simple prints to the create_graph function to see if there is a particular file or place where the error occurs (and keep num_workers=1 to simplify the debugging).
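
The print-based debugging described above could look roughly like the sketch below. It is illustrative only: `find_failing_inputs` and `load_fn` are hypothetical names, with `load_fn` standing in for whatever call builds a graph from one input file. The actual signature of create_graph is not shown in this thread, so it is not assumed here.

```python
import os
import traceback


def find_failing_inputs(paths, load_fn):
    """Call load_fn on each path in order, printing progress as it goes.

    Returns a list of (path, error) pairs for inputs that raised.
    """
    failures = []
    for index, path in enumerate(paths):
        # flush=True so the line is visible even if the process dies right after.
        print(f"[pid {os.getpid()}] loading sample {index}: {path}", flush=True)
        try:
            load_fn(path)
        except Exception as exc:
            traceback.print_exc()
            failures.append((path, repr(exc)))
    return failures
```

If a worker is killed abruptly rather than raising a Python exception, nothing is caught, but the last printed path still indicates which input was being processed at that moment.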

Also, which environment does this issue happen in?