bessagroup / f3dasm

Framework for Data-Driven Design & Analysis of Structures & Materials (F3DASM)
https://f3dasm.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Failure to wait for the data_object creation #218

Closed SNMS95 closed 4 months ago

SNMS95 commented 10 months ago

When I run a design of experiments (DoE) with f3dasm, a few nodes occasionally fail with the following error and quit.

Error executing job with overrides: ['++hpc.jobid=4', 'hp_tune.model=baseline', 'hp_tune.model_seed=-1']
Traceback (most recent call last):
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 271, in _from_file_attempt
    domain = Domain.from_file(Path(f"{filename}_domain"))
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/domain.py", line 71, in from_file
    raise FileNotFoundError(f"Domain file {filename} does not exist.")
FileNotFoundError: Domain file exp_data_baseline_domain does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 145, in from_file
    return cls._from_file_attempt(filename, text_io)
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 283, in _from_file_attempt
    raise FileNotFoundError(f"Cannot find the file {filename}_data.csv.")
FileNotFoundError: Cannot find the file exp_data_baseline_data.csv.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 271, in _from_file_attempt
    domain = Domain.from_file(Path(f"{filename}_domain"))
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/domain.py", line 71, in from_file
    raise FileNotFoundError(f"Domain file {filename} does not exist.")
FileNotFoundError: Domain file /gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/exp_data_baseline_domain does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/main.py", line 80, in main_func
    process(config)
  File "/gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/main.py", line 62, in process
    data = f3dasm.ExperimentData.from_file(filename='exp_data_{}'.format(
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 152, in from_file
    return cls._from_file_attempt(filename_with_path, text_io)
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 283, in _from_file_attempt
    raise FileNotFoundError(f"Cannot find the file {filename}_data.csv.")
FileNotFoundError: Cannot find the file /gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/exp_data_baseline_data.csv.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I am using version 1.3.0.

mpvanderschelling commented 10 months ago

Hey Surya,

Can you share more of the code you are using?

SNMS95 commented 10 months ago

Hey Martin,

I will try to condense it into an MRE and post it here.

mpvanderschelling commented 9 months ago

The core issue is related to #223: race conditions when multiple processes open the ExperimentData file at the same time.

I'm trying to fix this in the next update!
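Until a fix lands, one possible stopgap on the user side is to retry the load when the files are not yet visible to a node. The sketch below is only an illustration under that assumption, not the fix planned for f3dasm: `retry_on_missing_file` and its parameters are hypothetical names, and the only f3dasm call referenced is the `ExperimentData.from_file(filename=...)` call that appears in the traceback above.

```python
import time


def retry_on_missing_file(load, attempts=5, delay=0.5, backoff=2.0):
    """Call load() and retry with growing pauses on FileNotFoundError.

    Hypothetical helper: `load` is any zero-argument callable that reads
    the experiment files, for example a lambda wrapping
    ExperimentData.from_file.
    """
    for attempt in range(1, attempts + 1):
        try:
            return load()
        except FileNotFoundError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts:
                raise
            # Wait before trying again, then lengthen the next pause.
            time.sleep(delay)
            delay *= backoff


# Possible usage inside process(config), assuming f3dasm is imported:
# data = retry_on_missing_file(
#     lambda: f3dasm.ExperimentData.from_file(filename="exp_data_baseline")
# )
```

A retry like this only hides the symptom on the reading side. It does not stop two processes from writing the files at the same time, which is what the race condition in #223 is about.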