bessagroup / f3dasm

Framework for Data-Driven Design & Analysis of Structures & Materials (F3DASM)
https://f3dasm.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

file error when running job in hpc cluster #223

Closed JiaxiangYi96 closed 4 months ago

JiaxiangYi96 commented 9 months ago

Hello, Martin.

When I run my script sequentially, it finishes. Everything works fine on my own laptop and also on the HPC cluster. However, when I run the same script in parallel, some of the nodes fail with the following error:

[Image: Screenshot from 2023-11-09 09-28-23]

Do you know whether this error comes from f3dasm? By the way, I am using version 1.4.4.

mpvanderschelling commented 9 months ago

Hey yaga, this might be an error related to something called a race condition. A core tries to access the dataframe at a moment when it is empty, and therefore it gets an error.

In the next update I'm trying to change the code so that this no longer happens :)
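
To illustrate the kind of race condition described above, here is a minimal sketch using only the Python standard library. It is not f3dasm code: the file name `output.csv` and its contents are made up for the example. It relies on the fact that opening a file in `"w"` mode truncates it immediately, so a reader that arrives before the writer has written anything sees an empty file. pandas raises `pd.errors.EmptyDataError` when `read_csv` is given such an empty file.

```python
import os
import tempfile

# Hypothetical data file standing in for an experiment's output file
path = os.path.join(tempfile.mkdtemp(), "output.csv")
with open(path, "w") as f:
    f.write("id,y\n0,1.5\n")

# "Process A" starts storing new results: opening in "w" mode
# truncates the file to zero bytes right away
writer = open(path, "w")

# "Process B" reads the file at exactly this moment and finds nothing
with open(path) as reader:
    seen_mid_write = reader.read()

# "Process A" finishes writing and closes the file
writer.write("id,y\n0,1.5\n1,2.5\n")
writer.close()

# A later read sees the complete contents again
with open(path) as reader:
    seen_after_write = reader.read()

print(repr(seen_mid_write))
print(repr(seen_after_write))
```

Here both "processes" live in one script so that the interleaving is deterministic. On a cluster the same interleaving happens only occasionally, which is why the error shows up on some nodes and not others.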

JiaxiangYi96 commented 9 months ago

Thanks for the reply. I look forward to seeing this issue solved.

mpvanderschelling commented 4 months ago

I have added the following check:

          # If the lock has been acquired:
          with lock:
              tries = 0
              while tries < MAX_TRIES:
                  try:
                      self = ExperimentData.from_file(self.project_dir)
                      value = operation(self, *args, **kwargs)
                      self.store()
                      break

                  # A race condition can occur when the file is read
                  # at the moment it is empty
                  except pd.errors.EmptyDataError:
                      tries += 1
                      logger.debug((
                          f"EmptyDataError occurred on attempt"
                          f" {tries}/{MAX_TRIES}"))
                      sleep(1)

              else:
                  # Reached only if the loop ended without a break,
                  # i.e. every attempt failed
                  raise pd.errors.EmptyDataError()

If an EmptyDataError is raised, we expect that the file is present (the lock has been acquired!), but the input or output files are empty. The from_file operation is attempted up to 10 times, with a 1 second delay between attempts.
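
For readers who want to try the retry idea outside of f3dasm, here is a self-contained sketch of the same pattern. Everything in it is an assumption made for illustration: `EmptyDataError` is a local stand-in for `pd.errors.EmptyDataError`, `threading.Lock` stands in for whatever lock guards the project directory, and `retry_on_empty` and `flaky_load` are hypothetical names, not part of the f3dasm API.

```python
import threading
import time

MAX_TRIES = 10


class EmptyDataError(Exception):
    """Local stand-in for pd.errors.EmptyDataError."""


def retry_on_empty(load, lock, max_tries=MAX_TRIES, delay=1.0):
    """Call load() while holding the lock, retrying on EmptyDataError."""
    with lock:
        for attempt in range(1, max_tries + 1):
            try:
                return load()
            except EmptyDataError:
                # Give up after the last attempt, otherwise wait and retry
                if attempt == max_tries:
                    raise
                time.sleep(delay)


# Hypothetical loader: the "file" is empty on the first two reads
calls = {"n": 0}


def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise EmptyDataError("file is empty")
    return "data"


result = retry_on_empty(flaky_load, threading.Lock(), delay=0.01)
print(result, "after", calls["n"], "attempts")
```

The loader is passed in as a callable so that the retry logic can be tested without touching real files. If every attempt fails, the last `EmptyDataError` propagates to the caller.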