columncolab / EMC2

Earth Model Column Collaboratory
BSD 3-Clause "New" or "Revised" License

Issue in parallel processing of large datasets #42

Closed isilber closed 3 years ago

isilber commented 3 years ago

When I try to process a large dataset in parallel (~1500 time steps; 10 subcolumns), I receive the following error, which causes the simulator to crash (it enters what seems to be an endless loop), even when the 'chunks' option is used with rather small chunks. There are no issues if parallel is False.

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
rcjackson commented 3 years ago

What you have to do is place the code running EMC2 into an if/main statement like this:

if __name__ == '__main__':
    ds = emc2.simulator(bla bla)

Otherwise, if you try to run in parallel, it will give you the error above.
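
The error text above refers to the process start method ("you are not using fork"). As a rough way to see whether a given machine is affected, here is a minimal sketch using only the standard library. `needs_main_guard` is a hypothetical helper name, and whether the parallel backend used by EMC2 follows the interpreter's default start method is an assumption, not something confirmed in this thread.

```python
import multiprocessing as mp


def needs_main_guard() -> bool:
    """Return True when child processes are not started with plain fork.

    With "spawn" or "forkserver", each child re-imports the main module,
    so unguarded top-level code would run again in every child.
    """
    # get_start_method() reports the start method in effect for this interpreter.
    return mp.get_start_method() != "fork"


if __name__ == "__main__":
    print("available start methods:", mp.get_all_start_methods())
    print("current start method:", mp.get_start_method())
    print("guard required:", needs_main_guard())
```

Even where this prints False, keeping the guard is harmless and makes the script portable to platforms that do not fork.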

isilber commented 3 years ago

Thanks! __name__ and '__main__' being? Shouldn't we implement it as default when calling dask?

rcjackson commented 3 years ago

__name__ is the name of the module that the code is currently running in. When Python runs your script directly, it sets __name__ to '__main__' in that script. When the same file is imported instead (which is what happens in each worker process when the workers are not started with fork), __name__ is no longer '__main__', so the code under the if statement does not execute again. Sadly, because inside an imported library such as EMC2 __name__ is the library module's name rather than '__main__', there is no way to implement this line by default in the code. The only way is to warn users that, when starting a parallel task, this if statement must surround the top-level procedure of their code. A lot of the time, a good way to design a script with this in mind is:

import numpy as np

def my_program():
    ...  # stuff (all of the code that runs EMC2 goes here)

if __name__ == "__main__":
    my_program()

If we put all of our code in my_program(), this will ensure that this error never pops up.

Bobby
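
The pattern described above can be sketched as a complete, runnable script. This is only an illustration under stated assumptions: `ProcessPoolExecutor` from the standard library stands in for the parallel backend, and `process_chunk`, `split_into_chunks`, and the toy data are hypothetical placeholders, not part of EMC2.

```python
from concurrent.futures import ProcessPoolExecutor


def process_chunk(chunk):
    """Hypothetical stand-in for per-chunk work: sum of squares."""
    return sum(x * x for x in chunk)


def split_into_chunks(values, chunk_size):
    """Split a list into consecutive pieces of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def my_program():
    # Everything that starts worker processes lives inside this function.
    values = list(range(10))
    chunks = split_into_chunks(values, chunk_size=3)
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(process_chunk, chunks))
    print("total:", sum(results))


# Worker processes that re-import this file skip this block, so they
# never try to start a pool of their own.
if __name__ == "__main__":
    my_program()
```

The worker function is defined at module level so that child processes can find it when they import the file; only the call that launches the work sits under the guard.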

isilber commented 3 years ago

Got it!