FabianGabriel / Active_flow_control_past_cylinder_using_DRL


Training shuts down right after start #7

Closed: ErikSchulze1796 closed this issue 2 years ago

ErikSchulze1796 commented 2 years ago

Hey Fabian, I have a problem when I execute python_job.sh to start a training run: after executing the script, I can see in the JOB screen that the 12 trajectories are being generated, but they are closed right after starting, and the SLURM output files just say this:

/var/tmp/slurmd_spool/job1711768/slurm_script: line 13: ./Allrun.singularity: Permission denied

Do you know a fix for this? Maybe I set up the repository incorrectly.

greetings Erik

FabianGabriel commented 2 years ago

Hi Erik,

it looks like the Allrun.singularity file lacks execute permission; that is something I have encountered too. You can change this with the "chmod" command. Navigate to the base case in "DRL_py/env/base_case/agentRotatingWallVelocity" and change the permissions there; they will be copied into each individual trajectory.
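For example, assuming the repository root as the working directory (the file name is taken from the error message above):

cd DRL_py/env/base_case/agentRotatingWallVelocity
# make the run script executable; trajectories copied from this base case inherit the permission
chmod +x Allrun.singularity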

Greetings Fabian

ErikSchulze1796 commented 2 years ago

I suspected something like this. That did the trick for this error, although I encountered a new one ^^

cp: cannot stat './env/base_case/baseline_data/Re_100/processor0/4.71025': No such file or directory

I checked the directory, and the baseline_data folder is not there. Do I have to run some script first to create it, or is there something hardcoded that I have to change?

Edit: I just talked to Andre, and he mentioned that you changed the starting time of the training to 4.5 s or so. As I understand it, I have to generate the first few seconds of the uncontrolled trajectories to get the training started. Do I have to run the Allrun.singularity script in ./env/base_case/agentRotatingWallVelocity/?

FabianGabriel commented 2 years ago

Yes, that is indeed correct. To accelerate the training process, the trajectories start at a later time from a snapshot of the simulation; the required data was missing here. Sadly, it isn't quite as easy as just copying the base_case and letting it run: you would need to make a number of changes to the setup to prevent premature starting of the control actions, etc. However, I can provide you with the necessary data via this link: baseline_data(400MB), which has now also been added to the README file.
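The data then has to end up where the failing cp command looks for it. A minimal sketch, assuming the download unpacks into the Re_100, Re_200, and Re_400 folders (the archive name and layout are assumptions):

# place the uncontrolled baseline snapshots where the setup expects them
mkdir -p env/base_case/baseline_data
tar xzf baseline_data.tar.gz -C env/base_case/baseline_data
# should now list Re_100, Re_200, Re_400
ls env/base_case/baseline_data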

ErikSchulze1796 commented 2 years ago

Okay, I put the Re_100, Re_200, and Re_400 folders into the correct directories, but my simulations still shut down. The py.log looks like this:

waiting for traj_0 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_7 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_9 ...
waiting for traj_10 ...

 starting trajectory : 0 

 starting trajectory : 1 

 starting trajectory : 2 

 starting trajectory : 3 

 starting trajectory : 4 

 starting trajectory : 5 

 starting trajectory : 6 

 starting trajectory : 7 

 starting trajectory : 8 

 starting trajectory : 9 

 starting trajectory : 10 

 starting trajectory : 11 

job :  trajectory_0 finished with rc = 0
job :  trajectory_1 finished with rc = 0
job :  trajectory_2 finished with rc = 0
job :  trajectory_3 finished with rc = 0
job :  trajectory_4 finished with rc = 0
job :  trajectory_5 finished with rc = 0
job :  trajectory_6 finished with rc = 0
job :  trajectory_7 finished with rc = 0
job :  trajectory_8 finished with rc = 0
job :  trajectory_11 finished with rc = 0
job :  trajectory_9 finished with rc = 0
job :  trajectory_10 finished with rc = 0
Traceback (most recent call last):
  File "main.py", line 124, in <module>
    action_bounds)
  File "/home/y0079256/DRL_py_beta/ppo.py", line 77, in train_model
    states, actions, rewards, returns, logpas = fill_buffer(env, sample, n_sensor, gamma, r_1, r_2, r_3, r_4, action_bounds)
  File "/home/y0079256/DRL_py_beta/reply_buffer.py", line 55, in fill_buffer
    assert n_traj > 0
AssertionError

Since it is an AssertionError, raised if the number of active trajectories is <= 0, I suppose the simulations shut down before reaching this line of code. Any ideas on why this happens?

darshan315 commented 2 years ago

Hello Erik,

I suspect the assertion error occurred because none of the simulations completed. Hence, please check whether the simulations finished correctly.

Could you please show the slurm output of a trajectory?

Best Regards, Darshan Thummar.

ErikSchulze1796 commented 2 years ago

Hey Darshan,

the slurm outputs all basically contain the same thing:

Running blockMesh on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running setExprBoundaryFields on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running decomposePar on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running renumberMesh (4 processes) on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
-parallel
Running pimpleFoam (4 processes) on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
-parallel

What I noticed is that there is no trajectory_0 folder inside the /env/sample_0/trajectory_X folders.

Kind regards Erik Schulze

FabianGabriel commented 2 years ago

Hi Erik, the slurm outputs look fine. You probably need to take a look at the log files in the trajectories. If a trajectory fails, it is copied over to a newly created "failed" directory in the DRL_py_beta folder. You can find the logs there.
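For example, to check the tail end of every solver log in one go (the subdirectory layout of the failed directory is an assumption; the log.* naming follows the usual OpenFOAM convention):

ls failed/
# errors usually appear at the end of the OpenFOAM logs
tail -n 20 failed/*/log.*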

Best Regards, Fabian

ErikSchulze1796 commented 2 years ago

Hey all,

I looked into the failed directory and found that all OpenFOAM log.* files contain this warning message:

--> FOAM Warning : 
    From void* Foam::dlLibraryTable::openLibrary(const Foam::fileName&, bool)
    in file db/dynamicLibrary/dlLibraryTable/dlLibraryTable.C at line 188
    Could not load "../../../libAgentRotatingWallVelocity.so"
../../../libAgentRotatingWallVelocity.so: cannot open shared object file: No such file or directory

I also looked into the directory, and it seems there is no libAgentRotatingWallVelocity.so file. Do I have to run the make script in the "agentRotatingWallVelocity" folder first in order to get the simulations running?

Kind regards Erik

AndreWeiner commented 2 years ago

Hi Erik, check out the README for the instructions to compile the boundary condition; roughly, it is the step sketched below. Best, Andre
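A minimal sketch of that compilation step, assuming a standard OpenFOAM wmake setup for the boundary-condition sources and the of_v2012.sif image seen in the slurm output (all paths, including the bashrc location inside the container, are assumptions; the README is authoritative):

cd agentRotatingWallVelocity
# compile the custom boundary condition inside the OpenFOAM container
singularity exec ../of_v2012.sif bash -c "source /usr/lib/openfoam/openfoam2012/etc/bashrc && wmake"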

ErikSchulze1796 commented 2 years ago

Hey all,

it works now, thank you! Although I had to copy the libAgentRotatingWallVelocity.so file into the parent directory afterwards to make it work. I hope this was correct; I guess I could also have changed the path in the controlDict instead.
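For reference, the relative library path from the warning above comes from the case's controlDict; that entry could be inspected and adjusted instead of copying the .so (the controlDict location below is the standard OpenFOAM one, an assumption for this case):

# show where the case expects the compiled library
grep -n "libAgentRotatingWallVelocity" env/base_case/agentRotatingWallVelocity/system/controlDict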

Kind regards Erik