Problem in setting up this DRL-part

FabianGabriel commented 3 years ago

Hi Darshan,

I tried to follow your instructions of how to set up the DRL-part of this repository. I came across multiple little problems:

You wrote this command: module load python / 3.7. The spaces are wrong there; it should be: module load python/3.7
You made a typo on this command: pip install -r ./DRL_py/docker/requirement.txt. It should be pip install -r ./DRLpy/docker/requirements_.txt
Another typo in the PPO iterations part: triaing

And one big problem: After I executed all these commands, I wanted to submit the training job on the cluster with this command: sbatch python_job.sh. First of all, that should be sbatch ./DRL_py/python_job.sh. After I changed it, the Slurm manager put out this: /var/tmp/slurmd_spool/job1636351/slurm_script: line 8: /home/y0095063/venv/bin/activate: No such file or directory. I figured that this is probably a problem in python_job.sh, as I have your repository in a subfolder. So I changed the command source ~/venv/bin/activate to source ./venv/bin/activate. I then resubmitted the job, and the error did not show up again, but the job was still cancelled immediately. The Slurm output file was now blank, though. I was not able to fix this.

Can you take a look?

Best Regards Fabian

darshan315 commented 3 years ago

Hi Fabian,

Thank you for pointing out the typos.

To run the training, you can execute the shell script either from the parent directory or from the child directory (DRL_py). If you are executing from the parent directory, use: sbatch ./DRL_py/python_job.sh. In the case of the child directory, first change into it with cd DRL_py and then execute: sbatch python_job.sh. Regarding the error, there might be a log file for main.py named "py.log" in the DRL_py directory. Can you provide me with that file so I can investigate further?
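Why the submit directory matters can be shown with a minimal sketch. By default, Slurm starts a job script in the directory from which sbatch was called, so a relative path such as ./venv/bin/activate points to a different place depending on where the job is submitted, while ~ always expands to the home directory. The directory names below are hypothetical, and the function activate_candidates is only an illustration, not part of the repository.

```python
from pathlib import PurePosixPath


def activate_candidates(submit_dir, home_dir):
    """Return the paths that the two `source` variants from the thread refer to.

    submit_dir: directory from which `sbatch` is called (by default, Slurm
                starts the job script in this directory).
    home_dir:   the user's home directory, which is what `~` expands to.
    """
    tilde_variant = PurePosixPath(home_dir) / "venv/bin/activate"
    dot_variant = PurePosixPath(submit_dir) / "venv/bin/activate"
    return {
        "~/venv/bin/activate": tilde_variant,
        "./venv/bin/activate": dot_variant,
    }


# Hypothetical layout: repository cloned into a subfolder of the home directory.
home = "/home/user"
repo = "/home/user/flow_past_cylinder_by_DRL"

# Compare submitting from the parent directory and from the child directory.
for submit_dir in (repo, repo + "/DRL_py"):
    print("submitted from", submit_dir)
    for variant, target in activate_candidates(submit_dir, home).items():
        print("  ", variant, "->", target)
```

If the virtual environment was created inside the repository folder, only one of these combinations points at it, so the source line in python_job.sh and the submit directory have to match.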

Best Regards, Darshan

FabianGabriel commented 3 years ago

Hi Darshan,

The message in the py.log file was something like 'missing file or directory main.py'. So I executed it in the child directory. That did produce several Slurm jobs, but they were all cancelled immediately. The error in the first one is: /var/tmp/slurmd_spool/job1636457/slurm_script: line 8: ./venv/bin/activate: No such file or directory. The error in all the other ones is: FATAL: could not open image /home/y0095063/flow_past_cylinder_by_DRL/of2006-py1.6-cpu.sif: failed to retrieve path for /home/y0095063/flow_past_cylinder_by_DRL/of2006-py1.6-cpu.sif: lstat /home/y0095063/flow_past_cylinder_by_DRL/of2006-py1.6-cpu.sif: no such file or directory

Best Regards Fabian

darshan315 commented 3 years ago

Hi Fabian,

You are missing the Singularity image (of2006-py1.6-cpu.sif), which is the cause of the error. You can build the Singularity image by following the instructions given here. Or I can provide it to you.

The Singularity image should be in the parent directory.
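A quick check before submitting can catch a missing image early. The following is a minimal sketch, assuming the file names mentioned in this thread and a layout in which the image sits in the parent directory and the scripts sit in DRL_py; the helper missing_files and the REQUIRED list are hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the thread; their exact locations are assumptions.
REQUIRED = (
    "of2006-py1.6-cpu.sif",  # Singularity image, expected in the parent directory
    "DRL_py/python_job.sh",  # job script passed to sbatch
    "DRL_py/main.py",        # training entry point
)


def missing_files(repo_root, required=REQUIRED):
    """Return the entries of `required` that do not exist below `repo_root`."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Run from the parent directory of the repository.
    missing = missing_files(".")
    if missing:
        print("Missing before submission:", ", ".join(missing))
    else:
        print("All required files found.")
```

Running it from the parent directory lists anything that is absent, for example the .sif file, before any Slurm job is created.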

Bests, Darshan Thummar.

FabianGabriel commented 3 years ago

Hi Darshan,

I think it would be best if you provide it to me as the instructions in the link you posted lead to a more recent version (of2012-py1.7.1) and I don't know if just renaming it would be possible.

Best Regards Fabian

darshan315 commented 3 years ago

If you want to use the newer version, then you have to modify the Python script (env_cluster.py : here) according to the newer version. Renaming would work if no functionality has changed in the newer version; however, I would not recommend it.

AndreWeiner commented 3 years ago

Hi Fabian and Darshan, I would suggest first making it work with the old container and then switching to the new one. There should be no changes in the PyTorch API or OpenFOAM that could affect the RL workflow. Best, Andre

AndreWeiner commented 3 years ago

Hi Darshan, I think you can close this issue. Best, Andre

darshan315 / flow_past_cylinder_by_DRL

Problem in setting up this DRL-part #38