Closed FabianGabriel closed 3 years ago
Hi Fabian,
Thank you for pointing out the typos.
To start the training, you can run the shell script either from the parent directory or from the child directory (DRL_py). From the parent directory, run:
sbatch ./DRL_py/python_job.sh
From the child directory, first change into it with
cd DRL_py
and then run:
sbatch python_job.sh
Regarding the error, there may be a log file for main.py named "py.log" in the DRL_py directory. Could you send me that file so I can investigate further?
Best Regards, Darshan
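The two ways of submitting the job described above can be sketched as a small shell helper. This is only an illustration, assuming the layout mentioned in the thread (a parent directory containing DRL_py/python_job.sh); the function name submit_cmd is hypothetical and not part of the repository. It prints the matching sbatch command instead of running it.

```shell
# Hypothetical helper (not part of the repository): print the sbatch command
# that matches the directory you are currently in.
# Assumed layout: <parent>/DRL_py/python_job.sh
submit_cmd() {
    if [ -f ./python_job.sh ]; then
        # We are inside the child directory (DRL_py)
        echo "sbatch python_job.sh"
    elif [ -f ./DRL_py/python_job.sh ]; then
        # We are in the parent directory
        echo "sbatch ./DRL_py/python_job.sh"
    else
        echo "python_job.sh not found; change into the repository first" >&2
        return 1
    fi
}
```

Keep in mind that Slurm starts a batch job in the directory it was submitted from by default, so any relative paths inside the job script are resolved against that directory.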
Hi Darshan,
The message in the py.log file was something like 'missing file or directory main.py'. So I executed it in the child directory.
That did produce several Slurm jobs, but they were all cancelled immediately.
The error in the first one is /var/tmp/slurmd_spool/job1636457/slurm_script: line 8: ./venv/bin/activate: No such file or directory
The error in all the other ones is FATAL: could not open image /home/y0095063/flow_past_cylinder_by_DRL/of2006-py1.6-cpu.sif: failed to retrieve path for /home/y0095063/flow_past_cylinder_by_DRL/of2006-py1.6-cpu.sif: lstat /home/y0095063/flow_past_cylinder_by_DRL/of2006-py1.6-cpu.sif: no such file or directory
Best Regards Fabian
Hi Fabian,
You are missing the Singularity image (of2006-py1.6-cpu.sif), which is the cause of the error. You can build the Singularity image by following the instructions given here, or I can provide it to you.
The Singularity image should be in the parent directory.
Bests, Darshan Thummar.
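A quick check before submitting can catch the missing image early. The sketch below assumes only what the thread states: the image file is named of2006-py1.6-cpu.sif and is expected in the parent directory. The function name check_prereqs and its argument are hypothetical.

```shell
# Hypothetical pre-flight check (not part of the repository).
# The image name is taken from the error message quoted in this thread.
check_prereqs() {
    repo_root="${1:-.}"                       # parent directory, defaults to cwd
    image="$repo_root/of2006-py1.6-cpu.sif"
    missing=0
    if [ ! -f "$image" ]; then
        echo "missing Singularity image: $image" >&2
        missing=1
    fi
    return "$missing"
}

# Example use from the parent directory:
# check_prereqs . && sbatch ./DRL_py/python_job.sh
```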
Hi Darshan,
I think it would be best if you provide it to me as the instructions in the link you posted lead to a more recent version (of2012-py1.7.1) and I don't know if just renaming it would be possible.
Best Regards Fabian
If you want to use a newer version, then you have to modify the Python script (env_cluster.py: here) to match the newer version. Renaming would work if no functionality has been deprecated in the newer version; however, I would not recommend it.
Hi Fabian and Darshan, I would suggest first making it work with the old container and then switching to the new one. There should be no changes in the PyTorch API or OpenFOAM that could affect the RL workflow. Best, Andre
Hi Darshan, I think you can close this issue. Best, Andre
Hi Darshan,
I tried to follow your instructions on how to set up the DRL part of this repository. I came across several small problems:
module load python / 3.7
The spaces are wrong there. It should be:
module load python/3.7
pip install -r ./DRL_py/docker/requirement.txt
It should be:
pip install -r ./DRLpy/docker/requirements_.txttriaing
And one big problem: after running all these commands, I wanted to submit the training job on the cluster with this command:
sbatch python_job.sh
First of all, that should be:
sbatch ./DRL_py/python_job.sh
After I changed it, Slurm printed this:
/var/tmp/slurmd_spool/job1636351/slurm_script: line 8: /home/y0095063/venv/bin/activate: No such file or directory
I figured that is probably a problem in python_job.sh, as I have your repository in a subfolder. So I changed the command
source ~/venv/bin/activate
to
source ./venv/bin/activate
I then resubmitted the job. The error did not show up again, but the job was still cancelled immediately, and the Slurm output file was now blank. I was not able to fix this. Can you take a look?
Best Regards Fabian
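One way to make the virtual environment step fail loudly instead of leaving a blank output file is to look for the activate script in several candidate locations and abort if none exists. This is a sketch only: the function name find_activate is hypothetical, and the two candidate paths are simply the ones mentioned in this issue.

```shell
# Hypothetical helper (not part of python_job.sh): print the first existing
# activate script among the given virtual environment directories.
find_activate() {
    for venv_dir in "$@"; do
        if [ -f "$venv_dir/bin/activate" ]; then
            echo "$venv_dir/bin/activate"
            return 0
        fi
    done
    echo "no virtual environment found in: $*" >&2
    return 1
}

# Possible use inside a job script, with the two paths from this issue:
# activate_script=$(find_activate "$HOME/venv" "./venv") || exit 1
# . "$activate_script"
```

With this pattern the Slurm output file would at least contain the list of directories that were searched.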