Closed: vivianchen98 closed this issue 2 years ago
Sorry to hear about these issues. Some more information would allow us to provide more help -- for example, can you post the config you were using? And which dataset is this, the proficient human dataset or the multi human dataset? Also, have you tried with `bc` -- e.g. is it specifically those algorithms, or are all algorithms failing on your cluster? Have you also tried monitoring the memory usage of the program, and when it increases (e.g. does it increase steadily after every batch)?
The configs (`td3_bc`, `bcq`, `cql`) are attached here, and I am using the PH transport dataset.
configs.zip
I have tried with `bc` and `hbc`, which ran successfully on the cluster. I am not sure how to verify the memory usage when submitting the jobs to the cluster (I am using condor here). Do you have a common practice for monitoring the memory usage while running experiments on clusters? If you could give me some pointers, that would be greatly appreciated. Thanks a lot in advance!
@snasiriany will reply with some pointers on monitoring memory usage.
In the meantime, could you also try this BCQ config? configs.zip
I modified yours to make it so that the dataset is not stored in memory, and I turned off video rendering and rollouts as well, so you can isolate memory issues to torch training, instead of potential memory issues from robosuite environments. I hope this is helpful!
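For reference, the changes amounted to something like the following (a sketch only: the file names here are made up, the key names are my best recollection of the robomimic config structure, and the attached config is the source of truth):

```python
# Sketch of the config overrides described above (hypothetical file names;
# key names assumed from the robomimic config structure).
import json

with open("bcq.json", "r") as f:
    cfg = json.load(f)

cfg["train"]["hdf5_cache_mode"] = None           # do not cache the dataset in memory
cfg["experiment"]["render_video"] = False        # turn off video rendering
cfg["experiment"]["rollout"]["enabled"] = False  # turn off evaluation rollouts

with open("bcq_no_cache.json", "w") as f:
    json.dump(cfg, f, indent=4)
```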
Hi @vivianchen98, regarding memory usage, here's what has worked for me: I add the following lines in `robomimic/scripts/train.py` starting at line 226:
```python
# Log RAM usage
import psutil
process = psutil.Process(os.getpid())
k = 'RAM Usage (MB)'
v = int(process.memory_info().rss / 1000000)
data_logger.record("System/{}".format(k), v, epoch)
```
Just make sure that `psutil` is installed. If it is not, you can run `pip install psutil`.
This will log the memory usage per epoch under the key `System/RAM Usage (MB)`. Hopefully this helps to track the memory usage over time and identify where things are running out of memory. You may want to move this code around to different parts of the code too -- for example right after video logging, and so forth. Please let me know if this was helpful!
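P.S. If you want to call it from several places (e.g. after training, after validation, and after video logging), the same snippet can be wrapped in a small helper (a sketch; `log_ram_usage` is just a hypothetical name, and `data_logger` / `epoch` come from the surrounding training script):

```python
import os
import psutil

def log_ram_usage(data_logger, epoch, tag="RAM Usage (MB)"):
    """Record this process's resident memory (in MB) under System/<tag>."""
    process = psutil.Process(os.getpid())
    mb = int(process.memory_info().rss / 1000000)
    data_logger.record("System/{}".format(tag), mb, epoch)
```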
Hi @snasiriany,
Thanks a lot for the instructions! However, there is an issue with running your memory usage monitoring code snippet: my code runs into a segmentation fault while still training the first epoch, so it never reaches the memory logging part you suggested. Could you point me to where I can monitor the memory usage for each step inside a training epoch? Thanks a lot in advance!
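For example, would something like the following, called once per batch inside the inner training loop, be a reasonable place to check? (A rough sketch on my end; `print_batch_ram` is just a name I made up.)

```python
# Rough sketch of per-batch memory printing; the idea would be to call this
# once per batch from inside the inner training loop.
import os
import psutil

_PROCESS = psutil.Process(os.getpid())

def print_batch_ram(batch_idx, every=50):
    """Print this process's resident memory (in MB) every `every` batches."""
    if batch_idx % every == 0:
        mb = int(_PROCESS.memory_info().rss / 1000000)
        print("batch {}: RSS = {} MB".format(batch_idx, mb), flush=True)
```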
I've printed the RAM usage of the algorithm, and it shows that the RAM usage stays constant at 4545 MB (while the cluster machine has 55 GB of RAM), so I believe the problem has nothing to do with a memory leak. Do you have any hunch on what the problem is?
It's hard to say. Did you try running the config I posted earlier? Also, perhaps try training without GPU to see if that changes anything?
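One simple way to force a CPU-only run (assuming the GPU is otherwise picked up automatically) is to hide it from the process before anything CUDA-related is initialized; the `cuda` flag under the `train` section of the config should also work if your version has it:

```python
# Minimal sketch: hide all CUDA devices so PyTorch falls back to CPU.
# This must run before torch initializes CUDA, e.g. at the very top of the
# training script or via the shell environment when launching the job.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```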
Closing this issue for now - please re-open if this issue persists
Hi,
I am trying to run all the algorithms on the `TwoArmTransport` environment, and I ran into a segmentation fault when trying `td3_bc`, `bcq`, and `cql` on our school's cluster (with a GeForce GTX 1080 with 8120 MB of memory). Here is an example of the segmentation fault when running the `td3_bc` algorithm on the `low_dim` dataset. I tried to investigate a little bit, but it's not clear to me what is causing this segfault (I've attached the error message from the terminal below). There is no such issue if I run these algorithms on my own laptop. It would be great if there is a solution to the segfault so that I can run my experiments on the cluster. Thanks a lot in advance.
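If it would help with debugging, I can also re-run with Python's built-in `faulthandler` enabled so that the segfault produces a Python-level traceback (a rough sketch; I would add this near the top of `robomimic/scripts/train.py`, or equivalently launch the script with `python -X faulthandler ...`):

```python
# faulthandler is in the Python standard library; enabling it makes the
# interpreter dump a Python traceback when the process receives SIGSEGV,
# which should show which call the crash happens in.
import faulthandler
faulthandler.enable()
```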