ARISE-Initiative / robomimic

robomimic: A Modular Framework for Robot Learning from Demonstration
MIT License

Segfault of some algorithms on cluster #6

Closed vivianchen98 closed 2 years ago

vivianchen98 commented 3 years ago

Hi,

I am trying to run all the algorithms on the TwoArmTransport environment, and I ran into a segmentation fault when running td3_bc, bcq, and cql on our school's cluster (a GeForce GTX 1080 with 8120 MB of memory). Below is an example of the segmentation fault when running the td3_bc algorithm on the low_dim dataset. I tried to investigate a little, but it is not clear to me what is causing the segfault (I've attached the error message from the terminal below). There is no such issue when I run these algorithms on my own laptop. It would be great if there is a solution to the segfault so that I can run my experiments on the cluster. Thanks a lot in advance.

SequenceDataset (
    path=robomimic_data/low_dim.hdf5
    obs_keys=('object', 'robot0_eef_pos', 'robot0_eef_quat', 'robot0_gripper_qpos')
    seq_length=1
    filter_key=none
    frame_stack=1
    pad_seq_length=True
    pad_frame_stack=True
    goal_mode=none
    cache_mode=all
    num_demos=200
    num_sequences=93752
)

 10%|#         | 519/5000 [00:28<04:03, 18.43it/s]Segmentation fault (core dumped)
amandlek commented 3 years ago

Sorry to hear about these issues. Some more information would help us debug this - for example, can you post the config you were using? Which dataset is this - the proficient-human dataset or the multi-human dataset? Have you also tried bc - i.e., is it specifically those algorithms, or do all algorithms fail on your cluster? Finally, have you tried monitoring the program's memory usage over time (e.g., does it increase steadily after every batch)?

vivianchen98 commented 3 years ago

The configs (td3_bc, bcq, cql) are attached here, and I am using the PH transport dataset. configs.zip

I have tried bc and hbc, which ran successfully on the cluster. I am not sure how to monitor memory usage when submitting jobs to the cluster (I am using condor here). Do you have a common practice for monitoring memory usage while running experiments on clusters? Any pointers would be greatly appreciated. Thanks a lot in advance!

amandlek commented 3 years ago

@snasiriany will reply with some pointers on monitoring memory usage.

In the meantime, could you also try this BCQ config? configs.zip

I modified yours to make it so that the dataset is not stored in memory, and I turned off video rendering and rollouts as well, so you can isolate memory issues to torch training, instead of potential memory issues from robosuite environments. I hope this is helpful!
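For reference, the changes described above correspond to config fields along these lines (a sketch only; the key names assume robomimic's standard JSON config layout, and the snippet shows just the relevant fields, not a complete config):

```json
{
    "experiment": {
        "render_video": false,
        "rollout": {
            "enabled": false
        }
    },
    "train": {
        "hdf5_cache_mode": null
    }
}
```

Setting hdf5_cache_mode to null makes the dataset read samples from disk rather than caching everything in memory (the dataset printout above shows cache_mode=all), while the experiment settings disable video rendering and environment rollouts.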

snasiriany commented 3 years ago

Hi @vivianchen98, regarding memory usage, here's what has worked for me: I add the following lines in robomimic/scripts/train.py starting at line 226:

# Log RAM usage (requires psutil)
import os
import psutil

process = psutil.Process(os.getpid())
k = 'RAM Usage (MB)'
v = int(process.memory_info().rss / 1000000)  # resident set size, in MB
data_logger.record("System/{}".format(k), v, epoch)

Just make sure that psutil is installed. If it is not, you can run pip install psutil.

This will log memory usage per epoch under the key System/RAM Usage (MB). Hopefully this helps you track memory usage over time and identify where things are running out of memory. You may also want to move this code to other parts of the training loop -- for example, right after video logging. Please let me know if this was helpful!

vivianchen98 commented 3 years ago

Hi @snasiriany,

Thanks a lot for the instructions! However, I ran into an issue with your memory-monitoring snippet: my run hits the segmentation fault while still training the first epoch, so it never reaches the logging code you suggested. Could you point me to where I can monitor memory usage at each step inside a training epoch? Thanks a lot in advance!

amandlek commented 3 years ago

You can go ahead and monitor it in this function here.
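If it helps, here is a minimal per-step sketch that uses only the standard library (the resource module, Unix-only), so it works even where psutil is unavailable; the loop below stands in for the actual batch loop, which is hypothetical here:

```python
import resource
import sys

def peak_ram_mb():
    """Return peak resident set size in MB.

    ru_maxrss is reported in KB on Linux and in bytes on macOS,
    so pick the divisor based on the platform.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1_000_000 if sys.platform == "darwin" else 1_000
    return rss // divisor

# Example: print usage at every step (stand-in for the real batch loop)
for step in range(3):
    print("step {}: peak RAM {} MB".format(step, peak_ram_mb()))
```

Printing (rather than logging per epoch) means you still see the last value before a crash, which is useful when the segfault happens mid-epoch.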

vivianchen98 commented 3 years ago

I've printed the algorithm's RAM usage, and it stays constant at 4545 MB (while the cluster machine has 55 GB of RAM), so I believe this is not a memory leak. Do you have any hunch about what the problem might be?

amandlek commented 3 years ago

It's hard to say. Did you try running the config I posted earlier? Also, perhaps try training without a GPU to see if that changes anything?
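One way to force a CPU-only run is to hide the GPUs from CUDA before launching training (a command sketch; the exact train.py invocation and the config filename are assumptions):

```sh
# Hide all GPUs so torch falls back to CPU
CUDA_VISIBLE_DEVICES="" python robomimic/scripts/train.py --config bcq.json
```

Alternatively, if the config exposes a train.cuda flag, setting it to false should have the same effect.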

amandlek commented 2 years ago

Closing this issue for now - please re-open if this issue persists