DaRL-LibSignal / LibSignal

CPU memory problem with the CoLight code #6

Closed MPHarryZhang closed 11 months ago

MPHarryZhang commented 1 year ago

When I ran the CoLight code on the Hangzhou 4x4 dataset, CPU memory usage (the GPU was not used) kept increasing as the training episodes progressed. Eventually all 64 GB of memory was exhausted and the program was killed. Can you tell me how to solve this problem?

derekmei233 commented 1 year ago

Hi Zhang. Thank you for the inquiry. I think the torch geometric library causes this problem. I will look into it this week.

MPHarryZhang commented 1 year ago

Thank you very much.

derekmei233 commented 1 year ago

Hi, I have changed a few lines of code in colight.py and debugged it. I observed memory increasing slightly over the first few epochs (around 30), possibly because of numpy arrays accumulating in the queue. After that, memory usage is stable and no longer increases. Could you try the new version to see whether the problem is solved?
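The "grows for a while, then plateaus" pattern described above is what a bounded queue would produce. The following is a minimal sketch of that behaviour, not the code in colight.py: `BoundedReplayBuffer`, `fill`, the capacity of 100, and the 16-element observations are all hypothetical, and plain lists stand in for the numpy arrays.

```python
from collections import deque


class BoundedReplayBuffer:
    """Hypothetical replay buffer; not the LibSignal implementation."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest item once it is full,
        # so the number of stored transitions can never exceed capacity.
        self.transitions = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs):
        self.transitions.append((obs, action, reward, next_obs))

    def __len__(self):
        return len(self.transitions)


def fill(buffer, n_steps):
    """Add n_steps dummy transitions and record the buffer size after each."""
    sizes = []
    for _ in range(n_steps):
        buffer.add([0.0] * 16, 0, 0.0, [1.0] * 16)
        sizes.append(len(buffer))
    return sizes


sizes = fill(BoundedReplayBuffer(capacity=100), 300)
print(sizes[0], sizes[99], sizes[299])  # 1 100 100
```

If memory keeps growing after the buffer has reached its capacity, the growth has to come from somewhere other than the queue itself.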

MPHarryZhang commented 1 year ago

I'm very sorry, but my problem is still not resolved. With batch size=64, CoLight's memory usage grows from 8 GB to 20 GB within 20 episodes. With batch size=16, it grows from 8 GB to 30 GB within 130 episodes. It still seems to accumulate memory every round, which really puzzles me. I tried using PyTorch's automatic memory management mechanism, but it seriously slows down training and only reduces the rate of growth; the memory still keeps growing. I also found that memory accumulates during the training episodes but does not increase during the test episodes that follow each round of training.
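One way to turn this kind of observation into per-episode numbers is Python's standard `tracemalloc` module. The sketch below is only an illustration under stated assumptions: `run_episode` is a hypothetical stand-in for a training episode, and the 100 kB it allocates is arbitrary. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory allocated natively inside a library may not show up in these readings.

```python
import gc
import tracemalloc


def run_episode(store):
    # Hypothetical stand-in for one training episode: it allocates
    # 100 kB and keeps a reference to it in `store`.
    store.append(bytearray(100_000))


def measure_growth(n_episodes, leak):
    """Return traced memory (bytes) after each episode.

    With leak=True the per-episode data is kept alive, which mimics
    an accumulating structure; with leak=False it is released.
    """
    store = []
    readings = []
    tracemalloc.start()
    for _ in range(n_episodes):
        run_episode(store)
        if not leak:
            store.clear()
        gc.collect()  # rule out garbage that is merely not yet collected
        current, _peak = tracemalloc.get_traced_memory()
        readings.append(current)
    tracemalloc.stop()
    return readings


leaky = measure_growth(5, leak=True)
stable = measure_growth(5, leak=False)
print("leaky growth (bytes): ", leaky[-1] - leaky[0])
print("stable growth (bytes):", stable[-1] - stable[0])
```

Comparing the readings for training and test episodes separately would show which phase holds on to memory; `tracemalloc.take_snapshot()` can then narrow the growth down to source lines.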

ShawLen commented 1 year ago

Hi Zhang, I have debugged the problem again. In the latest experiment (batch size=64), CPU memory remains stable during model training. Could you please pull the code again to see if the problem is solved?

MPHarryZhang commented 1 year ago

Thank you very much. The problem seems to be solved: I successfully ran the program on the Hangzhou dataset. I will continue to test other cases.

CorneliusDeng commented 1 year ago

Hello, may I ask you for advice?

I have not been able to get my CoLight model running and would appreciate your help. Thank you very much.

MPHarryZhang commented 1 year ago

Hello. What problem are you running into? The author has already fixed the CoLight code, and it runs now.

CorneliusDeng commented 1 year ago

I have successfully run the PressLight model but ran into some difficulties with CoLight. My questions are listed below:

1. What is the difference between "colight.py" and "colight_pytorch_agent.py"? To run the CoLight model, should the agent use "colight.py" or "colight_pytorch_agent.py"?
2. I used the agent defined in "colight.py" to execute the command "python run.py --ngpu 0 --network cityflow1x1 --agent colight --world cityflow", but it shows this error:
   File "/data/dengqi_code/LibSignal/agent/colight.py", line 36, in __init__
   self.graph = Registry.mapping['world_mapping']['graph_setting'].graph
   KeyError: 'graph_setting'
3. I couldn't find anything related to the graph in the configuration files "base.yml" and "colight.yml", so I am confused.
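The KeyError above means that nothing had been stored under the 'graph_setting' key at the moment it was read. The sketch below only illustrates that mechanism with a toy registry; `ToyRegistry`, `register_world`, and `get_graph_setting` are hypothetical names and not LibSignal's API, and it makes no claim about why the key was missing in this particular run.

```python
class ToyRegistry:
    """Hypothetical nested-dict registry; not LibSignal's Registry class."""

    mapping = {"world_mapping": {}}

    @classmethod
    def register_world(cls, name, value):
        cls.mapping["world_mapping"][name] = value


def get_graph_setting():
    """Look up the graph entry, failing with a more descriptive message."""
    world = ToyRegistry.mapping["world_mapping"]
    if "graph_setting" not in world:
        raise RuntimeError(
            "'graph_setting' has not been registered; whatever builds the "
            "graph must run before the agent is constructed"
        )
    return world["graph_setting"]


# Reading the key before anything registered it reproduces the KeyError.
try:
    ToyRegistry.mapping["world_mapping"]["graph_setting"]
except KeyError as err:
    print("KeyError:", err)

# Once something has registered the entry, the same lookup succeeds.
ToyRegistry.register_world("graph_setting", {"nodes": 1})
print(get_graph_setting())
```

In a setup like this, the thing to check is which step is supposed to register the entry and whether it ran for the chosen network and configuration.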

I would greatly appreciate your help if you have the time. I am working hard to learn and would like to understand and run this model so that I can continue my research.

Thank you very much for taking the time to read my message. If you're willing, you can reach me at corneliusdeng@163.com. I look forward to your response. I am Chinese; if you are too, perhaps we could switch to a more efficient way of communicating.