devedse closed this issue 4 years ago.
Hi @devedse,
Just a quick look at your screenshots, and it seems like the allocation is happening in your code. Are you sure there isn't a memory leak in your RaceTrackScript?
Could you share your project with us? Without an in-depth view into your code, it will be hard to know what's going on. You can always install the Unity Memory Profiler from the Package Manager to see what type of objects are leaking.
I'll do some investigation with the memory profiler tomorrow first. Would it be possible to share a private repository?
You could add me as a collaborator and remove me whenever you feel.
I've been running the memory profiler.
Screenshot of memory usage just after start:
Screenshot of memory usage after 5 minutes of running:
And the diff:
I've also created a private repository which I will add you to. To actually start training I'm currently using the TrainRace.cmd script from our fork of ml-agents: https://github.com/devedse/ml-agents/blob/master/TrainRace.cmd
You'll receive an invite for the other private repository shortly :smile:
When digging through the Memory Profile I see a lot of arrays with empty values, e.g.:
The same happens with arrays for UINT64 and String. Loads of 0 values or null values.
I got the invite. I'll see if I can take some time today to run a profiling trace myself. We haven't seen any memory leaks with our nightly training sessions. But maybe you are hitting a case that we haven't covered. Thanks for your cooperation and help debugging this.
Are you by chance using visual observations?
Hey, that's great, thanks. It could very well be some misunderstanding on my part about how specific things in Unity work, but let's see.
I don't use visual observations yet. The inputs of the ML agent are a bunch of ray traces that provide booleans for whether they hit the road or the grass.
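For context, such boolean ray observations might be collected roughly as sketched below in a recent ML-Agents release; the exact API differs between versions, and the class name `RaceAgent`, the ray directions, and the `"Road"` layer name are all assumptions for illustration, not taken from the project:

```csharp
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

// Hypothetical sketch: one boolean observation per ray, true when the ray hits the road.
public class RaceAgent : Agent
{
    // Illustrative ray directions relative to the car: straight ahead, 30 degrees left/right.
    private static readonly Vector3[] RayDirections =
    {
        Vector3.forward,
        Quaternion.Euler(0f, -30f, 0f) * Vector3.forward,
        Quaternion.Euler(0f, 30f, 0f) * Vector3.forward,
    };

    public override void CollectObservations(VectorSensor sensor)
    {
        foreach (var localDir in RayDirections)
        {
            var dir = transform.TransformDirection(localDir);
            // "Road" is an assumed layer name for the track surface.
            bool hitRoad = Physics.Raycast(transform.position, dir, out RaycastHit hit, 50f)
                           && hit.collider.gameObject.layer == LayerMask.NameToLayer("Road");
            sensor.AddObservation(hitRoad);
        }
    }
}
```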
Hey @devedse,
I was able to reproduce the leak without training. I'm just running the RaceGameScene and seeing large buffers of UINT64 and System.Byte[]. I'm still not sure if it's coming from ML-Agents or not. I will dig a bit further.
So the large amounts of INT64, String, and Byte[] seem to be coming from the profiler itself, which is unfortunate noise. I'll keep digging...
I do, however, see your Material memory growing significantly. It grew from around 20 MB about a minute after the start to about 70 MB ten minutes later.
Here is a screenshot of a small portion of the Material memory.
Please make sure you are correctly disposing of materials that are no longer being used, or mark them as shared if you are going to reuse them. My guess is that this could be a leak from the random generation of your track, but I'm not sure.
I hope this helps.
Related forum thread about disposing of materials https://forum.unity.com/threads/unityengine-material-object-memory-leak.48623/
Since the materials have the (Instance) tag, it tells me you need to dispose of them manually. This should solve your memory leak.
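For illustration, here is a minimal sketch of the pattern being described; the class and field names (`TrackPieceExample`, `Mat`) are hypothetical, not taken from the project:

```csharp
using UnityEngine;

// Illustrative sketch only; names are hypothetical.
public class TrackPieceExample : MonoBehaviour
{
    public Material Mat;

    void ApplyLeaky(Renderer rend)
    {
        // Going through rend.material causes Unity to duplicate the material
        // for this renderer (shown as "SomeMaterial (Instance)" in the
        // profiler). Unity never destroys these copies automatically, so
        // creating new track pieces over and over accumulates them.
        rend.material = Mat;
    }

    void ApplyShared(Renderer rend)
    {
        // sharedMaterial uses the original asset directly, so no copy is made.
        rend.sharedMaterial = Mat;
    }

    void OnDestroy()
    {
        // Any instance that was created must be destroyed explicitly.
        // (Note: reading .material here would itself create an instance if
        // none exists yet, so real code should track whether one was made.)
        Destroy(GetComponent<Renderer>().material);
    }
}
```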
@surfnerd , thanks for this, I'm sorry for the confusion around this.
I couldn't find a way to Dispose of a material.
What I'll try instead is setting meshRenderer.sharedMaterial rather than meshRenderer.material. I'm not sure whether this will solve the issue. (The only thing I could find on Google was that sharedMaterial returns the actual reference whereas .material returns a copy for this specific object, but it doesn't say anything about setting it.)
So for now I'll implement this patch:

Old:
```csharp
renderer.material = foundPiece.Mat;
```
New:
```csharp
renderer.sharedMaterial = foundPiece.Mat;
```
Another question I have is: do I also need to do something similar for the MiniMap? I'm using the following code to update the images there:

```csharp
var img = ga.GetComponent<Image>();
img.sprite = foundPiece.Sprite;
```
I'm currently retraining the application, and it seems the issue hasn't really been resolved, as the memory usage of Unity is now at around 7 GB.
Could it be that simply setting .sharedMaterial isn't good enough?
If that's the case, is there another solution, or should I dispose of the materials themselves somehow? (Should this be done by doing something like Destroy(ga.GetComponent<MeshRenderer>().material)?)
Yes, using the sharedMaterial property may or may not be appropriate for what you are doing. You will need to destroy any material clones that you've created, using the Destroy function as you stated.
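As a concrete sketch of what that cleanup could look like when a track piece is torn down during map regeneration — note that the helper name is hypothetical and the `" (Instance)"` name check is a heuristic based on how Unity names instanced materials, not an official API:

```csharp
using UnityEngine;

// Hypothetical helper: destroy a renderer's instanced material, if any,
// before the owning object is destroyed or its material is replaced.
public static class MaterialCleanupExample
{
    public static void DestroyInstancedMaterial(Renderer rend)
    {
        var mat = rend.sharedMaterial;
        // Instanced copies get " (Instance)" appended to their name; this
        // check is a heuristic to avoid destroying the shared project asset.
        if (mat != null && mat.name.EndsWith(" (Instance)"))
        {
            Object.Destroy(mat);
        }
    }
}
```

This would be called, for example, right before destroying each old track piece GameObject.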
@surfnerd , hi, I had a bit of a busy period, so sorry for the delay in updates. Yesterday I made a fix to the code to destroy materials on recreation of the map. I'm not sure if this has fixed the issue though, because after one night of training I'm using about 8 GB of memory again:
Whether this has fixed the issue remains to be seen. I'll keep the training running for a few days and will inform you of the progress.
All commits I made: https://github.com/devedse/DS-MLUnityPrivate/commits/master
Thanks for the update. Can you also take a memory snapshot with the Memory profiler? I'd like to see what is taking up so much memory.
@surfnerd , I just took a snapshot when Unity was using ~20 GB of memory. When I then opened this snapshot, the Unity memory usage spiked to ~27 GB, which is why the memory usage in the screenshot below is higher.
Anyway, the snapshot:
And a screenshot of the whole Table sorted by reference count:
It seems there are a few gigabytes here and there, for example in shaders.
And Task Manager:
Is there anything else you would like to see in this Snapshot?
Hmm, this is pretty weird. The amount of total memory in the snapshot is definitely less than ~20GB. It's more on the scale of around 200-300MB. I'll take a look at the project again and see if I can find anything else.
I did a bit of scrolling through the snapshot as well and think it might be a bit higher than 200-300 MB, since there are quite a lot of instances of some specific 1 MB objects. For example, there are 935 instances of Texture2D.
However I also agree that this doesn't seem to add up to 20gb.
This morning, however, I checked again and saw the memory usage had dropped to 14 GB:
It could just be an artifact of the garbage collector not running that often, but I'm not really sure.
Another thing I saw was that the time between snapshots also gradually increases.
For the first few snapshots I see a difference of about 50 seconds per 1000 steps.
By the end this duration has increased to about 220 seconds:
```
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 1000. Time Elapsed: 52.796 s Mean Reward: -38.610. Std of Reward: 17.850. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2000. Time Elapsed: 102.428 s Mean Reward: -34.189. Std of Reward: 20.628. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 3000. Time Elapsed: 152.381 s Mean Reward: -23.578. Std of Reward: 20.009. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 4000. Time Elapsed: 206.723 s Mean Reward: -15.534. Std of Reward: 16.491. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 5000. Time Elapsed: 257.479 s Mean Reward: -7.267. Std of Reward: 10.994. Training.
...
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2319000. Time Elapsed: 317379.718 s Mean Reward: 1467.813. Std of Reward: 1560.897. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2320000. Time Elapsed: 317599.201 s Mean Reward: 601.707. Std of Reward: 293.869. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2321000. Time Elapsed: 317824.258 s Mean Reward: 1292.860. Std of Reward: 1052.086. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2322000. Time Elapsed: 318044.004 s Mean Reward: 1013.822. Std of Reward: 1115.018. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2323000. Time Elapsed: 318266.715 s Mean Reward: 1757.583. Std of Reward: 2141.269. Training.
```
And for completeness, the last Tensorboard output:
And smoothed:
Do you happen to have the memory snapshot files you can send to me? Or post to google drive or something. I don't think I'll be able to run your game for as long :P
One guess on why the training would be slowing down is that your memory is getting really fragmented with the creation and deletion of materials/textures/etc. It may also explain the heap size of the Unity process. I'll see if I can find anything with some of my own spelunking.
Aw shit can only do that next Monday. I'll post it then :).
Here's the last snapshot from Friday: https://drive.google.com/open?id=12PKfydD9lMUX3mtwgmEMSu0_uhusF7DR
And a new one from today: https://drive.google.com/file/d/1bHMs-2HassYa0EoeoGe3u_iYctBwt530/view?usp=sharing
The strange thing, though, is that the one from Friday was about 3 GB while the one from today is 10 GB.
Latest Task Manager:
Latest tensorboard:
This morning the run finally completed 😃. Unity still used a ton of memory though. The interesting part is that even after the run completed and Unity was idling in the Editor, it would still consume about 26 GB of memory.
I then tried pressing the play button to see how the newly trained model would perform. When I did this, Unity went into a "Not responding" state and still used about 22 GB of memory:
After restarting Unity everything worked again and the newly trained model performs great 😄
This issue has been automatically marked as stale because it has not had activity in the last 14 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.
@surfnerd , do you have any updates or ideas or shall we leave this issue as stale?
Hey @devedse, I apologize for not getting back to you. It has been quite busy lately. I'd like to investigate further but may not get to it for a while. I'll add the bug label and file this in our internal tracker. I'd like to get to the bottom of it.
Hi @devedse, I've filed this under MLA-535 in our internal tracker. We will prioritize this and update this issue when we have an update.
Friendly ping @devedse, I was wondering if you were still having this issue. From the debugging I did, I was unable to find a leak in our code. Were you able to find anything more in yours?
Hi @devedse, We have not been able to find any memory leaks on our end after a few months of testing. I am going to close this now. I hope you have resolved your issue. Cheers.
@surfnerd , sorry for my late response, but I haven't been working with Unity for a while. For now the issue seems to have been resolved by doing the following:
Once I get back to working with Unity, I'll run the training algorithm again and see if it keeps working. Thanks for the quick responses and the help you offered!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Describe the bug
Hi all, I've created a quite simple racing game where an agent learns to drive on a track:
The agent has the following Agent code:
During the training process you can see the memory usage of Unity.exe slowly increasing; after about a night of training I checked and saw that the memory usage capped out at 32 GB.
After about 10 minutes it dropped again to 16 GB (I would assume due to the garbage collector running).
After about 2 days of training I ran into the following issue:
I'm not exactly sure where to start when solving this issue so hopefully someone here could give some advice. Could it be due to a memory leak in ml-agents or does no one else see this problem?
I also replaced the memory (previously I had 16 GB; I moved to new sticks for a total of 32 GB), but that also did not solve the issue.
To Reproduce
See above
Console logs / stack traces
Only logging I have is the screenshot above
Screenshots
See above
Environment (please complete the following information):