Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Fatal Error: Could not allocate memory #3024

Closed devedse closed 4 years ago

devedse commented 4 years ago

Describe the bug

Hi all, I've created a fairly simple racing game where an agent learns to drive on a track:

image

image

The agent has the following Agent code:

public class RaceGameAgent : Agent
{
    public RoadDetectorScript RoadDetector;
    public ArcadeCar ArcadeCar;
    public RaceTrackScript RaceTrackScript;

    public Vector3 startPos;

    public int FrameCountSinceReset = 0;

    public void Start()
    {
        startPos = ArcadeCar.transform.position;
    }

    public override void AgentAction(float[] vectorAction, string textAction, CustomActionProto customAction)
    {
        ArcadeCar.AIXAxisInput = vectorAction[0];
        ArcadeCar.AIYAxisInput = vectorAction[1];
        ArcadeCar.AIHandBrakeInput = vectorAction[2] > 0.5f;

        //For some reason it resets twice, I think it's due to Resetting not creating gameobjects in the same fixedupdate
        if (FrameCountSinceReset > 1)
        {
            if (RoadDetector.CountWheelsOnRoad == 0)
            {
                Done();
                SetReward(-1f);
            }
            else
            {
                if (RoadDetector.CountWheelsOffRoad == 0)
                {
                    SetReward((ArcadeCar.GetSpeed() * 3.6f - 10.0f) / 200.0f);
                }
                else
                {
                    SetReward(RoadDetector.CountWheelsOffRoad * -0.1f);
                }
            }

            if (GetCumulativeReward() < -100f)
            {
                Done();
            }
        }

        FrameCountSinceReset++;
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {

    }

    public override void CollectObservations()
    {
        this.AddVectorObs(gameObject.transform.rotation);
        this.AddVectorObs(ArcadeCar.transform.GetComponent<Rigidbody>().velocity);
        this.AddVectorObs(ArcadeCar.GetSpeed());

        var tracers = gameObject.GetComponentsInChildren<TracerScript>();
        foreach (var tracer in tracers)
        {
            this.AddVectorObs(tracer.CastRay());
        }

        this.AddVectorObs(RoadDetector.WheelLFIsRoad);
        this.AddVectorObs(RoadDetector.WheelRFIsRoad);
        this.AddVectorObs(RoadDetector.WheelLBIsRoad);
        this.AddVectorObs(RoadDetector.WheelRBIsRoad);

        this.AddVectorObs(RoadDetector.CountWheelsOnRoad);
        this.AddVectorObs(RoadDetector.CountWheelsOffRoad);

        base.CollectObservations();
    }

    public override void AgentReset()
    {
        ArcadeCar.Reset(startPos, 180);
        RaceTrackScript.Reset();
        FrameCountSinceReset = 0;
    }
}

During training, the memory usage of Unity.exe slowly increases. After about a night of training I checked and saw that it had maxed out at 32 GB.

After about 10 minutes it dropped again to 16 GB (I would assume due to the garbage collector running).

After about 2 days of training I ran into the following issue:

image

I'm not exactly sure where to start when solving this issue so hopefully someone here could give some advice. Could it be due to a memory leak in ml-agents or does no one else see this problem?

I also replaced the memory (previously I had 16 GB and I have now moved to new sticks for a total of 32 GB), but that did not solve the issue either.

To Reproduce

See above

Console logs / stack traces

The only logging I have is the screenshot above

Screenshots

See above

Environment (please complete the following information):

surfnerd commented 4 years ago

Hi @devedse, from a quick look at your screenshots, it seems like the allocation is happening in your code. Are you sure that there isn't a memory leak in your RaceTrackScript?

Could you share your project with us? Without an in-depth view into your code, it will be hard to know what's going on. You can always install the Unity Memory Profiler from the Package Manager to see what type of objects are leaking.

devedse commented 4 years ago

I'll do some investigation with the memory profiler tomorrow first. Would it be possible to share a private repository?

surfnerd commented 4 years ago

You could add me as a collaborator and remove me whenever you feel.

devedse commented 4 years ago

I've been running the memory profiler.

Screenshot of memory usage just after start: image

Screenshot of memory usage after 5 minutes of running: image

And the diff: image

I've also created a private repository which I will add you to. To actually start training I'm currently using the TrainRace.cmd script from our fork of ml-agents: https://github.com/devedse/ml-agents/blob/master/TrainRace.cmd

You'll receive an invite for the other private repository shortly 😄

devedse commented 4 years ago

When digging through the Memory Profile I see a lot of arrays with empty values, e.g.: image

The same happens with arrays for UINT64 and String. Loads of 0 values or null values.

surfnerd commented 4 years ago

I got the invite. I'll see if I can take some time today to run a profiling trace myself. We haven't seen any memory leaks with our nightly training sessions. But maybe you are hitting a case that we haven't covered. Thanks for your cooperation and help debugging this.

surfnerd commented 4 years ago

Are you by chance using visual observations?

devedse commented 4 years ago

Hey, that's great, thanks. It could very well be a misunderstanding on my part about how specific things in Unity work, but let's see.

I don't use visual observations yet. The inputs of the ML agent are a bunch of raycasts that return booleans indicating whether they hit the road or the grass.

surfnerd commented 4 years ago

Hey @devedse, I was able to reproduce the leak without training. I'm just running the RaceGameScene and seeing large buffers of UINT64 and System.Byte[]. I'm still not sure if it's coming from ML-Agents or not. I will dig a bit further.

surfnerd commented 4 years ago

So the large amount of INT64, String, and Byte[] seems to be coming from the profiler itself, which is unfortunate noise. I'll keep digging...

surfnerd commented 4 years ago

I do, however, see your material memory growing significantly. It grew from around 20 MB about 1 minute after the start to about 70 MB 10 minutes later.

Here is a screen shot of a small portion of the Material memory.
Screen Shot 2019-12-05 at 9 51 14 AM

Please make sure you are correctly disposing of materials that are no longer being used, or mark them as shared if you are going to reuse them. I would guess that this could be a leak from the random generation of your track, but I'm not sure.

I hope this helps.

surfnerd commented 4 years ago

Related forum thread about disposing of materials https://forum.unity.com/threads/unityengine-material-object-memory-leak.48623/

surfnerd commented 4 years ago

Since the materials have the (Instance) tag, it tells me you need to dispose of them manually. This should solve your memory leak.

devedse commented 4 years ago

@surfnerd , thanks for this, I'm sorry for the confusion around this.

I couldn't find a way to Dispose a material.

What I'll try is setting meshRenderer.sharedMaterial instead of meshRenderer.material. I'm not sure whether this will solve the issue as well. (The only thing I could find on Google was that sharedMaterial returns the actual reference whereas .material returns a copy for this specific object. But it doesn't state anything about setting it.)

So for now I'll implement this patch:

Old

renderer.material = foundPiece.Mat;

New

renderer.sharedMaterial = foundPiece.Mat;

Another question I have is, do I also need to do something similar for the MiniMap? I'm using the following code to update the images there:

var img = ga.GetComponent<Image>();
img.sprite = foundPiece.Sprite;

devedse commented 4 years ago

I'm currently retraining the application and the issue does not really seem to be resolved, as the memory usage of Unity is now at around 7 GB.

Could it be that simply setting .sharedMaterial isn't good enough?

If that's the case, is there another solution or should I dispose of the materials themselves somehow? (Should this be done with something like Destroy(ga.GetComponent<MeshRenderer>().material)?)

surfnerd commented 4 years ago

Yes, using the sharedMaterial property may or may not be appropriate for what you are doing. You will need to destroy any material clones that you’ve created using the Destroy function as you stated.
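The advice above can be illustrated with a minimal sketch. It assumes a hypothetical TrackPieceMaterial component attached to each generated track piece; the class and method names are invented for illustration and are not taken from the project in this thread. It uses only Renderer.sharedMaterial, the Material copy constructor, and Object.Destroy. The idea is to share one material where no per-object changes are needed, and to destroy any explicit copy when it is replaced or when the piece itself is destroyed.

```csharp
using UnityEngine;

// Hypothetical helper for illustration only; not part of the project in this thread.
public class TrackPieceMaterial : MonoBehaviour
{
    // Set only when this piece owns a private copy of a material.
    private Material ownedCopy;

    // Assign a material without cloning it: every piece references the same asset.
    public void UseShared(Material source)
    {
        ReleaseOwnedCopy();
        GetComponent<Renderer>().sharedMaterial = source;
    }

    // Create a private copy, for pieces that must be modified individually.
    // The copy is owned by this component and has to be destroyed explicitly.
    public Material UseOwnCopy(Material source)
    {
        ReleaseOwnedCopy();
        ownedCopy = new Material(source);
        GetComponent<Renderer>().sharedMaterial = ownedCopy;
        return ownedCopy;
    }

    // Destroy the private copy, if any, so it does not outlive this piece.
    private void ReleaseOwnedCopy()
    {
        if (ownedCopy != null)
        {
            Destroy(ownedCopy);
            ownedCopy = null;
        }
    }

    // Unity calls this when the track piece is destroyed, e.g. on map regeneration.
    private void OnDestroy()
    {
        ReleaseOwnedCopy();
    }
}
```

Whether sharing or copying is appropriate depends on whether individual pieces ever change material properties at runtime; that is an assumption to check against the actual track generation code.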

devedse commented 4 years ago

@surfnerd , Hi, had a bit of a busy period so sorry for the delay in updates. Yesterday I made a fix to the code to now destroy materials on recreation of the map. I'm not sure if this has fixed the issue though because after one night of training I'm using about 8gb of memory again:

image

Whether this has fixed the issue is to be seen. I'll keep the training running for a few days and will inform you on the progress.

All commits I made: https://github.com/devedse/DS-MLUnityPrivate/commits/master

surfnerd commented 4 years ago

Thanks for the update. Can you also take a memory snapshot with the Memory profiler? I'd like to see what is taking up so much memory.

devedse commented 4 years ago

@surfnerd , I just took a snapshot when Unity was using ~20 GB of memory. When I then opened this snapshot, the Unity memory usage spiked to ~27 GB, which is why the memory usage in the screenshot below is higher.

Anyway, the snapshot: image

And a screenshot of the whole Table sorted by reference count: image

It seems there are a few gigabytes here and there, for example in shaders.

And Task Manager: image

Is there anything else you would like to see in this Snapshot?

surfnerd commented 4 years ago

Hmm, this is pretty weird. The amount of total memory in the snapshot is definitely less than ~20GB. It's more on the scale of around 200-300MB. I'll take a look at the project again and see if I can find anything else.

devedse commented 4 years ago

I did a bit of scrolling through the snapshot as well and think it might be a bit higher than 200-300 MB, because there are quite a lot of instances of some specific 1 MB objects. For example, there are 935 instances of Texture2D.

However, I also agree that this doesn't seem to add up to 20 GB.

This morning however I checked again and saw the memory usage dropped to 14GB: image

It could possibly just be an artifact of the Garbage collector not running that often, but I'm not really sure.

devedse commented 4 years ago

Another thing I saw was that the time between training summaries also gradually increases.

For the first few summaries there's a difference of about 50 seconds per 1000 steps.

At the end this has increased to about 220 seconds per 1000 steps:

INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 1000. Time Elapsed: 52.796 s Mean Reward: -38.610. Std of Reward: 17.850. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2000. Time Elapsed: 102.428 s Mean Reward: -34.189. Std of Reward: 20.628. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 3000. Time Elapsed: 152.381 s Mean Reward: -23.578. Std of Reward: 20.009. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 4000. Time Elapsed: 206.723 s Mean Reward: -15.534. Std of Reward: 16.491. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 5000. Time Elapsed: 257.479 s Mean Reward: -7.267. Std of Reward: 10.994. Training.

...
...
...

INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2319000. Time Elapsed: 317379.718 s Mean Reward: 1467.813. Std of Reward: 1560.897. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2320000. Time Elapsed: 317599.201 s Mean Reward: 601.707. Std of Reward: 293.869. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2321000. Time Elapsed: 317824.258 s Mean Reward: 1292.860. Std of Reward: 1052.086. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2322000. Time Elapsed: 318044.004 s Mean Reward: 1013.822. Std of Reward: 1115.018. Training.
INFO:mlagents.trainers: race_7: RaceGameLearningBrain: Step: 2323000. Time Elapsed: 318266.715 s Mean Reward: 1757.583. Std of Reward: 2141.269. Training.
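
The slowdown can be read directly off these lines: 102.428 s - 52.796 s is about 49.6 s for steps 1000 to 2000, while 317599.201 s - 317379.718 s is about 219.5 s for steps 2319000 to 2320000. A minimal sketch of that arithmetic follows, assuming the summary line format shown above; the StepTiming class and its method name are hypothetical helpers, not part of ml-agents.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of ml-agents.
public static class StepTiming
{
    // Matches the "Step: N. Time Elapsed: T s" part of a trainer summary line.
    private static readonly Regex Pattern =
        new Regex(@"Step: (\d+)\. Time Elapsed: ([\d.]+) s");

    // For each pair of consecutive summary lines, returns the elapsed
    // wall-clock seconds normalized to 1000 training steps.
    // Lines that do not match the pattern (such as "...") are skipped.
    public static List<double> SecondsPer1000Steps(IEnumerable<string> lines)
    {
        var rates = new List<double>();
        long previousStep = 0;
        double previousTime = 0.0;
        bool hasPrevious = false;

        foreach (var line in lines)
        {
            var match = Pattern.Match(line);
            if (!match.Success) continue;

            long step = long.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
            double time = double.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture);

            if (hasPrevious && step > previousStep)
            {
                rates.Add((time - previousTime) / (step - previousStep) * 1000.0);
            }

            previousStep = step;
            previousTime = time;
            hasPrevious = true;
        }
        return rates;
    }
}
```

Because the result is normalized by the step difference, a gap in the log (such as the elided section above) yields the average rate over that gap rather than a misleading spike.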

And for completeness, the last Tensorboard output: image

And smoothed: image

surfnerd commented 4 years ago

Do you happen to have the memory snapshot files you can send to me? Or post to google drive or something. I don't think I'll be able to run your game for as long :P

surfnerd commented 4 years ago

One guess on why the training would be slowing down is if your memory is getting really fragmented with the creation and deletion of materials/textures/etc. It may also explain the heap size of the unity process. I'll see if I can find anything with some of my own spelunking.

devedse commented 4 years ago

Aw shit can only do that next Monday. I'll post it then :).

devedse commented 4 years ago

Here's the last snapshot from Friday: https://drive.google.com/open?id=12PKfydD9lMUX3mtwgmEMSu0_uhusF7DR

And a new one from today: https://drive.google.com/file/d/1bHMs-2HassYa0EoeoGe3u_iYctBwt530/view?usp=sharing

Strange thing is though that the one from Friday was about 3gb while the one from today is 10gb.

Latest Task Manager:

image

Latest tensorboard: image

devedse commented 4 years ago

This morning the run finally completed 😃. Unity still used a ton of memory though. The interesting part is that even after the run completed and Unity was idling in the Editor, it still consumed about 26 GB of memory.

I then tried pressing the play button to see how the newly trained model would perform. When I did this, Unity went into a "Not responding" state and still used about 22 GB of memory:

image

After restarting Unity everything worked again and the newly trained model performs great 😄

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 14 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

devedse commented 4 years ago

@surfnerd , do you have any updates or ideas or shall we leave this issue as stale?

surfnerd commented 4 years ago

Hey @devedse, I apologize for not getting back to you. It has been quite busy lately. I'd like to investigate further but may not get to it for a while. I'll add the bug label and file this in our internal tracker. I'd like to get to the bottom of it.

surfnerd commented 4 years ago

Hi @devedse, I've filed this under MLA-535 in our internal tracker. We will prioritize this and update this issue when we have an update.

surfnerd commented 4 years ago

Friendly ping @devedse, I was wondering if you were still having this issue. From the debugging I did, I was unable to find a leak in our code. Were you able to find any more in yours?

surfnerd commented 4 years ago

Hi @devedse, We have not been able to find any memory leaks on our end after a few months of testing. I am going to close this now. I hope you have resolved your issue. Cheers.

devedse commented 4 years ago

@surfnerd , sorry for my late response, but I haven't been working with Unity for a while. For now the issue seems to have been resolved by doing the following:

image

Once I get back to working with Unity I'll run the training algorithm again and see if it keeps working. Thanks for the quick responses and the help you offered!

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.