Farama-Foundation / Gymnasium-Robotics

A collection of robotics simulation environments for reinforcement learning
https://robotics.farama.org/
MIT License
485 stars 79 forks source link

Change XML saving for maze to fix infrequent crash #206

Open mishmish66 opened 6 months ago

mishmish66 commented 6 months ago

Description

I changed the way that temporary xmls are stored for the maze envs. I have used this change for training and it improves stability, without it my training would crash sometimes, usually after about 1m env steps. With this change it seems to be stable.

Type of change

Please delete options that are not relevant.

Checklist:

Kallinteris-Andreas commented 6 months ago

Can you please elaborate on why the crashing was happening prior to this change?

mishmish66 commented 6 months ago

Honestly it's a little unclear to me, the crash threw an XML parsing error giving a line and reason, but obviously the XML is supposed to be working. My guess is either the XML tool we are using to write to the XML file does not flush the buffers correctly internally, and since here we use the open() context manager it flushes as needed, or it's because there is a collision since I'm using a lot of parallelism and this is using timestamping. With these changes the directories are unique regardless of time, and also the file is flushed fully.

mishmish66 commented 6 months ago

@Kallinteris-Andreas is there anything else I should do to get this fix in?

Kallinteris-Andreas commented 5 months ago

@mishmish66 any progress on finding the test?

Note: I am willing to merge without a test, since it is hard to assert concurrency behavior

mishmish66 commented 5 months ago

No, I couldn't replicate the errors in a test which is weird. I was seeing this on a node in a cluster so I guess testing it locally is slightly different, but in any case I guess it isn't a very significant issue if it only happens on strangely configured cluster nodes.

Kallinteris-Andreas commented 5 months ago

Just remove test_multiprocessing_not_crash and I can merge this, it is still an improvement over the existing implementation

Kallinteris-Andreas commented 5 months ago

@mishmish66 rebase and remove the test, and I can merge

mishmish66 commented 5 months ago

I'll do that! Now that the semester is going again it may take me a second to get to it. Thanks!

Kallinteris-Andreas commented 3 months ago

@mishmish66 please remove tests and rebase

mishmish66 commented 2 months ago

@Kallinteris-Andreas I rebased and fixed a warning that was thrown from the env on cleanup that was causing the tests to fail. I also added the change to maze-v3 envs, let me know if that is the wrong thing to do and I'll go remove that!

Thanks for your patience!