I am getting mighty annoyed. Windows respects the keyboard interrupt just fine, but on WSL the program completely ignores every attempt I make to tell it to stop.
I recognize that it works a bit differently because the worker threads are running and need to be told to stop, but it's like the main program isn't even registering a perfectly valid keyboard interrupt.
Alright, I did manage to get the interrupt working on both platforms. It kinda feels like WSL just suddenly decided it felt like working.
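For reference, the usual way to handle this looks roughly like the sketch below (a minimal sketch with placeholder names like `stop_event` and `worker_loop`, not the actual training code): the main thread is the one that actually receives the Ctrl+C, so it catches the KeyboardInterrupt and flips a shared flag that the workers poll.

```python
import threading
import time

stop_event = threading.Event()

def worker_loop(worker_id: int) -> None:
    # Workers never see the interrupt themselves; they just poll the flag.
    while not stop_event.is_set():
        time.sleep(0.1)  # stand-in for one unit of training work

workers = [threading.Thread(target=worker_loop, args=(i,)) for i in range(2)]
for w in workers:
    w.start()

try:
    while any(w.is_alive() for w in workers):
        for w in workers:
            w.join(timeout=0.5)  # short joins keep the main thread interruptible
except KeyboardInterrupt:
    stop_event.set()  # tell every worker to wind down cleanly
    for w in workers:
        w.join()
```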
Now I am going to test out different worker counts for the training. My guess is that 2 workers is all this is really going to benefit from, but I want to make sure we are making the most of this extra training potential.
Starting a new run of the parallel Linux training with 2 workers now.
It is saying the run took 20 minutes to train... That is an insanely large speedup, and it leads me to believe that some shortcuts were made in the training.
Looks like the majority of scores were negative, and the fall-off from the initial testing was much steeper than in normal training iterations.
Starting regular training now to get a better idea of the contrast between the two.
Alright, slightly longer. This took 14 hours. This is the graph.
I can think of zero reason this change would improve the training time this much (20 minutes vs. 14 hours, a roughly 40x difference), other than that it is not doing what it should be. And the training graph seems to prove it. I am going to run some more tests and see if any of them look anything like what we expect from a normal training graph.
Alright, really obvious problem time. The reason it's so much faster is that it's not doing anything. The method I used for NEAT training worked because each child process was independent and didn't need to communicate with its siblings. It also worked because the result each process found was returned to its parent process.
So I need to find some way of letting the child processes (and the parent) share that model with each other. This may end up being more overhead than it's worth, but if it's possible to get a performance boost from this small improvement, I will be super happy.
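One possible shape for that, sketched with pure placeholder code (no real model; `train_chunk` and the simple averaging are just stand-ins, not a decided design): each worker gets a copy of the current weights, does its share of the work, returns its result, and the parent folds everything back into one model.

```python
from multiprocessing import Pool

def train_chunk(args):
    weights, chunk = args
    # Placeholder: pretend each worker nudges the weights based on its chunk.
    return [w + 0.01 * c for w, c in zip(weights, chunk)]

def parallel_step(weights, chunks, workers=2):
    with Pool(processes=workers) as pool:
        results = pool.map(train_chunk, [(weights, c) for c in chunks])
    # Fold the workers' results back into a single set of weights.
    # (Simple average here; the real merge rule is the open question.)
    return [sum(vals) / len(vals) for vals in zip(*results)]

if __name__ == "__main__":
    weights = [0.0, 0.0, 0.0]
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    weights = parallel_step(weights, chunks)
    print(weights)
```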
Starting another run now using thread-based parallelism. That way the workers share the same memory space, without a bunch of communication overhead that might cancel out any improvement.
This also means we are even more limited in the number of "workers" that will be of any value, because there can only be as many threads as there are on the processor the process is run on, which for me is an AMD Ryzen 5 5600X 6-core processor that is dual-threaded. Therefore, if I understand this all correctly, the maximum benefit would be limited to only 2 "workers".
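For contrast with the process-based version, thread-based sharing looks roughly like this toy sketch (a dict standing in for the real model, not the actual training code): every worker touches the exact same object in memory, and a lock keeps their updates from clobbering each other.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

shared_model = {"weights": [0.0, 0.0, 0.0]}  # stand-in for the real model
model_lock = Lock()

def thread_worker(chunk):
    # Every thread sees the same shared_model object, no copying needed.
    local_update = [0.01 * c for c in chunk]
    with model_lock:  # serialize writes so updates don't stomp on each other
        shared_model["weights"] = [
            w + u for w, u in zip(shared_model["weights"], local_update)
        ]

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(thread_worker, chunks))

print(shared_model["weights"])
```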
Alright, this run took 16 hours 15 minutes. This was a run with the thread-based improvement but only a single worker, so there definitely shouldn't have been an improvement; I did not expect such a decline in performance, though, so it might point to this solution not being very beneficial.
The only thing I find curious is that there were 0 timeouts in this testing... It's really hard to understand why there would be none at all, so I am going to run another with 2 threads, and if that doesn't show any either, then something is strange with how I am identifying them.
Now that I think about it, we just recently switched from the CPU to the GPU for processing these model updates. If that is the case, wouldn't multiple threads not matter at all, since they are all going to be bottlenecked on the GPU requests they can make?
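One way to sanity-check that theory, assuming a PyTorch-style setup (that framework choice is an assumption; swap in the equivalent calls for whatever is actually in use), is to time the device-side work by itself: if it dominates each iteration, extra CPU threads will just queue up behind the GPU.

```python
import time
import torch  # assumption: the model updates go through PyTorch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 64).to(device)   # stand-in for the real model
batch = torch.randn(256, 64, device=device)  # stand-in for a real batch

# Time a handful of forward passes; if this dominates the per-iteration
# time, adding CPU threads that all submit GPU work won't buy much.
start = time.perf_counter()
for _ in range(100):
    _ = model(batch)
if device == "cuda":
    torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
elapsed = time.perf_counter() - start
print(f"{device}: {elapsed:.3f}s for 100 forward passes")
```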
This was already a cheesy strategy to try and milk a little more performance out of the training before I have to just bite the bullet and implement a proper large-scale solution to improve the training method.
Alright, it seized up. It got good results, but I can't think of why it would have stopped. Might be something to do with my interpreter or something like that. From the timestamps of the modified files, it looks like the run took about 15 hours, which is slower than the normal runs. I am going to try on Windows, where GPU usage is not supported, and see if the thread-based processing speeds it up at all there.
Here is the standard time delta: 14:44:14.146766
Here is the processed time: 19:52:24.828125
Alright, this is significantly disappointing, so I will not be moving forward with these changes.
So, I am really trying my best to speed this up, as our current projection for 10,000 training iterations is about 500 days. Which is insane.
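Just to spell out the arithmetic behind that projection (back-of-the-envelope only, using the 500-day / 10,000-iteration numbers above):

```python
from datetime import timedelta

iterations = 10_000
projected = timedelta(days=500)
print(projected / iterations)  # 1:12:00, i.e. roughly 72 minutes per iteration
```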
I don't expect this to be an insane improvement, but I am hoping it helps a decent amount.
I will be running tests to figure out the sweet spot for the number of workers to use here, since this is different from the normal case where throwing more workers at the problem increases speed. There is some number of workers that maximizes the value of this, since there are locks preventing the workers from updating the shared model at the same time, and past that point they mostly just end up waiting on each other.
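To make the "sweet spot" idea concrete, here is the kind of toy measurement I have in mind (not the training code, just threads contending on one lock, with sleeps standing in for real work): once the serialized, locked portion dominates, adding workers stops reducing wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

lock = Lock()
TOTAL_UPDATES = 400  # fixed amount of work, split across the workers

def worker(n_updates):
    for _ in range(n_updates):
        with lock:             # the shared-model update has to be serialized
            time.sleep(0.001)  # stand-in for the locked part of an update
        time.sleep(0.001)      # stand-in for work that can overlap (env steps, etc.)

def timed_run(n_workers):
    per_worker = TOTAL_UPDATES // n_workers
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(worker, [per_worker] * n_workers))
    return time.perf_counter() - start

for n in (1, 2, 4, 8):
    print(f"{n} workers: {timed_run(n):.2f}s")
```

With these made-up numbers, the locked half caps the speedup at roughly 2x no matter how many workers you add, which is exactly the kind of curve I am expecting to find with the real tests.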