StephenChan closed this 9 months ago
Reworked so that the connection isn't shared between threads (that was the original goal, but I did it wrong) and the code doesn't need the global keyword.
OK, with the latest code, the same connection/resource could still be reused more than 2 days later. Let's merge it.
Here's the first thing I wanted to look into for issue #73.
boto3.resource('s3', <credentials>)
was being called once for each feature-vector download (once per image per epoch) during training. Training was taking somewhere between 0.1 and 0.3 seconds per image per epoch in many cases, and a connection/authentication step like this seemed like a potentially significant factor at that timescale.
I wanted to see if that resource could be reused between feature-vector downloads, or indeed between any uses of boto's S3 API within the same thread. So I tested it, and yes, we can reuse the resource. I don't know if it expires after a certain time, but so far it's held up through a training run of 60,000 images x 2 epochs (120,000 uses), and (in a separate test) for about 18 hours. I'm still testing the longevity aspect to see if the resource lasts at least a day or two.
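For anyone skimming, here is a minimal sketch of the per-thread reuse idea. It is not the code in this PR: `get_thread_resource`, `_thread_state`, and the `factory` argument are hypothetical names chosen for illustration, and the boto3 call appears only in a comment so the snippet runs without AWS credentials.

```python
import threading

# One storage slot per thread. Setting attributes on a threading.local
# object mutates it rather than rebinding a module name, so no `global`
# statement is needed anywhere.
_thread_state = threading.local()


def get_thread_resource(factory):
    """Return the calling thread's cached resource, creating it on first use.

    `factory` is any zero-argument callable. With boto3 it could be
    something like (assumption, not taken from this PR):
        lambda: boto3.session.Session().resource("s3")
    """
    resource = getattr(_thread_state, "resource", None)
    if resource is None:
        # First use in this thread: pay the setup cost once.
        resource = factory()
        _thread_state.resource = resource
    return resource
```

Each thread pays the setup cost once and reuses its own object afterwards, so nothing is shared across threads. The sketch does not handle expiry; if a cached resource turned out to go stale, the caller would need to clear `_thread_state.resource` and retry.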
As for the training performance improvement, see the following CSV of experiments (all EfficientNet + MLP). "reuse conn" is the new code, "new conn" is the old code:
train-time_s3-connection-reuse_2023-12.csv
(Abridged data, since the whole thing doesn't fit well in a GitHub thread:)
So with coralnet's current infrastructure, the change saves roughly 0.07 seconds per image per epoch, and overall speeds up training by 1.6-2.2x. There's some variance of unknown origin, maybe just AWS server load, as seen in the multiple tests on source 1097. But I'd say the speedup is still fairly clear.
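As a rough sanity check on those figures, the arithmetic can be sketched as below. This is a back-of-the-envelope model, not data from the CSV: it assumes the saving is a constant 0.07 seconds per image per epoch and that nothing else changed between the old and new code. The function names are made up for illustration.

```python
def total_saving_seconds(num_images, num_epochs, saving_per_image_epoch):
    """Total time saved over a training run, assuming a constant saving."""
    return num_images * num_epochs * saving_per_image_epoch


def implied_times(saving_per_image_epoch, speedup):
    """Per-image-per-epoch times (new, old) implied by a saving and a speedup.

    From old = speedup * new and old - new = saving, it follows that
    new = saving / (speedup - 1).
    """
    new = saving_per_image_epoch / (speedup - 1)
    old = new + saving_per_image_epoch
    return new, old


# The 60,000 images x 2 epochs run mentioned earlier in this thread.
saved = total_saving_seconds(60_000, 2, 0.07)
print(f"total saved: {saved:.0f} s ({saved / 3600:.2f} h)")

# The two ends of the reported 1.6-2.2x speedup range.
for speedup in (1.6, 2.2):
    new, old = implied_times(0.07, speedup)
    print(f"{speedup}x: old {old:.3f} s, new {new:.3f} s per image per epoch")
```

Under those assumptions, a run of that size would save about 8,400 seconds, or roughly 2.3 hours, if the 0.07-second figure held for it.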
The time per image per epoch still isn't very close to being linearly related to the points per image, so there's still more to be looked at with issue #73, but this is a start.