coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

Reuse S3 connection (resource) through thread's lifetime #77

Closed: StephenChan closed this 9 months ago

StephenChan commented 9 months ago

Here's the first thing I wanted to look into for issue #73.

`boto3.resource('s3', <credentials>)` was being called once for each feature-vector download (once per image per epoch) during training. Training was taking somewhere between 0.1 and 0.3 seconds per image per epoch in many cases, and a connection/authentication step like this seemed like a potentially significant factor at that timescale.
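Roughly, the old pattern looked like this (illustrative function and parameter names, not a verbatim excerpt from pyspacer):

```python
import boto3

def download_feature_vector(bucket, key, access_key, secret_key):
    # A fresh resource (and connection/auth setup) on every single download.
    s3 = boto3.resource(
        's3',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    return s3.Object(bucket, key).get()['Body'].read()
```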

I wanted to see if that resource could be reused between feature-vector downloads, or indeed between any uses of boto's S3 API within the same thread. So I tested it, and yes, we can reuse the resource. I don't know if it expires after a certain time, but so far it's held up through a training of 60,000 images x 2 epochs (120,000 uses), and (in a separate test) about 18 hours. I'm still testing the longevity aspect to see if the resource lasts at least a day or two.
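A minimal sketch of the reuse being tested (again with made-up names, not pyspacer's actual code): cache the resource at module level so later downloads in the same process skip the setup step. As noted in a later comment, this first version relied on a module-level global, which turned out to share the resource across threads.

```python
import boto3

_s3_resource = None  # cached and reused for the life of the process

def get_s3_resource(access_key, secret_key):
    """Create the S3 resource on first use, then hand back the same one."""
    global _s3_resource
    if _s3_resource is None:
        _s3_resource = boto3.resource(
            's3',
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        )
    return _s3_resource

def download_feature_vector(bucket, key, access_key, secret_key):
    # Reuses the cached resource instead of rebuilding it per download.
    s3 = get_s3_resource(access_key, secret_key)
    return s3.Object(bucket, key).get()['Body'].read()
```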

As for the training performance improvement, see the following CSV of experiments (all EfficientNet + MLP). "reuse conn" is the new code, "new conn" is the old code:

train-time_s3-connection-reuse_2023-12.csv

(Abridged data, since the whole thing doesn't fit well in a GitHub thread:)

| Source | Images | Points/im | Epochs | Time/im/epoch, reuse conn (s) | Time/im/epoch, new conn (s) | Time-diff/im/epoch (s) | Speedup ratio |
|---|---|---|---|---|---|---|---|
| 1097 | 1686 | 10 | 10 | 0.046 | 0.113 | 0.067 | 2.48 |
| 1097 | 1686 | 10 | 2 | 0.061 | 0.133 | 0.072 | 2.182 |
| 1097 | 1686 | 10 | 2 | 0.055 | 0.128 | 0.073 | 2.312 |
| 1388 | 8706 | ~8.5 | 2 | 0.084 | 0.158 | 0.075 | 1.889 |
| 3064 | 2211 | 25 | 2 | 0.094 | 0.182 | 0.088 | 1.928 |
| 3413 | 5079 | 25 | 2 | 0.109 | 0.179 | 0.07 | 1.638 |
| 2138 | 720 | 49 | 2 | 0.087 | 0.182 | 0.095 | 2.1 |
| 3354 | 1600 | 50 | 2 | 0.106 | 0.195 | 0.088 | 1.832 |
| 1933 | 560 | 100 | 2 | 0.112 | 0.189 | 0.077 | 1.691 |
| 3520 | 253 | 200 | 2 | 0.115 | 0.189 | 0.074 | 1.639 |
| 1360 | 49 | 1000 | 2 | 0.313 | 0.358 | 0.045 | 1.143 |

So with coralnet's current infrastructure, the change saves roughly 0.07 seconds per image per epoch, and overall speeds up training by 1.6-2.2x. There's some variance of unknown origin, maybe just AWS server load, as seen in the multiple tests on source 1097. But I'd say the speedup is still fairly clear.

The time per image per epoch still isn't close to being linear in the points per image, so there's still more to look at for issue #73, but this is a start.

StephenChan commented 9 months ago

Reworked so that the connection isn't shared between threads (that was the original goal, but I'd done it wrong) and no longer needs the `global` keyword.
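One way the per-thread version could look (a hypothetical sketch, not necessarily the merged code): keep the cache in a `threading.local`, so each thread lazily builds and reuses its own resource and no `global` statement is needed.

```python
import threading

import boto3

_thread_local = threading.local()  # separate attribute namespace per thread

def get_s3_resource(access_key, secret_key):
    """Return this thread's cached S3 resource, creating it on first use."""
    s3 = getattr(_thread_local, 's3_resource', None)
    if s3 is None:
        s3 = boto3.resource(
            's3',
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        )
        _thread_local.s3_resource = s3
    return s3
```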

StephenChan commented 9 months ago

OK, with the latest code, the same connection/resource could still be reused more than 2 days later. Let's merge it.