coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

Reuse S3 connection (resource) through thread's lifetime #77

Closed: StephenChan closed this 9 months ago

StephenChan commented 9 months ago

Here's the first thing I wanted to look into for issue #73.

`boto3.resource('s3', <credentials>)` was being called once for each feature-vector download (once per image per epoch) during training. Training was taking somewhere between 0.1 and 0.3 seconds per image per epoch in many cases, and a connection/authentication step like this seemed like a potentially significant factor at that timescale.
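Roughly, the old pattern looked like this (illustrative function and parameter names, not a verbatim excerpt from pyspacer):

```python
import boto3

def download_feature_vector(bucket, key, access_key, secret_key):
    # A fresh resource (and connection/auth setup) on every single download.
    s3 = boto3.resource(
        's3',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    return s3.Object(bucket, key).get()['Body'].read()
```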

I wanted to see if that resource could be reused between feature-vector downloads, or indeed between any uses of boto's S3 API within the same thread. So I tested it, and yes, we can reuse the resource. I don't know if it expires after a certain time, but so far it's held up through a training of 60,000 images x 2 epochs (120,000 uses), and (in a separate test) about 18 hours. I'm still testing the longevity aspect to see if the resource lasts at least a day or two.
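A minimal sketch of the reuse being tested (again with made-up names, not pyspacer's actual code): cache the resource at module level so later downloads in the same process skip the setup step. As noted in a later comment, this first version relied on a module-level global, which turned out to share the resource across threads.

```python
import boto3

_s3_resource = None  # cached and reused for the life of the process

def get_s3_resource(access_key, secret_key):
    """Create the S3 resource on first use, then hand back the same one."""
    global _s3_resource
    if _s3_resource is None:
        _s3_resource = boto3.resource(
            's3',
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        )
    return _s3_resource

def download_feature_vector(bucket, key, access_key, secret_key):
    # Reuses the cached resource instead of rebuilding it per download.
    s3 = get_s3_resource(access_key, secret_key)
    return s3.Object(bucket, key).get()['Body'].read()
```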

As for the training performance improvement, see the following CSV of experiments (all EfficientNet + MLP). "reuse conn" is the new code, "new conn" is the old code:

train-time_s3-connection-reuse_2023-12.csv

(Abridged data, since the whole thing doesn't fit well in a GitHub thread:)

| Source | Images | Points/im | Epochs | Time/im/epoch, reuse conn (s) | Time/im/epoch, new conn (s) | Time-diff/im/epoch (s) | Speedup ratio |
|---|---|---|---|---|---|---|---|
| 1097 | 1686 | 10 | 10 | 0.046 | 0.113 | 0.067 | 2.48 |
| 1097 | 1686 | 10 | 2 | 0.061 | 0.133 | 0.072 | 2.182 |
| 1097 | 1686 | 10 | 2 | 0.055 | 0.128 | 0.073 | 2.312 |
| 1388 | 8706 | ~8.5 | 2 | 0.084 | 0.158 | 0.075 | 1.889 |
| 3064 | 2211 | 25 | 2 | 0.094 | 0.182 | 0.088 | 1.928 |
| 3413 | 5079 | 25 | 2 | 0.109 | 0.179 | 0.07 | 1.638 |
| 2138 | 720 | 49 | 2 | 0.087 | 0.182 | 0.095 | 2.1 |
| 3354 | 1600 | 50 | 2 | 0.106 | 0.195 | 0.088 | 1.832 |
| 1933 | 560 | 100 | 2 | 0.112 | 0.189 | 0.077 | 1.691 |
| 3520 | 253 | 200 | 2 | 0.115 | 0.189 | 0.074 | 1.639 |
| 1360 | 49 | 1000 | 2 | 0.313 | 0.358 | 0.045 | 1.143 |

So with coralnet's current infrastructure, the change saves roughly 0.07 seconds per image per epoch, and overall speeds up training by 1.6-2.2x. There's some variance of unknown origin, maybe just AWS server load, as seen in the multiple tests on source 1097. But I'd say the speedup is still fairly clear.

The time per image per epoch still isn't close to being linear in the points per image, so there's still more to look at for issue #73, but this is a start.

StephenChan commented 9 months ago

Reworked so that the connection isn't shared between threads (that was the original goal, but I'd done it wrong) and no longer needs the `global` keyword.
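One way the per-thread version could look (a hypothetical sketch, not necessarily the merged code): keep the cache in a `threading.local`, so each thread lazily builds and reuses its own resource and no `global` statement is needed.

```python
import threading

import boto3

_thread_local = threading.local()  # separate attribute namespace per thread

def get_s3_resource(access_key, secret_key):
    """Return this thread's cached S3 resource, creating it on first use."""
    s3 = getattr(_thread_local, 's3_resource', None)
    if s3 is None:
        s3 = boto3.resource(
            's3',
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        )
        _thread_local.s3_resource = s3
    return s3
```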

StephenChan commented 9 months ago

OK, with the latest code, the same connection/resource could still be reused more than 2 days later. Let's merge it.