Closed: StephenChan closed this issue 4 months ago.
Our slowest training times for EfficientNet sources:

- s372: 35k train images * 10 pts, 7h45m
- s3058: 20k train images * 5 pts, 4h56m
- s3363: 14k train images * 50 pts, 4h23m
- s3371: 14k train images * 10 pts, 4h15m
- s3411: 14k train images * 50 pts, 3h57m

Our slowest (completed) training times for VGG16 sources, with dates included to show that some have been using CoralNet 1.0 specs:

- s1052: 40k train images * 30 pts, 15h08m, 2020/10
- s295: 57k train images * 10 pts, 14h54m, 2019/11
- s1374: 16k train images * 30 pts, 9h35m, 2021/03
- s1656: 16k train images * 50 pts, 9h14m, 2021/09
- s371: 25k train images * 20 pts, 7h39m, 2020/04

(EDIT: Multiplied images by points to get the number of examples, although that may not be exact, since the point count is just the source's default.)

So, VGG16 sources' trainings took roughly 1x to 2x as long as EfficientNet sources' trainings for the same number of train images. Overall, both can get slow; we can ballpark both as about 1 second per training image.

Additionally, as of https://github.com/beijbom/coralnet/pull/409 (merged into production September 2021), we can see trainings that started but never completed (as Classifiers stuck in PENDING status). When I checked last month (December 2022), there were stuck trainings in 3 sources: 1052, 1374, and 1656, all VGG16. I don't know whether those stuck trainings were running even longer than 15 hours, but getting stuck seems to be correlated with long training times.

For the record, you should multiply the number of images by points per image, since that's the number of examples used to train the classifier. The biggest one has 1.2M examples.
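As a quick sanity check on the numbers above, here is a back-of-the-envelope script; the "1 second per training image" figure is a ballpark, not a measured constant, and the point counts are the sources' defaults:

```python
# Back-of-the-envelope: examples = train images * points per image,
# and seconds per train image, for a few of the slowest completed trainings.

# (train images, points per image, training time in minutes)
sources = {
    "s372 (EfficientNet)": (35_000, 10, 7 * 60 + 45),
    "s295 (VGG16)": (57_000, 10, 14 * 60 + 54),
    "s1052 (VGG16)": (40_000, 30, 15 * 60 + 8),
}

for name, (images, pts, minutes) in sources.items():
    examples = images * pts
    sec_per_image = minutes * 60 / images
    print(f"{name}: {examples / 1e6:.2f}M examples, "
          f"{sec_per_image:.2f} s per train image")
```

This reproduces the 1.2M-example figure for s1052 (40k * 30) and shows per-image times in the 0.8-1.4 s range, consistent with the ~1 s/image ballpark.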
OK, edited to include that. Though note that the point count is just the source default, and may not necessarily be applied to all images in the source.

OK, good point. It's really the total number of verified points; I was assuming that all points in all images were verified, but images times points per image is a reasonable proxy.
Those train-images counts only include images which have all points verified (confirmed). So that part has no ambiguity at least.
I was more pointing out that the source's default number of points may have changed during the source's lifetime, or maybe image-annotations were imported which didn't adhere to that default.
(Alternatively, I could count each source's point objects directly, but that's pretty slow in the DB.)
As of the past day, I can confirm that there isn't anything about the training code that makes large jobs impossible to complete (at least, large by CoralNet standards). They just need to be allowed to run long enough, up to ~50 hours in the cases of sources 295 and 1656. No CoralNet trainings are currently failing or getting stuck due to size.
A few optimization ideas I have at the moment:
Once PR #80 or equivalent is merged, I think we'll have enough of a performance improvement to close this issue: stacking a ~6x speedup from that PR on the ~2x speedup from PR #77. Any further speedup ideas can go in new issues/PRs.
And now that PR is merged.
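For scale, if the two speedups compose multiplicatively (an assumption; the actual combined effect depends on which stages each PR optimizes), the slowest completed training would drop to roughly:

```python
# Hypothetical combined effect of stacking the ~6x (PR #80) and ~2x (PR #77)
# speedups on the slowest completed training (s1052: 15h08m).
baseline_min = 15 * 60 + 8   # 908 minutes
combined_speedup = 6 * 2     # assumes the speedups multiply
print(f"~{baseline_min / combined_speedup:.0f} minutes")  # → ~76 minutes
```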
Motivation:
Training a CoralNet classifier on 10,000s of images has a train time measured in hours. This is true for both the MLP / EfficientNet case and the LR / VGG16 case. (Edit: I originally said MLP / EfficientNet was the faster case, but it's not; see the next post.)

As we get larger sources on CoralNet, or as we start to explore super-classifiers trained on multiple sources' data, the long train times will become more of a concern: mostly for usability, but eventually also for server load.
We could encourage more folks to switch from VGG16 to EfficientNet if we improve train time and allow more frequent training in the latter case. (Since VGG16 is deprecated and seems more of a pain to work with, I assume trying to speed it up is not worth it.)
David said an idea from 2020 was to switch from logistic regression to an MLP when the training set is large (I don't know the exact threshold). I'm not sure that's the actual logic in train_utils.py et al. at the moment; it looks to me like it just uses MLP for EfficientNet and LR for VGG16, but I could be wrong. Either way, it appears to use MLP for large training jobs (assuming EfficientNet), which is the concern this issue highlights.
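If the selection works as described (MLP for EfficientNet features, LR for VGG16 features), it could be sketched as below; the function and names here are illustrative, not pyspacer's actual API:

```python
# Hypothetical sketch of the trainer-selection logic described above.
def choose_trainer(feature_extractor: str) -> str:
    """Pick a classifier type based on the feature extractor."""
    if feature_extractor == "efficientnet":
        return "MLP"   # sklearn MLPClassifier
    elif feature_extractor == "vgg16":
        return "LR"    # logistic regression (legacy path)
    raise ValueError(f"Unknown extractor: {feature_extractor}")

print(choose_trainer("efficientnet"))  # → MLP
```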
Other notes from David:
Training the MLP just uses sklearn MLPClassifier on CPU. We didn't explore training time for this, and except for learning rate, the parameters are vanilla.
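For reference, a minimal example of that kind of vanilla CPU training with sklearn's MLPClassifier; the feature dimensions, layer size, and learning rate here are illustrative stand-ins, not CoralNet's actual settings:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy stand-in for extracted features: 200 examples, 64-dim, 3 classes.
X = rng.normal(size=(200, 64))
y = rng.integers(0, 3, size=200)

clf = MLPClassifier(
    hidden_layer_sizes=(100,),   # illustrative
    learning_rate_init=1e-3,     # illustrative
    max_iter=50,                 # kept small for the toy example
    random_state=0,
)
clf.fit(X, y)
print(clf.predict(X[:5]).shape)  # (5,)
```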
Speeding up MLP training would involve:
We don't need to deliver a research paper on this; just some systematic experimentation to improve speed while maintaining the same level of accuracy, with confidence that it's delivering.