Closed: StephenChan closed this issue 4 months ago.
Our slowest training times for EfficientNet sources:

- s372: 35k train images * 10 pts, 7h45m
- s3058: 20k train images * 5 pts, 4h56m
- s3363: 14k train images * 50 pts, 4h23m
- s3371: 14k train images * 10 pts, 4h15m
- s3411: 14k train images * 50 pts, 3h57m

Our slowest (completed) training times for VGG16 sources, with dates included to show that some have been using CoralNet 1.0 specs:

- s1052: 40k train images * 30 pts, 15h08m, 2020/10
- s295: 57k train images * 10 pts, 14h54m, 2019/11
- s1374: 16k train images * 30 pts, 9h35m, 2021/03
- s1656: 16k train images * 50 pts, 9h14m, 2021/09
- s371: 25k train images * 20 pts, 7h39m, 2020/04

(EDIT: Multiplied images by points to get the number of examples, although that may not be exact, since the point count is just the source's default.)

So, VGG16 sources' trainings took roughly 1x to 2x as long as EfficientNet sources' trainings for the same number of train images. Overall, both can get slow; we can ballpark both as about 1 second per training image.

Additionally, as of https://github.com/beijbom/coralnet/pull/409 (merged into production September 2021), we can see trainings that started but never completed (as Classifiers stuck in PENDING status). When I checked last month (December 2022), there were stuck trainings in 3 sources: 1052, 1374, and 1656, all VGG16. I don't know whether those stuck trainings were running even longer than 15 hours, but getting stuck seems to be correlated with long training times.

For the record, you should multiply the number of images by points per image, since that's the number of examples used to train the classifier. The biggest one has 1.2M examples.
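As a quick sanity check on the numbers above, here is a back-of-the-envelope script; the "1 second per training image" figure is a ballpark, not a measured constant, and the point counts are the sources' defaults:

```python
# Back-of-the-envelope: examples = train images * points per image,
# and seconds per train image, for a few of the slowest completed trainings.

# (train images, points per image, training time in minutes)
sources = {
    "s372 (EfficientNet)": (35_000, 10, 7 * 60 + 45),
    "s295 (VGG16)": (57_000, 10, 14 * 60 + 54),
    "s1052 (VGG16)": (40_000, 30, 15 * 60 + 8),
}

for name, (images, pts, minutes) in sources.items():
    examples = images * pts
    sec_per_image = minutes * 60 / images
    print(f"{name}: {examples / 1e6:.2f}M examples, "
          f"{sec_per_image:.2f} s per train image")
```

This reproduces the 1.2M-example figure for s1052 (40k * 30) and shows per-image times in the 0.8-1.4 s range, consistent with the ~1 s/image ballpark.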
OK, edited to include that. Though note that the point count is just the source default, and may not necessarily be applied to all images in the source.

OK, good point. It's really the total number of verified points; I was assuming that all points in all images were verified, but images times points per image is a reasonable proxy.
Those train-images counts only include images which have all points verified (confirmed). So that part has no ambiguity at least.
I was more pointing out that the source's default number of points may have changed during the source's lifetime, or maybe image-annotations were imported which didn't adhere to that default.
(Alternatively, I could count each source's point objects directly, but that's pretty slow in the DB.)
As of the past day, I can confirm that there isn't anything about the training code that makes large jobs impossible to complete (at least, large by CoralNet standards). They just need to be allowed to run long enough, up to ~50 hours in the cases of sources 295 and 1656. No CoralNet trainings are currently failing or getting stuck due to size.
A few optimization ideas I have at the moment:
Once PR #80 or equivalent is merged, I think we'll have enough of a performance improvement to close this issue: stacking a ~6x speedup from that PR on the ~2x speedup from PR #77. Any further speedup ideas can go in new issues/PRs.
And now that PR is merged.
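For scale, if the two speedups compose multiplicatively (an assumption; the actual combined effect depends on which stages each PR optimizes), the slowest completed training would drop to roughly:

```python
# Hypothetical combined effect of stacking the ~6x (PR #80) and ~2x (PR #77)
# speedups on the slowest completed training (s1052: 15h08m).
baseline_min = 15 * 60 + 8   # 908 minutes
combined_speedup = 6 * 2     # assumes the speedups multiply
print(f"~{baseline_min / combined_speedup:.0f} minutes")  # → ~76 minutes
```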
Motivation:
Training a CoralNet classifier on 10,000s of images has a train time measured in hours. This is true for both the MLP / EfficientNet case and the LR / VGG16 case. (Edit: I originally said MLP / EfficientNet was the faster case, but it's not; see the next post.)

As we get larger sources on CoralNet, or as we start to explore super-classifiers trained on multiple sources' data, the long train times will become more of a concern: mostly for usability, but eventually also for server load.
We could encourage more folks to switch from VGG16 to EfficientNet if we improve train time and allow more frequent training in the latter case. (Since VGG16 is deprecated and seems more of a pain to work with, I assume trying to speed it up is not worth it.)
David said an idea from 2020 was to switch from logistic regression to an MLP when the training set is large (I don't know the exact threshold). I'm not sure that's the actual logic in train_utils.py et al. at the moment; it looks to me like it just uses MLP for EfficientNet and LR for VGG16, but I could be wrong. Either way, it appears to use MLP for large training jobs (assuming EfficientNet), which is the concern this issue highlights.
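If the selection works as described (MLP for EfficientNet features, LR for VGG16 features), it could be sketched as below; the function and names here are illustrative, not pyspacer's actual API:

```python
# Hypothetical sketch of the trainer-selection logic described above.
def choose_trainer(feature_extractor: str) -> str:
    """Pick a classifier type based on the feature extractor."""
    if feature_extractor == "efficientnet":
        return "MLP"   # sklearn MLPClassifier
    elif feature_extractor == "vgg16":
        return "LR"    # logistic regression (legacy path)
    raise ValueError(f"Unknown extractor: {feature_extractor}")

print(choose_trainer("efficientnet"))  # → MLP
```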
Other notes from David:
Training the MLP just uses sklearn MLPClassifier on CPU. We didn't explore training time for this, and except for learning rate, the parameters are vanilla.
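For reference, a minimal example of that kind of vanilla CPU training with sklearn's MLPClassifier; the feature dimensions, layer size, and learning rate here are illustrative stand-ins, not CoralNet's actual settings:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy stand-in for extracted features: 200 examples, 64-dim, 3 classes.
X = rng.normal(size=(200, 64))
y = rng.integers(0, 3, size=200)

clf = MLPClassifier(
    hidden_layer_sizes=(100,),   # illustrative
    learning_rate_init=1e-3,     # illustrative
    max_iter=50,                 # kept small for the toy example
    random_state=0,
)
clf.fit(X, y)
print(clf.predict(X[:5]).shape)  # (5,)
```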
Speeding up MLP training would involve:
We don't need to deliver a research paper on this; just some systematic experimentation to improve speed while maintaining the same level of accuracy, with confidence that it's delivering.