coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

Try to speed up large training jobs #53

Closed: StephenChan closed this issue 4 months ago

StephenChan commented 1 year ago

Motivation:

David said an idea from 2020 was to switch from logistic regression to an MLP when the training set is large (I don't know the exact threshold). I'm not sure if that's the actual logic in train_utils.py et al. at the moment; it looks to me like it just uses MLP if EfficientNet and LR if VGG16, but I could be wrong. Either way, it appears to use MLP for large training jobs (assuming EfficientNet), which is the concern this issue is highlighting.
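For illustration only, here is a minimal sketch of that 2020 idea as I understand it (switch estimators based on training-set size). The function name, threshold, and scikit-learn estimators are assumptions for the sketch, not the actual code in train_utils.py:

```python
# Hypothetical sketch of "use an MLP only for large training sets";
# the threshold and estimator settings are placeholders, not pyspacer's
# actual train_utils.py logic.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def choose_classifier(num_examples: int, mlp_threshold: int = 50_000):
    """Return an untrained classifier based on training-set size."""
    if num_examples >= mlp_threshold:
        return MLPClassifier(hidden_layer_sizes=(200, 100), max_iter=200)
    return LogisticRegression(max_iter=1000)
```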

Other notes from David:

StephenChan commented 1 year ago

Our slowest training times for EfficientNet sources:

  • s372: 35k train images * 10 pts = 350k examples, 7h45m
  • s3058: 20k train images * 5 pts = 100k examples, 4h56m
  • s3363: 14k train images * 50 pts = 700k examples, 4h23m
  • s3371: 14k train images * 10 pts = 140k examples, 4h15m
  • s3411: 14k train images * 50 pts = 700k examples, 3h57m

Our slowest (completed) training times for VGG16 sources, with dates included to show that some have been using CoralNet 1.0 specs:

  • s1052: 40k train images * 30 pts = 1.2M examples, 15h08m, 2020/10
  • s295: 57k train images * 10 pts = 570k examples, 14h54m, 2019/11
  • s1374: 16k train images * 30 pts = 480k examples, 9h35m, 2021/03
  • s1656: 16k train images * 50 pts = 800k examples, 9h14m, 2021/09
  • s371: 25k train images * 20 pts = 500k examples, 7h39m, 2020/04

(EDIT: Multiplied images by points to get the number of examples, although that may not be exact since the number of points is just the source's default.)

So, VGG16 sources' trainings took roughly 1 to 2 times as long as EfficientNet sources' trainings, given the same number of train images. Overall, both can get slow. We can ballpark both as 1 second per training image.
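As a back-of-the-envelope aid using the figures above (the ~1 second per training image rate is the rough ballpark, not a measured constant, and the helper below is just illustrative):

```python
# Back-of-the-envelope check using the figures quoted above; the
# ~1 second per training image rate is a rough ballpark, not measured.
def summarize(train_images: int, points_per_image: int) -> str:
    examples = train_images * points_per_image
    est_hours = train_images * 1.0 / 3600  # ~1 second per training image
    return f"{examples:,} examples, ~{est_hours:.1f} h estimated"

print(summarize(40_000, 30))  # s1052: 1,200,000 examples, ~11.1 h (actual: 15h08m)
print(summarize(35_000, 10))  # s372: 350,000 examples, ~9.7 h (actual: 7h45m)
```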

Additionally, as of https://github.com/beijbom/coralnet/pull/409 (merged into production September 2021) we were able to see trainings that started but never completed (as Classifiers stuck in PENDING status). When I checked last month (December 2022) there were stuck trainings in 3 sources: 1052, 1374, and 1656, all VGG16. I don't know if this means those stuck trainings were going even longer than 15 hours, but it seems to be a problem correlated with long training times.

kriegman commented 1 year ago

For the record, you should multiply the number of images by the points per image, since that's the number of examples used to train the classifier. The biggest one has 1.2M examples.


StephenChan commented 1 year ago

OK, edited to include that. Though note that the point count is just the source default, and may not necessarily be applied to all images in the source.

kriegman commented 1 year ago

Ok. Good point. It's really the total number of verified points. I was assuming that all points in all images were verified, but the number of images times points per image is a reasonable proxy.


StephenChan commented 1 year ago

Those train-image counts only include images which have all of their points verified (confirmed), so that part has no ambiguity at least.

I was more pointing out that the source's default number of points may have changed during the source's lifetime, or that image annotations may have been imported which didn't adhere to that default.

(Alternatively, I could count each source's point objects directly, but that's pretty slow in the DB.)
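For context, a direct count would be something like the sketch below; the app, model, and field names (annotations.models.Point, image__source_id) are assumptions about coralnet's Django schema for illustration, not necessarily the real ones:

```python
# Hypothetical Django ORM sketch of counting a source's point objects
# directly. Model/field names are illustrative assumptions; on sources
# with hundreds of thousands of points, this COUNT over a large join is
# the kind of query that gets slow in the DB.
from annotations.models import Point  # assumed app/model location

def point_count(source_id: int) -> int:
    return Point.objects.filter(image__source_id=source_id).count()
```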

StephenChan commented 8 months ago

As of the past day, I can confirm that there isn't anything about the training code that makes large jobs impossible to complete (at least, large by coralnet standards). They just need to be allowed to run long enough, up to ~50 hours in the cases of sources 295 and 1656. No coralnet trainings are currently failing or getting stuck due to their size.

A few optimization ideas I have at the moment:

StephenChan commented 8 months ago

support for large training jobs

StephenChan commented 4 months ago

Once PR #80 or equivalent is merged, I think we'll have appreciable enough performance improvements to close this issue, stacking a ~6x speedup from that and a ~2x speedup from PR #77. Any further speedup ideas can go in new issues/PRs.
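(If those two speedups compose multiplicatively, that works out to roughly a 12x combined improvement.)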

StephenChan commented 4 months ago

> Once PR https://github.com/coralnet/pyspacer/pull/80 or equivalent is merged, I think we'll have appreciable enough performance improvements to close this issue

And now that PR is merged.