coralnet / pyspacer

Python based tools for spatial image analysis
MIT License

New EfficientNet-B0 / MLP classifier deployment #28

Closed qiminchen closed 3 years ago

qiminchen commented 3 years ago

Upload new EfficientNet-B0 well-trained weights.

qiminchen commented 3 years ago

hey @beijbom, I added the MLP classifier by adding one more argument, clf_type: str  # Name of the classifier to use. I tested on both local and Docker and passed most test cases, except this one. I don't think this failure is caused by adding the MLP. Do you have any ideas?

======================================================================
FAIL: test_img_classify_bad_url (spacer.tests.test_mailman.TestProcessJobErrorHandling)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/spacer/spacer/tests/test_mailman.py", line 62, in test_img_classify_bad_url
    self.assertTrue('URLError' in return_msg.error_message)
AssertionError: False is not true

I set the default clf_type='MLP' and ran python scripts/regression/efficientnet_extractor.py efficientnet_b0_ver1 294 10 MLP /path/to/features to test both EfficientNet-B0 feature extraction and MLP classifier training; it worked pretty well. You could also test on your local machine. Changes are ready for review.
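For readers following along, the clf_type switch could be sketched roughly like this (build_classifier is a hypothetical factory name; pyspacer's actual message classes and trainer wiring may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def build_classifier(clf_type: str):
    # Hypothetical factory illustrating the new clf_type argument:
    # 'LR' keeps the existing logistic-regression path, while 'MLP'
    # selects the new multi-layer perceptron.
    if clf_type == 'LR':
        return LogisticRegression()
    if clf_type == 'MLP':
        return MLPClassifier(hidden_layer_sizes=(200, 100))
    raise ValueError(f'Unknown clf_type: {clf_type}')
```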

beijbom commented 3 years ago

Hey. Looks like some other tests fail on Travis: https://travis-ci.org/github/beijbom/pyspacer/jobs/725454742

beijbom commented 3 years ago

Re. the error. I'm not sure. Can you print the error message (return_msg.error_message)? Perhaps the formatting changed slightly so it doesn't match what I check against?
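A minimal illustration of why the substring check is fragile (format_error here is a hypothetical stand-in for however spacer builds return_msg.error_message):

```python
# The failing test asserts that 'URLError' appears as a substring of the
# stored error message, so any change to how the message is formatted --
# e.g. dropping the exception class name -- breaks the assertion.
def format_error(exc: Exception) -> str:
    # Assumed formatting: include the exception class name explicitly.
    return f'{type(exc).__name__}: {exc}'

msg = format_error(ValueError('HTTP Error 404'))
```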

qiminchen commented 3 years ago

> Re. the error. I'm not sure. Can you print the error message (return_msg.error_message)? Perhaps the formatting changed slightly so it doesn't match what I check against?

I know what the error is; working on it.

beijbom commented 3 years ago

Btw. I updated CI to use travis-ci.com. CI callbacks seem to be back up again.

qiminchen commented 3 years ago

@beijbom It should work well now

beijbom commented 3 years ago

@qiminchen : I took a look -- this looks nice in general. I ran

% python scripts/regression/train_classifier.py train 1498 /data/tmp
-> Downloading data for source id: 1498.
-> Downloading 804 metadata and feature files...
100%|██████████| 804/804 [03:08<00:00, 4.26it/s]
-> Assembling train and val data for source id: 1498
-> Training...
-> Re-trained SpatSurvey (1498). Old acc: 76.0, new acc: 72.5

which gave lower performance than the previous setup. Do you mind running a few sources and pasting the results here? I'd just like to confirm that this is an outlier.

Also: it seems we should be able to clean up the regression folder a bit. For example, there are two private methods for caching locally, one in efficientnet_extractor.py and one in train_classifier.py. I know they do slightly different things, but can you see if they can be moved to a shared utils.py?
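A shared helper could look something like this sketch (the name cache_local and the URL-based signature are assumptions; the real private methods may differ in what they cache):

```python
import os
import urllib.request

def cache_local(url: str, cache_dir: str) -> str:
    """Download url into cache_dir unless a cached copy already exists;
    return the local path. Hypothetical shared helper that both
    regression scripts could call."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path
```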

qiminchen commented 3 years ago

> @qiminchen : I took a look -- this looks nice in general. I ran
>
> % python scripts/regression/train_classifier.py train 1498 /data/tmp
> -> Downloading data for source id: 1498.
> -> Downloading 804 metadata and feature files...
> 100%|██████████| 804/804 [03:08<00:00, 4.26it/s]
> -> Assembling train and val data for source id: 1498
> -> Training...
> -> Re-trained SpatSurvey (1498). Old acc: 76.0, new acc: 72.5
>
> which gave lower performance than the previous setup. Do you mind running a few sources and pasting the results here? I'd just like to confirm that this is an outlier.

@beijbom I think I found the problem. train_classifier.py downloads the data from the spacer-trainingdata bucket, where the features were extracted by VGG16CaffeExtractor. I then trained on those features with LR and MLP respectively; here is the comparison.

LR

-> Downloading data for source id: 1498.
-> Downloading 804 metadata and feature files...
100%|██████████| 804/804 [02:12<00:00, 6.09it/s]
-> Assembling train and val data for source id: 1498
-> Training...
-> Re-trained SpatSurvey (1498). Old acc: 76.0, new acc: 71.5

MLP

-> Downloading data for source id: 1498.
-> Downloading 804 metadata and feature files...
100%|██████████| 804/804 [00:00<00:00, 195922.64it/s]
-> Assembling train and val data for source id: 1498
-> Training...
-> Re-trained SpatSurvey (1498). Old acc: 76.0, new acc: 72.6

So I think it's because the current backend s1498 classifier (with 76.0% acc) was trained on 169 images on 06/22/2019, while the one you tested was trained on 268 images. That means the extra images uploaded by the user didn't help improve the performance; otherwise the new classifier's status should be on the chart.

efficientnet_extractor.py, on the other hand, extracts new features using EfficientNetExtractor and trains the classifier using either LR or MLP. Here are the training results for both.

LR

python scripts/regression/efficientnet_extractor.py efficientnet_b0_ver1 1498 10 LR /data/tmp
-> Fetching 1498 image and annotation meta files...
-> Extracting features...
-> Downloading data for source id: 1498.
-> Downloading 268 metadata...
-> Assembling train and val data for source id: 1498
-> Training...
-> Re-trained SpatSurvey (1498). Old acc: 76.0, new acc: 78.9

MLP

python scripts/regression/efficientnet_extractor.py efficientnet_b0_ver1 1498 10 MLP /data/tmp
-> Fetching 1498 image and annotation meta files...
-> Extracting features...
-> Downloading data for source id: 1498.
-> Downloading 268 metadata...
-> Assembling train and val data for source id: 1498
-> Training...
-> Re-trained SpatSurvey (1498). Old acc: 76.0, new acc: 79.3

I will test on a few more sources that are not present in the 26 test set and paste the results here.

> Also: it seems we should be able to clean up the regression folder a bit. For example, there are two private methods for caching locally, one in efficientnet_extractor.py and one in train_classifier.py. I know they do slightly different things, but can you see if they can be moved to a shared utils.py?

Good idea, working on it.

beijbom commented 3 years ago

Thanks for looking into that @qiminchen. I agree this is probably because of the different number of images in the train data. The results on new features + MLP look nice!

beijbom commented 3 years ago

@qiminchen : is this ready for final review? I just added a minor comment -- let's try to get this merged this weekend.

qiminchen commented 3 years ago

> @qiminchen : is this ready for final review? I just added a minor comment -- let's try to get this merged this weekend.

Yes, other than the comment you just added. Let's get that one done and it's ready for review.

beijbom commented 3 years ago

Both

python train_classifier.py train 294 /data/tmp --clf_type LR
python train_classifier.py train 294 /data/tmp --clf_type MLP

fail with some error. @qiminchen : do you see the same error?

qiminchen commented 3 years ago

> Both
>
> python train_classifier.py train 294 /data/tmp --clf_type LR
> python train_classifier.py train 294 /data/tmp --clf_type MLP
>
> fail with some error. @qiminchen : do you see the same error?

Hmm, this is weird, I did not see the error. What's the error on your side? (screenshot attached)

beijbom commented 3 years ago

@qiminchen : I think my error was due to a feature file that was only partway downloaded. I cleared the cache and it works now. I took a final pass to merge the two regression scripts. I think it's a bit cleaner now -- lmk what you think. I'm running some final tests -- once complete, this is good to merge.
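As an aside, a defensive guard against exactly this kind of partially downloaded cache file might look like the following sketch (hypothetical, not part of this PR, and it assumes JSON-serialized feature files):

```python
import json
import os

def load_cached_features(path: str):
    # If a cached feature file is truncated or corrupt, drop it from the
    # cache so the next run re-downloads it instead of failing repeatedly.
    try:
        with open(path) as f:
            return json.load(f)
    except (json.JSONDecodeError, OSError):
        if os.path.exists(path):
            os.remove(path)
        return None
```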

beijbom commented 3 years ago

One question: what is the recommended default number of epochs to train the MLP in your experiments?

qiminchen commented 3 years ago

> @qiminchen : I think my error was due to a feature file that was only partway downloaded. I cleared the cache and it works now. I took a final pass to merge the two regression scripts. I think it's a bit cleaner now -- lmk what you think. I'm running some final tests -- once complete, this is good to merge.

the cleanup looks great!

qiminchen commented 3 years ago

> One question: what is the recommended default number of epochs to train the MLP in your experiments?

10. The reason I use 10 is that the reference code for training the classifier that you sent me a long time ago uses 10.

beijbom commented 3 years ago

> One question: what is the recommended default number of epochs to train the MLP in your experiments?
>
> 10. The reason I use 10 is that the reference code for training the classifier that you sent me a long time ago uses 10.

Cool. That checks out.

qiminchen commented 3 years ago

> > One question: what is the recommended default number of epochs to train the MLP in your experiments?
> >
> > 10. The reason I use 10 is that the reference code for training the classifier that you sent me a long time ago uses 10.
>
> Cool. That checks out.

Hmm, weird, I can't find the code you sent me before or the link to it, but I'm pretty sure it was 10 epochs. 5 epochs should work as well.
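If the epoch count maps onto scikit-learn's MLPClassifier, the 10-epoch default could be expressed roughly as below (a sketch under that assumption; pyspacer's actual trainer may batch and shuffle the data differently):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_mlp(x, y, classes, n_epochs: int = 10):
    # Epoch-based training loop with the default of 10 epochs discussed
    # above: one partial_fit pass over the data per epoch.
    clf = MLPClassifier(hidden_layer_sizes=(100,), random_state=0)
    for _ in range(n_epochs):
        clf.partial_fit(x, y, classes=classes)
    return clf

# Toy data standing in for extracted image features.
rng = np.random.RandomState(0)
x = rng.rand(40, 8)
y = rng.randint(0, 2, 40)
clf = train_mlp(x, y, classes=np.array([0, 1]))
```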