coralnet / pyspacer

Python based tools for spatial image analysis
MIT License

efficientnet feature extractor #15

Closed qiminchen closed 4 years ago

qiminchen commented 4 years ago

New FeatureExtractor

Utils

Unit test

Typing hints and PEP8 cleanup

qiminchen commented 4 years ago

@beijbom hey Oscar, all the unit tests pass on my local machine, but the Travis CI build failed because of a permissions issue. Do you know how to solve this?

FAIL: test_gray (spacer.tests.test_torch_utils.TestClassifyFromPatchList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/beijbom/pyspacer/spacer/tests/test_torch_utils.py", line 42, in setUp
    'efficientnetb0_5eps_best.pt')
  File "/home/travis/build/beijbom/pyspacer/spacer/storage.py", line 186, in download_model
    assert config.HAS_S3_MODEL_ACCESS, "Need access to model bucket."
AssertionError: Need access to model bucket.
======================================================================
FAIL: test_rgb (spacer.tests.test_torch_utils.TestClassifyFromPatchList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/beijbom/pyspacer/spacer/tests/test_torch_utils.py", line 42, in setUp
    'efficientnetb0_5eps_best.pt')
  File "/home/travis/build/beijbom/pyspacer/spacer/storage.py", line 186, in download_model
    assert config.HAS_S3_MODEL_ACCESS, "Need access to model bucket."
AssertionError: Need access to model bucket.
beijbom commented 4 years ago

hey @qiminchen : I had disabled S3 access for Travis due to the SQS tests. But since those are skipped now, I re-enabled it. Also, you should use the @unittest.skipUnless(config.HAS_S3_MODEL_ACCESS, 'No access to models') decorator for tests that require model access, since we can't assume all developers have that access. That will skip the test rather than fail it.
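As a sketch of that decorator pattern (using a stand-in config object here; in the real test suite the flag comes from spacer.config):

```python
import unittest

# Hypothetical stand-in for spacer.config and its HAS_S3_MODEL_ACCESS flag.
class FakeConfig:
    HAS_S3_MODEL_ACCESS = False  # True only on machines with bucket access

config = FakeConfig()

class TestNeedsModels(unittest.TestCase):
    @unittest.skipUnless(config.HAS_S3_MODEL_ACCESS, 'No access to models')
    def test_download_model(self):
        # Would download from S3 here; never runs without access.
        self.fail('unreachable when the flag is off')

# Running the case shows it is skipped rather than failed:
result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(TestNeedsModels).run(result)
```

With the flag off, the test counts as skipped, not failed, so CI stays green for developers without bucket credentials.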

A few more notes from glancing at your code:

beijbom commented 4 years ago

@qiminchen : CI pipeline permission updated.

beijbom commented 4 years ago

One more suggestion @qiminchen : can you write a script (please put it in the scripts folder) that runs this new feature extractor together with the classifier training on the exported spacer training data and confirms that it gets similar (or higher) performance than production? That would basically function as a regression test.

qiminchen commented 4 years ago

@beijbom hey Oscar, thanks for the review

  • Please re-base to master now that I merged my big PR.

Done

  • If there are utility functions in caffe_utils that you want to use, perhaps move to a shared extract_features_utils file?

So in extract_features.py, it crops the image into a patch list, calls extract_feature() from torch_utils.py, and returns ImageFeatures and ExtractFeaturesReturnMsg.

  • All imports go at the beginning of the file. The only reason I broke this rule is that caffe isn't available in all builds (pip install and clone) so I had to protect those imports.

All imports go at the beginning.

One more suggestion @qiminchen : can you write a script (please put it in the scripts folder) that runs this new feature extractor together with the classifier training on the exported spacer training data and confirms that it gets similar (or higher) performance than production? That would basically function as a regression test.

Sure, I will work on it. By the way, I need to add a unit test for EfficientNetExtractor in test_extract_features.py; more specifically, I should generate features_legacy for test_regression and upload them to S3, right?

beijbom commented 4 years ago

@qiminchen : sounds good. I added some comments, but let me know when you want me to take a close look again. It's not a bad idea to write a regression test. You can go ahead and upload any test fixtures you want to s3.

Re. the shared utils, please remove them from caffe_utils so we don't duplicate code. Also move tests. And please feel free to refactor the caffe extractor also as you are cleaning up and reorganizing the code.

qiminchen commented 4 years ago

One more suggestion @qiminchen : can you write a script (please put it in the scripts folder) that runs this new feature extractor together with the classifier training on the exported spacer training data and confirms that it gets similar (or higher) performance than production? That would basically function as a regression test.

@beijbom Oscar, can you refer me to the files used for extracting features of a source? It looks like scripts/classifier_training_regression.py downloads the features locally and trains the classifier. What about extracting the features and saving them to the temporary directory that you used for VGG16?

For TestEfficientNetExtractor, I was able to pass all the cases locally, but it fails on Travis CI. Might it be because of the libjpeg version?

Traceback (most recent call last):
  File "/home/travis/build/beijbom/pyspacer/spacer/tests/test_extract_features.py", line 295, in test_regression
    self.assertEqual(pf_new.data, pf_legacy.data)
AssertionError: Lists differ: [1.6237620115280151, -0.1788397878408432, -0.18572[27272 chars]3397] != [1.6237621307373047, -0.17883972823619843, -0.1857[27283 chars]7278]
First differing element 0:
1.6237620115280151
1.6237621307373047
beijbom commented 4 years ago

One more suggestion @qiminchen : can you write a script (please put it in the scripts folder) that runs this new feature extractor together with the classifier training on the exported spacer training data and confirms that it gets similar (or higher) performance than production? That would basically function as a regression test.

@beijbom Oscar, can you refer me to the files used for extracting features of a source? It looks like scripts/classifier_training_regression.py downloads the features locally and trains the classifier.

Yeah, I download them because we run 5 epochs, so it's faster to download once and then run locally. You don't have to do that if you don't want to.

What about extracting the features and saving them to the temporary directory that you used for VGG16?

Hmm. I'm not sure what you mean. For extracting features, just compose a msg = ExtractFeaturesMsg() and then call tasks.extract_features(msg). The beauty of the new DataLocation design is that image_loc and feature_loc can point to the same storage system, but they don't have to. You can have the task load from the S3 bucket and write to local disk, whatever you prefer. So one way is to also download the image files and then extract features once they are local. Another way is to point image_loc to S3 and feature_loc to the local folder.
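To illustrate the mixed-storage idea, here is a minimal sketch with stand-in classes (the real DataLocation and ExtractFeaturesMsg live in the spacer package; the field names, bucket, and keys here are simplified assumptions for the example):

```python
from dataclasses import dataclass

# Illustrative stand-ins only; not the real pyspacer classes.
@dataclass
class DataLocation:
    storage_type: str   # e.g. 's3', 'filesystem', 'memory'
    key: str
    bucket_name: str = ''

@dataclass
class ExtractFeaturesMsg:
    image_loc: DataLocation
    feature_loc: DataLocation

# Mixed storage: read the image from S3, write the features to local disk.
# Bucket and key names are made up for this sketch.
msg = ExtractFeaturesMsg(
    image_loc=DataLocation('s3', 'images/i123.png', bucket_name='my-bucket'),
    feature_loc=DataLocation('filesystem', '/tmp/features/i123.features.json'),
)
```

The point is that the two locations are independent: the extraction task reads from wherever image_loc points and writes to wherever feature_loc points.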

For TestEfficientNetExtractor, I was able to pass all the cases locally, but it fails on Travis CI. Might it be because of the libjpeg version?

Traceback (most recent call last):
  File "/home/travis/build/beijbom/pyspacer/spacer/tests/test_extract_features.py", line 295, in test_regression
    self.assertEqual(pf_new.data, pf_legacy.data)
AssertionError: Lists differ: [1.6237620115280151, -0.1788397878408432, -0.18572[27272 chars]3397] != [1.6237621307373047, -0.17883972823619843, -0.1857[27283 chars]7278]
First differing element 0:
1.6237620115280151
1.6237621307373047

I don't think this has to do with libjpeg. When you compare float32 vectors you can't expect identical results. Instead, use a test that is robust to float-precision errors. For example, you can borrow from what I did here: https://github.com/beijbom/pyspacer/blob/master/spacer/tests/test_beta_regression.py#L141
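A tolerance-based comparison could look like the sketch below, using the two leading values from the failing assertion above (the 1e-5 tolerance is an assumption to tune for your feature scale):

```python
import math

# First two elements from the failing assertEqual above.
pf_new = [1.6237620115280151, -0.1788397878408432]
pf_legacy = [1.6237621307373047, -0.17883972823619843]

# Compare element-wise with an absolute tolerance that absorbs float32
# round-off, instead of demanding bit-identical values.
matches = all(math.isclose(a, b, abs_tol=1e-5)
              for a, b in zip(pf_new, pf_legacy))
```

In a unittest.TestCase, self.assertAlmostEqual(a, b, places=5) per element achieves the same effect.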

beijbom commented 4 years ago

One thing to note about Travis CI. Some unit-tests that involved fetching stuff from S3 do fail intermittently. So if you see fails related to that ping me and I can restart that job on Travis. In the long term we should probably move to using local test fixtures.

qiminchen commented 4 years ago

@beijbom hey Oscar, the script that runs the new feature extractor together with the classifier training on the exported spacer training data is ready, and I tested it on s294. The performance is better than production since we're using EfficientNet-b0. The new repo is a legend!

-> Downloading data for source id: 294.
-> Downloading 201 metadata...
-> Assembling train and val data for source id: 294
-> Training...
-> Trainset: 157, valset: 18 images
-> Using 25 images per mini-batch and 7 mini-batches per epoch
-> Trainset: 20, valset: 9, common: 8 labels
-> Loading reference data.
-> Online training...
-> Epoch 0, acc: 0.8368991386496248
-> Epoch 1, acc: 0.8388441233676021
-> Epoch 2, acc: 0.8399555432064463
-> Epoch 3, acc: 0.8396776882467352
-> Epoch 4, acc: 0.8407891080855793
-> Epoch 5, acc: 0.8405112531258683
-> Epoch 6, acc: 0.8421783828841345
-> Epoch 7, acc: 0.8421783828841345
-> Epoch 8, acc: 0.8419005279244235
-> Epoch 9, acc: 0.8424562378438455
-> Calibrating.
-> Re-trained LTER Back Reef (294). Old acc: 80.0, new acc: 86.8

I tried to make the script more flexible:

import fire

def main(extractor_name, source_id, n_epochs):
    extractor = ExtractFeatures(extractor_name)
    extractor(source_id)
    classifier = TrainClassifier()
    classifier(source_id, n_epochs)

if __name__ == '__main__':
    fire.Fire(main)

Then I ran into a relative import issue and haven't found a good way to solve it, e.g.

Traceback (most recent call last):
  File "scripts/features_extracting_training.py", line 16, in <module>
    from spacer import config
ModuleNotFoundError: No module named 'spacer'

Do you have any ideas? Other than that, this PR should be ready for review.

qiminchen commented 4 years ago

@beijbom hey Oscar, here are some updates on the progress:

  • Can you go through and add Typing hints to all utils? I was a bit lazy previously, but we might as well do it.

I should have finished adding type hints to the files that needed them. Please let me know if I missed any, and feel free to add files that still need type hints and PEP8 cleanup to the checklist at the beginning.

  • Code coverage for the files in /models folder is not 100%. Can you add test to cover those lines?

Added; those lines should be covered now. Already tested locally.

  • For the scripts, I think there is some duplicated code between features_extracting_training.py and classifier_training_regression.py. Can you move the shared code to a new scripts.utils.py?

Added. I moved build_traindata() and start_training() to scripts_utils.py and cleaned up both features_extracting_training.py and classifier_training_regression.py.

The reason I didn't move _cache() to scripts_utils.py is that classifier_training_regression.py downloads all the files locally (features.json, anns.json, and image.meta.json), while features_extracting_training.py only downloads anns.json and image.meta.json, since the features are downloaded as they are extracted.

  • As per your question. Hmm. That should work fine. Can you push that change so I can see if I can replicate the problem? My guess is that you may have to add spacer to your PYTHONPATH.

Thanks for the help. I solved this issue by adding spacer to $PYTHONPATH. config.LOCAL_FEATURE_DIR is now passed via the CLI; you can use

$ python scripts/features_extracting_training.py efficientnet_b0_ver1 294 10 \
    /path/to/local/path/for/saving/features

where efficientnet_b0_ver1 is extractor_name, 294 is source_id, and 10 is the number of training epochs, to run features_extracting_training.py.
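The $PYTHONPATH fix mentioned above can be done like this (a sketch; it assumes you run from the pyspacer repository root, so that the spacer package resolves when scripts are executed directly):

```shell
# Prepend the repo root to PYTHONPATH for the current shell session,
# so `from spacer import config` works in scripts/ entry points.
export PYTHONPATH="$PWD:$PYTHONPATH"
```

For a permanent fix, the same line can go in your shell profile, or the scripts can insert the repo root into sys.path at startup.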

  • actually, it does not pass. Can you try using png?

Actually, the images I used for generating the legacy features are already PNG, and the original image for production testing is also PNG, so technically it should work. Weird... still debugging...

beijbom commented 4 years ago

@qiminchen : Great job pushing this PR through. Feel free to merge this to master when you are ready.

StephenChan commented 4 years ago

Nice work on this!