coralnet / pyspacer

Python based tools for spatial image analysis
MIT License

AssertionError when retraining s2099 #38

Open qiminchen opened 3 years ago

qiminchen commented 3 years ago

@beijbom Hi Oscar, when I tried to retrain the LR/MLP classifier using the features from the server (the ones you just exported to s3://spacer-test/coranet_1_release_debug_export1/s2099/), it raised the AssertionError below. As for re-extracting features with EfficientNet-b0, I'm still working on it, as it will take about 15 hours on my laptop.

(pyspacer) Min:pyspacermaster qiminchen$ python scripts/regression/retrain_source.py train 2099 /Users/qiminchen/Downloads/pyspacer-test 10 coranet_1_release_debug_export1 LR
Downloading 11016 metadata and image/feature files...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11016/11016 [00:00<00:00, 94177.94it/s]
Assembling data in /Users/qiminchen/Downloads/pyspacer-test/s2099/images...
Training classifier for source /Users/qiminchen/Downloads/pyspacer-test/s2099...
2021-01-08 11:44:47,468 Trainset: 3020, valset: 200 images
2021-01-08 11:44:47,469 Using 200 images per mini-batch and 16 mini-batches per epoch
2021-01-08 11:44:47,479 Trainset: 60, valset: 50, common: 50 labels
2021-01-08 11:44:47,479 Entering: loading of reference data
2021-01-08 11:44:47,615 Exiting: loading of reference data after 0.136114 seconds.
Traceback (most recent call last):
  File "scripts/regression/retrain_source.py", line 106, in <module>
    fire.Fire()
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/regression/retrain_source.py", line 69, in train
    do_training(source_root, train_labels, val_labels, n_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/scripts/regression/utils.py", line 94, in do_training
    train_labels, val_labels, n_epochs, [], feature_loc, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_classifier.py", line 50, in __call__
    clf, ref_accs = train(train_labels, feature_loc, nbr_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 62, in train
    refx, refy = load_batch_data(labels, ref_set, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 181, in load_batch_data
    x_, y_ = load_image_data(labels, imkey, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 145, in load_image_data
    assert rc_labels_set.issubset(rc_features_set)
AssertionError

It should NOT be a pyspacer issue, though: I also tried retraining some sources from the spacer-trainingdata/beta_export bucket and they all worked fine, using either the features from the server or features re-extracted with EfficientNet-b0.

To reproduce this AssertionError:

  1. Clone the up-to-date pyspacer repo.
  2. Change spacer-trainingdata to spacer-test, since s2099 was exported to that bucket: https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L24
  3. Run the following to cache the features from the bucket and retrain LR: python scripts/regression/retrain_source.py train 2099 /path/to/local 10 coranet_1_release_debug_export1 LR

Please let me know if you can reproduce the error.
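
For context on what the assertion is checking, here is a minimal, hypothetical sketch (not the actual pyspacer code; the point locations and labels below are made up): every (row, col) that carries a label must have a feature vector stored at exactly that (row, col), so any coordinate mismatch between labels and features trips it.

# Simplified sketch of the check in load_image_data; all values below are made up.
labels_for_image = [(10, 20, 'CCA'), (55, 80, 'Porites')]   # (row, col, label) triples
feature_keys = {(10, 20), (56, 81)}                         # (row, col) keys of the stored feature vectors

rc_labels_set = {(row, col) for row, col, _ in labels_for_image}
rc_features_set = set(feature_keys)
assert rc_labels_set.issubset(rc_features_set)              # AssertionError: (55, 80) has no matching feature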

kriegman commented 3 years ago

Qimin,

If a user has a source that's using the older VGG features, does coralnet train and create your MLP classifiers? Or does it still use the older LR code classifier? Could that be the problem?

Could the old classifiers be lost because the retraining treated it as a "new classifier with new features" as if the source was toggled to new features?

David

qiminchen commented 3 years ago

@kriegman good question. I actually don't know the logic behind it.

does coralnet train and create your MLP classifiers? Or does it still use the older LR code classifier? Could that be the problem? Could the old classifiers be lost because the retraining treated it as a "new classifier with new features" as if the source was toggled to new features?

So for a source that already has a classifier, a new classifier is trained when more images are added, and if the newly trained classifier's accuracy is higher than the old one's, the old one is replaced. But I'm not sure whether VGG16 or EfficientNet is used; I guess this depends on the front-end setting.

Then here is a question: when more images are added to a source that already has a classifier, will it

  1. use VGG16 to extract the new features and retrain on the whole feature set, or
  2. use EfficientNet-b0 and either (a) retrain only on the newly extracted features, since the old features come from VGG16 and the new ones from EfficientNet-b0 and so have different dimensions (see the sketch after this list), or (b) retrain the classifier on the whole feature set, which in this case would require re-extracting features for all the images?
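
On the dimensionality point in item 2, a quick hedged sketch of why the two feature sets can't simply be mixed. The sizes are assumptions on my part (VGG16 fc features are commonly 4096-d, EfficientNet-b0 features 1280-d); the point is only that a classifier fit on one size rejects the other.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Fit on (assumed) EfficientNet-b0-sized features.
clf = LogisticRegression(max_iter=200).fit(rng.normal(size=(100, 1280)), rng.integers(0, 2, 100))

try:
    clf.predict(rng.normal(size=(5, 4096)))   # (assumed) VGG16-sized features
except ValueError as err:
    print(err)                                # feature-count mismatch, so the two sets can't be combined directly
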
beijbom commented 3 years ago

Qimin:

I suspect this line is the culprit:

https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78

Can you remove that and try again?

beijbom commented 3 years ago

Quoting David's question above:

If a user has a source that's using the older VGG features, does coralnet train and create your MLP classifiers? Or does it still use the older LR code classifier?

LR: https://github.com/beijbom/coralnet/blob/master/project/config/settings/vision_backend.py#L38

qiminchen commented 3 years ago

I suspect this line is the culprit: https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78 Can you remove that and try again?

@beijbom you're right, but instead of removing the line, I changed it to (ann['row'], ann['col'], ann['label']) for ann in anns, i.e. I removed the -1 from both row and col, and it passed the assertion. I got the normal accuracy, around 75%, as the author of the source claimed.

(pyspacer) Min:pyspacermaster qiminchen$ python scripts/regression/retrain_source.py train 2099 /Users/qiminchen/Downloads/pyspacer-test 10 coranet_1_release_debug_export1 LR
Downloading 11016 metadata and image/feature files...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11016/11016 [00:00<00:00, 96639.19it/s]
Assembling data in /Users/qiminchen/Downloads/pyspacer-test/s2099/images...
Training classifier for source /Users/qiminchen/Downloads/pyspacer-test/s2099...
2021-01-08 18:00:13,024 Trainset: 3020, valset: 200 images
2021-01-08 18:00:13,024 Using 200 images per mini-batch and 16 mini-batches per epoch
2021-01-08 18:00:13,032 Trainset: 60, valset: 48, common: 48 labels
2021-01-08 18:00:13,032 Entering: loading of reference data
2021-01-08 18:00:16,864 Exiting: loading of reference data after 3.831654 seconds.
2021-01-08 18:00:16,864 Entering: training using LR
2021-01-08 18:02:04,396 Epoch 0, acc: 0.7422
2021-01-08 18:03:47,539 Epoch 1, acc: 0.7532
2021-01-08 18:05:32,405 Epoch 2, acc: 0.7562
2021-01-08 18:07:15,441 Epoch 3, acc: 0.7582
2021-01-08 18:08:56,827 Epoch 4, acc: 0.761
2021-01-08 18:10:38,644 Epoch 5, acc: 0.7618
2021-01-08 18:12:20,371 Epoch 6, acc: 0.7624
2021-01-08 18:14:01,516 Epoch 7, acc: 0.7622
2021-01-08 18:15:42,928 Epoch 8, acc: 0.7626
2021-01-08 18:17:24,107 Epoch 9, acc: 0.763
2021-01-08 18:17:24,107 Exiting: training using LR after 1027.243072 seconds.
2021-01-08 18:17:24,107 Entering: calibration
2021-01-08 18:17:24,466 Exiting: calibration after 0.358726 seconds.
Re-trained BonaireCoralReefMonitoring_2020 (2099). Old acc: 45.9, new acc: 77.2

Oscar, can you remind me why the -1 is applied to both row and col here? Is it to be consistent with 0-based indexing?
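
To make the question concrete, a tiny hypothetical illustration (made-up row/col values) of why the -1 matters if the exported features are keyed on the unshifted annotation coordinates:

anns = [{'row': 10, 'col': 20, 'label': 'CCA'}]               # made-up annotation dict
rc_features_set = {(10, 20)}                                   # made-up (row, col) keys from the export

shifted = {(ann['row'] - 1, ann['col'] - 1) for ann in anns}   # what utils.py#L78 was doing
unshifted = {(ann['row'], ann['col']) for ann in anns}         # the change described above

print(shifted.issubset(rc_features_set))     # False -> AssertionError
print(unshifted.issubset(rc_features_set))   # True  -> training proceeds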

StephenChan commented 3 years ago

In case it helps, I took a pass through all the sources where a new classifier was trained since the rollout:

I got these source IDs with the following in manage.py shell:

import datetime
from django.utils import timezone
from vision_backend.models import Classifier

# Distinct source IDs that have gotten at least one new classifier since the rollout.
Classifier.objects.filter(
    create_date__gt=datetime.datetime(2020, 12, 31, 5, 0, tzinfo=timezone.utc)
).values_list('source', flat=True).distinct()
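
If it's useful, a variant of the same query (reusing the imports above, nothing beyond standard Django aggregation; n_new is just an alias I picked) that also shows how many new classifiers each of those sources got:

from django.db.models import Count

Classifier.objects.filter(
    create_date__gt=datetime.datetime(2020, 12, 31, 5, 0, tzinfo=timezone.utc)
).values('source').annotate(n_new=Count('id')).order_by('-n_new')
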
qiminchen commented 3 years ago

Thanks @StephenChan. So s2099 is weird: the author said the old classifier had 75% accuracy, but now it presents as a new source.

beijbom commented 3 years ago

Is there any correlation between the ones with low accuracy and the ones where we fixed EXIF stuff?

StephenChan commented 3 years ago

Is there any correlation between the ones with low accuracy and the ones where we fixed EXIF stuff?

I didn't take stats on EXIF orientations across all sources, only the ones that were annotated in certain months in 2020. That source list was published in this blog post: https://coralnet.ucsd.edu/blog/annotation-tool-bug-fixes-follow-up-checking-potentially-affected-images/. The only relevant source from there is source 1646, but that only had 2 images with non-default EXIF orientations, so it seems unlikely to make a big difference.

beijbom commented 3 years ago

Got it. Yeah, this eludes me. Two things in particular: 1) how can Qimin get the expected performance when he retrains on the same features as are used in production? 2) where did the previous classifiers go?

StephenChan commented 3 years ago

Hmm, for 2), if they were doing a lot of annotation work in this source recently, maybe they realized they needed to add a label or two - which would involve a labelset change, which would involve a classifier reset. We can ask if they had to do that.

beijbom commented 3 years ago

Hmm. Yeah. It’s worth asking them.

qiminchen commented 3 years ago

For 1), you should be able to get the expected performance on your end as well:

  1. Clone this repo.
  2. Change (ann['row']-1, ann['col']-1, ann['label']) for ann in anns to (ann['row'], ann['col'], ann['label']) for ann in anns at https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78
  3. Change spacer-trainingdata to spacer-test, since s2099 was exported to that bucket: https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L24
  4. Run the following to cache the features from the bucket and retrain LR: python scripts/regression/retrain_source.py train 2099 /path/to/local 10 coranet_1_release_debug_export1 LR

StephenChan commented 3 years ago

I went ahead and inspected a DB backup from just before the rollout. Source 2099 did have 7 classifiers, the highest having 77% accuracy. The labelset had 68 labels, and now it has 69. So they did change the labelset, and that must be why the classifiers got cleared. That seems to solve mystery number 2, then.

Let me know if you want any info from this DB backup which might help with figuring out the accuracy drop.

beijbom commented 3 years ago

Regarding the first mystery, I tracked down some data, including the id for the classifier (from the UI) and the batch job id (by querying the server):

>>> jobs = BatchJob.objects.filter()
>>> for job in jobs:
...     if '17524' in job.job_token:
...         print(job)

classifier id: 17524, batch job id: 1787

I've uploaded the payloads for the training job and its results here: [link]. The job_msg is parsed by

https://github.com/beijbom/pyspacer/blob/master/spacer/mailman.py#L17

and defines the train job.

@Qimin Chen: can you dig in and 1) run a training locally based on exactly this job definition and see if you can replicate the low performance, and 2) if you can, compare this job definition with what you created when running the scripts?

(You are going to have to change the bucket_names and keys to point at the test bucket, e.g. "model_loc": {"storage_type": "s3", "key": "media/classifiers/17524.model", "bucket_name": "coralnet-production"}.)
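
A minimal sketch of how that rewrite could be done, assuming the exported job_msg is a plain JSON file; the filenames, the test-bucket name, and the key handling here are assumptions on my part, not part of the actual tooling:

import json

SRC = 'job_msg.json'                 # hypothetical: the exported training payload from the link above
DST = 'job_msg_testbucket.json'      # hypothetical output file
TEST_BUCKET = 'spacer-test'          # assumption: the test bucket used elsewhere in this thread

def repoint(node):
    # Recursively walk the payload and move every s3 storage location to the test bucket.
    if isinstance(node, dict):
        if node.get('storage_type') == 's3' and 'bucket_name' in node:
            node['bucket_name'] = TEST_BUCKET
            # node['key'] likely needs adjusting too, to match the exported
            # coranet_1_release_debug_export1/s2099/... layout.
        for value in node.values():
            repoint(value)
    elif isinstance(node, list):
        for value in node:
            repoint(value)

with open(SRC) as f:
    job_msg = json.load(f)
repoint(job_msg)
with open(DST, 'w') as f:
    json.dump(job_msg, f, indent=2)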

I think it'd be nice to understand what happened. At the same time, I'm tempted to ask the user to switch to efficientNet. I'm pretty sure that'd wipe out the issue and he should be switching anyways.

beijbom commented 3 years ago

@StephenChan @kriegman : Are you ok if we ask the user to switch to EfficientNet? We have already backed up all the (likely faulty) features data, so we don't lose reproducibility. But this way the user is unblocked and it's a double win since his backend will work even better than before.

StephenChan commented 3 years ago

That sounds reasonable to me.

qiminchen commented 3 years ago

can you dig in and 1) run a training locally based on exactly this job definition and see if you can replicate the low performance 2) if you can, compare this job definition with what you created when running the scripts.

(You are going to have to change the bucket_names and keys to the test bucket. E.g: "model_loc": {"storage_type": "s3", "key": "media/classifiers/17524.model", "bucket_name": "coralnet-production"})

@beijbom the aws_access_key_id and aws_secret_access_key you generated for me a while ago don't have permission to access the coralnet-production bucket; can you regenerate them? Btw, what is "key": "media/classifiers/17524.model" used for here if I run the training locally?

beijbom commented 3 years ago

@Qimin Chen qic003@ucsd.edu : yeah, you have to replace the bucket_name to point to the test bucket and the key field to correspond to the exported data.
