coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

Changed to compressed numpy array for storing features #33

Closed · beijbom closed this 3 years ago

beijbom commented 3 years ago

This PR writes custom .store() and .load() methods for ImageFeatures.

It reduces storage from 271k to 24k for the example below, which has 10 feature vectors from efficientnet_b0_ver1. Storage time to the local filesystem dropped from 0.01 to 0.003 seconds.

Further, this switches to np.half inside the PointFeatures class, so training on these features should also be faster. https://github.com/qiminchen/CoralNet/issues/13
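
Roughly, the idea is something like the sketch below (illustrative only, not the actual ImageFeatures implementation; the function and field names are made up):

from io import BytesIO
import numpy as np

def store_features(vectors, rowcols):
    # Stack the per-point vectors, cast to half precision, and write
    # everything as one compressed .npz archive in memory.
    buf = BytesIO()
    np.savez_compressed(
        buf,
        data=np.stack(vectors).astype(np.half),
        rowcols=np.asarray(rowcols, dtype=np.int32),
    )
    return buf.getvalue()

def load_features(blob):
    # Read the compressed archive back into arrays.
    with np.load(BytesIO(blob)) as npz:
        return npz['data'], npz['rowcols']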

Question: I could settle for np.float32 (32-bit precision) at twice the storage cost. I'm not sure which is better.
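
For scale, assuming 1280-dimensional vectors (the efficientnet_b0 feature length), the per-point cost works out as:

import numpy as np

vec16 = np.zeros(1280, dtype=np.half)     # 16-bit: 2 bytes per value
vec32 = np.zeros(1280, dtype=np.float32)  # 32-bit: 4 bytes per value
print(vec16.nbytes, vec32.nbytes)         # 2560 vs 5120 bytes per point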

To test

To test, run e.g. the code below on master and on this branch and compare the file size on disk.

# DataLocation is also provided by spacer.messages.
from spacer.messages import DataLocation, ExtractFeaturesMsg
from spacer.tasks import extract_features

msg = ExtractFeaturesMsg(
    job_token='asdas',
    feature_extractor_name='efficientnet_b0_ver1',
    rowcols=[(i, i) for i in range(10)],
    image_loc=DataLocation(
        storage_type='s3',
        bucket_name='spacer-test',
        key='08bfc10v7t.png'),
    feature_loc=DataLocation(
        storage_type='filesystem',
        key='tmp.feats'),
)
_ = extract_features(msg)
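
Any file-size check works for the comparison, e.g.:

import os
print(os.path.getsize('tmp.feats'))  # run on each branch and compare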
beijbom commented 3 years ago

NOTE 1: an earlier assessment of these changes used synthetic ImageFeatures data, which showed a full three orders of magnitude of storage improvement. With real data, it's "only" one order of magnitude.

NOTE 2: I had to relax the precision tolerances of some legacy tests when using half precision.
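
(For illustration only: float16 keeps roughly three significant decimal digits, so exact comparisons have to become approximate ones. A made-up example:)

import numpy as np

a = np.float32(0.123456)
b = np.float32(np.float16(0.123456))  # round-trip through half precision
np.testing.assert_allclose(a, b, atol=1e-3)  # passes; atol=1e-7 would fail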

beijbom commented 3 years ago

@StephenChan @qiminchen : thoughts on this PR? Preference for using float16 or float32? @qiminchen : did you get a chance to try re-training the classifiers using this setting?

qiminchen commented 3 years ago

> @StephenChan @qiminchen : thoughts on this PR? Preference for using float16 or float32? @qiminchen : did you get a chance to try re-training the classifiers using this setting?

Changes look great. Hmm, I would vote for float32: even though float16 does save memory and we don't see performance dropping, it has less precision, and we don't want some "potential" precision issue in the future. Also, most models use float32 as the default dtype.
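
(As an aside, the default-dtype point is easy to check in, e.g., PyTorch:)

import torch
print(torch.get_default_dtype())  # torch.float32
print(torch.randn(3).dtype)       # torch.float32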

I have a few meetings and presentations this week, so I will work on it today or tomorrow.

StephenChan commented 3 years ago

> thoughts on this PR? Preference for using float16 or float32?

Changes look good to me, and I got the same size results on the test code in the first post: tmp.feats was 271k with master, 24k with this PR's branch.

I think Qimin's reasoning on using float32 makes sense. And the extra 2x savings on storage doesn't seem like a big deal for CoralNet's traditional usage, at least. If I understand correctly, we're talking about 240 KB of savings for a 100-point image, when the image file itself can be 5-10 MB. If we have a dense point cloud, though, that could be another story.
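
(Back-of-envelope check, assuming 1280-dimensional features:)

points, dims = 100, 1280
saved = points * dims * (4 - 2)  # float32 minus float16, bytes per value
print(saved / 1024)              # 250.0 KB, in line with the figure above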

qiminchen commented 3 years ago

@beijbom @StephenChan here are some comparisons between this branch and master on training the classifier. The per-feature size would double with float32. The PR looks great; we can save a lot of storage and training time. There won't be any change in extraction time, though.

NOTE: each cell shows master | this PR.

Source   Accuracy (%)   Per-feature size     Training time (s)
s1498    78.4 | 79.0    ~828 kB | ~71 kB     33.14 | 5.91
s294     86.3 | 85.8    ~5.5 MB | ~470 kB    168.61 | 15.79
s1396    89.6 | 89.9    ~277 kB | ~24.5 kB   24.6 | 8.7
beijbom commented 3 years ago

Thanks, guys. I changed to float32, merged this, and released the 0.3.0 package. Also pushed a requirement bump to https://github.com/beijbom/coralnet/pull/323.