coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

Changed to compressed numpy array for storing features #33

Closed · beijbom closed this 3 years ago

beijbom commented 3 years ago

This PR writes custom .store() and .load() methods for ImageFeatures.

It reduces storage from 271k to 24k for the example below, which has 10 feature vectors from efficientnet_b0_ver1. Storage time to the local filesystem dropped from 0.01 to 0.003 seconds.

Further, this switches to np.half inside the PointFeatures class, so training on these features should also be faster. https://github.com/qiminchen/CoralNet/issues/13
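
Roughly, the idea is something like the sketch below (illustrative only, not the actual ImageFeatures implementation; the function and field names are made up):

from io import BytesIO
import numpy as np

def store_features(vectors, rowcols):
    # Stack the per-point vectors, cast to half precision, and write
    # everything as one compressed .npz archive in memory.
    buf = BytesIO()
    np.savez_compressed(
        buf,
        data=np.stack(vectors).astype(np.half),
        rowcols=np.asarray(rowcols, dtype=np.int32),
    )
    return buf.getvalue()

def load_features(blob):
    # Read the compressed archive back into arrays.
    with np.load(BytesIO(blob)) as npz:
        return npz['data'], npz['rowcols']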

Question: I could settle for np.float32 (32-bit precision) at twice the storage cost. I'm not sure which is better.
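
For scale, assuming 1280-dimensional vectors (the efficientnet_b0 feature length), the per-point cost works out as:

import numpy as np

vec16 = np.zeros(1280, dtype=np.half)     # 16-bit: 2 bytes per value
vec32 = np.zeros(1280, dtype=np.float32)  # 32-bit: 4 bytes per value
print(vec16.nbytes, vec32.nbytes)         # 2560 vs 5120 bytes per point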

To test

To test, run e.g. the code below on master and on this branch and compare the file size on disk.

# DataLocation is also provided by spacer.messages.
from spacer.messages import DataLocation, ExtractFeaturesMsg
from spacer.tasks import extract_features

msg = ExtractFeaturesMsg(
    job_token='asdas',
    feature_extractor_name='efficientnet_b0_ver1',
    rowcols=[(i, i) for i in range(10)],
    image_loc=DataLocation(
        storage_type='s3',
        bucket_name='spacer-test',
        key='08bfc10v7t.png'),
    feature_loc=DataLocation(
        storage_type='filesystem',
        key='tmp.feats'),
)
_ = extract_features(msg)
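
Any file-size check works for the comparison, e.g.:

import os
print(os.path.getsize('tmp.feats'))  # run on each branch and compare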
beijbom commented 3 years ago

NOTE 1: an earlier assessment of these changes used synthetic ImageFeatures data, which showed a full three orders of magnitude of storage improvement. With real data, it's "only" one order of magnitude.

NOTE 2: I had to relax the precision tolerances of some legacy tests when using half precision.
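
(For illustration only: float16 keeps roughly three significant decimal digits, so exact comparisons have to become approximate ones. A made-up example:)

import numpy as np

a = np.float32(0.123456)
b = np.float32(np.float16(0.123456))  # round-trip through half precision
np.testing.assert_allclose(a, b, atol=1e-3)  # passes; atol=1e-7 would fail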

beijbom commented 3 years ago

@StephenChan @qiminchen : thoughts on this PR? Preference for using float16 or float32? @qiminchen : did you get a chance to try re-training the classifiers using this setting?

qiminchen commented 3 years ago

> @StephenChan @qiminchen : thoughts on this PR? Preference for using float16 or float32? @qiminchen : did you get a chance to try re-training the classifiers using this setting?

Changes look great. Hmm, I would vote for float32: even though float16 does save memory and we don't see performance dropping, it has less precision, and we don't want some "potential" precision issue in the future. Also, most models use float32 as the default dtype.
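
(As an aside, the default-dtype point is easy to check in, e.g., PyTorch:)

import torch
print(torch.get_default_dtype())  # torch.float32
print(torch.randn(3).dtype)       # torch.float32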

I have a few meetings and presentations this week, so I will work on it today or tomorrow.

StephenChan commented 3 years ago

> thoughts on this PR? Preference for using float16 or float32?

Changes look good to me, and I got the same size results on the test code in the first post: tmp.feats was 271k with master, 24k with this PR's branch.

I think Qimin's reasoning on using float32 makes sense. And the extra 2x savings on storage doesn't seem like a big deal for CoralNet's traditional usage, at least. If I understand correctly, we're talking about 240 KB of savings for a 100-point image, when the image file itself can be 5-10 MB. If we have a dense point cloud, though, that could be another story.
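
(Back-of-envelope check, assuming 1280-dimensional features:)

points, dims = 100, 1280
saved = points * dims * (4 - 2)  # float32 minus float16, bytes per value
print(saved / 1024)              # 250.0 KB, in line with the figure above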

qiminchen commented 3 years ago

@beijbom @StephenChan here are some comparisons between this branch and master on training the classifier. The per-feature size would double with float32. The PR looks great; we can save a lot of storage and training time. There won't be any change in extraction time, though.

NOTE: each cell shows master | this PR.

Source   Accuracy (%)   Per-feature size     Training time (s)
s1498    78.4 | 79.0    ~828 kB | ~71 kB     33.14 | 5.91
s294     86.3 | 85.8    ~5.5 MB | ~470 kB    168.61 | 15.79
s1396    89.6 | 89.9    ~277 kB | ~24.5 kB   24.6 | 8.7
beijbom commented 3 years ago

Thanks, guys. I changed to float32, merged this, and released the 0.3.0 package. Also pushed a requirement bump to https://github.com/beijbom/coralnet/pull/323.