coralnet / pyspacer

Python based tools for spatial image analysis
MIT License

Training: cache feature vectors for 2nd epoch onward #80

Closed. StephenChan closed this 6 months ago

StephenChan commented 9 months ago

Cache feature vectors in the local filesystem if they were loaded from remote storage (S3 or URL).

I think this does what it's meant to, but I couldn't find a way to do it without committing OOP crimes somewhere (in data_classes.ImageFeatures.load(), in this case). A better approach would probably be to pass a Storage to TrainClassifierMsg instead of a DataLocation, and to put the temporary directory attribute on that Storage. That likely requires a much larger refactor in which the designs of Storage, DataLocation, and perhaps DataClass are reworked. I have ideas for that, but I don't think I'm up for it for the remainder of the month.

I also want to make the caching optional (but on by default), so that it can be turned off when filesystem space is a concern, as it may be for coralnet's largest sources (particularly since older sources have feature vectors about 8x larger in file size).
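For anyone skimming, here is a minimal sketch of the idea rather than the code in this PR: fetch each feature vector from remote storage once, keep a copy in a local directory, and read from that copy on later epochs. `CachingLoader`, `fake_remote_fetch`, and `count_remote_fetches` are hypothetical names made up for illustration and are not part of pyspacer. Passing `cache_dir=None` stands in for the "caching turned off" option.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


class CachingLoader:
    """Hypothetical helper (not pyspacer API): caches remote loads on disk."""

    def __init__(self, fetch: Callable[[str], bytes], cache_dir: Optional[str]):
        self.fetch = fetch  # function that performs the slow remote load
        self.cache_dir = Path(cache_dir) if cache_dir is not None else None
        self.remote_fetches = 0  # counts how often remote storage was hit

    def _cache_path(self, key: str) -> Path:
        # Hash the key so any S3 key or URL maps to a safe local filename.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def load(self, key: str) -> bytes:
        if self.cache_dir is None:
            # Caching disabled: always go to remote storage.
            self.remote_fetches += 1
            return self.fetch(key)
        path = self._cache_path(key)
        if path.exists():
            # Second epoch onward: read the local copy.
            return path.read_bytes()
        # First epoch: fetch remotely, then store a local copy.
        self.remote_fetches += 1
        data = self.fetch(key)
        path.write_bytes(data)
        return data


def fake_remote_fetch(key: str) -> bytes:
    """Stand-in for an S3 or URL download (assumption for this sketch)."""
    return f"features for {key}".encode("utf-8")


def count_remote_fetches(keys: List[str], epochs: int, use_cache: bool) -> int:
    """Loads every key once per epoch and returns the remote fetch count."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        loader = CachingLoader(fake_remote_fetch, tmp_dir if use_cache else None)
        for _ in range(epochs):
            for key in keys:
                loader.load(key)
        return loader.remote_fetches
```

With caching on, the number of remote fetches equals the number of feature vectors regardless of epoch count; with it off, it scales with the number of epochs. The temporary directory is removed when training ends, so the disk cost only lasts for the duration of one training job.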

Let me know if it'd be useful to merge this PR in the short term though.

StephenChan commented 9 months ago
| Source | Images | Points/im | Epochs | Time for epoch 1 | Time for each subsequent epoch |
| --- | --- | --- | --- | --- | --- |
| 1097 | 1686 | 10 | 3 | 96s | 3s |
| 3354 | 1600 | 50 | 10 | 178s | 8s |
| 295 | 63263 | 10 | 10 | 66m41s | 2m22s |

The code may be ugly so far, but the results sure aren't... (remember, each subsequent epoch used to be as long as epoch 1)
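To put the timings above in terms of whole training runs, here is a rough back-of-the-envelope calculation. It assumes, per the note above, that without caching every epoch takes as long as epoch 1, and that with caching every epoch after the first takes the "subsequent epoch" time. The helper `total_times` is just for illustration.

```python
# (source, epochs, epoch 1 time in seconds, subsequent epoch time in seconds)
# Values are taken from the timing table above.
RUNS = [
    (1097, 3, 96, 3),
    (3354, 10, 178, 8),
    (295, 10, 66 * 60 + 41, 2 * 60 + 22),
]


def total_times(epochs: int, first: int, later: int) -> tuple:
    """Returns (uncached_total, cached_total) in seconds for one training run."""
    # Without caching, every epoch is assumed to cost as much as epoch 1.
    uncached = epochs * first
    # With caching, only epoch 1 pays the remote loading cost.
    cached = first + (epochs - 1) * later
    return uncached, cached


if __name__ == "__main__":
    for source, epochs, first, later in RUNS:
        uncached, cached = total_times(epochs, first, later)
        print(f"source {source}: {uncached}s uncached vs {cached}s cached")
```

Under those assumptions the totals come to 288s vs 102s for source 1097, 1780s vs 250s for source 3354, and 40010s vs 5279s (about 11h07m vs 1h28m) for source 295.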

StephenChan commented 6 months ago

Feature caching PR

StephenChan commented 6 months ago

Ready for review. Recent changes:

Still want to refactor adjacent classes/methods for better OOP design, but not urgent and will leave that for another time.