Currently you need CoralNet S3 access in order to extract features at all. This PR tries to fix that.
Update:
Generalize feature extractor support by allowing any FeatureExtractor subclass instance, and any file locations instead of just from CoralNet's S3 bucket (which requires auth).
Test extractors/fixtures are still hidden away in CoralNet infrastructure for now. That's something to address in the future.
Filesystem-caching now applies to any S3/URL extractor-file locations.
SHA256-checking for extractor files is still supported, via an optional data_hashes argument in the FeatureExtractor constructor.
Rework relevant config vars. Of note, spacer is now usable without any specified config, if neither S3 nor remote-loaded extractors are needed.
Remove awscli dependency in Dockerfile, because 1) the Dockerfile commands that required awscli no longer apply, and 2) there's currently a Cython 3.x related problem with installing PyYAML, a dependency of awscli. The latest PyYAML, 6.0.1, tries to work around it by pinning Cython below 3, but it's even better to not worry about it at all.
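To illustrate the filesystem-caching and optional SHA256-checking behavior described above, here is a minimal sketch. It is not the code in this PR: load_with_cache, sha256_of_file, the fetch callable, and the cache-file naming scheme are all hypothetical names and choices made for illustration, and the real spacer implementation may differ.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_with_cache(
    location_key: str,
    fetch: Callable[[], bytes],
    cache_dir: Path,
    expected_sha256: Optional[str] = None,
) -> Path:
    """Return a local path for a remote file, downloading only on a cache miss.

    Hypothetical helper: `location_key` identifies the S3/URL location,
    and `fetch` is whatever callable actually downloads the bytes.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: name the cache file after a hash of the location key,
    # so any S3 key or URL maps to a safe filename.
    cache_name = hashlib.sha256(location_key.encode("utf-8")).hexdigest()
    cached_path = cache_dir / cache_name

    if not cached_path.exists():
        cached_path.write_bytes(fetch())

    # The hash check is optional: skip it when no expected hash is given.
    if expected_sha256 is not None:
        actual = sha256_of_file(cached_path)
        if actual != expected_sha256.lower():
            raise ValueError(
                f"SHA256 mismatch for {location_key}: "
                f"expected {expected_sha256}, got {actual}"
            )
    return cached_path
```

In this sketch the hash is checked on every load, including cache hits, so a corrupted cached file is caught as well. Whether the actual code checks at download time, load time, or both is not something this sketch claims.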
Code/design notes off the top of my head:
extract_features.py has a fair bit of code related to storage now. Maybe some of this should be factored out into storage.py later.
Maybe there should end up being a File or Blob class which is more general than DataLocation. spacer's 'pure' concept of DataLocation is to be a location of a DataClass. FeatureExtractor isn't a DataClass, as it's not just a vessel for data; it also has a method to extract features. However, FeatureExtractor isn't the only place in the code where the DataLocation concept is stretched, with the code reaching directly into the DataLocation fields rather than calling its methods. So making a cleaner design would take a few steps. In the interest of time, I put it off for the moment.