Doodleverse / segmentation_zoo

A collection of geoscientific image segmentation models
MIT License

better programmatic access to Zoo models from zenodo #15

Closed dbuscombe-usgs closed 1 year ago

dbuscombe-usgs commented 1 year ago

New zenodo releases containing Gym models intended for Zoo, for example https://zenodo.org/record/6950472, contain config files and h5 files, and a BEST_MODEL.txt. There are typically several models per release, which facilitates ensembling

Downloads can take a long time. We need a better way to access individual files programmatically, i.e. only the files we want to use

For example, in the above release, BEST_MODEL.txt contains the string sat4class_rgb_512_v1_fullmodel.h5 which could be downloaded by constructing a string URL, then using requests (or similar) to download that file and corresponding json file
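A minimal sketch of that idea, under two assumptions that are not confirmed anywhere in this thread: that the config file shares the weights file's stem (with `_fullmodel.h5` swapped for `.json`), and that individual files can be reached at a `record/<id>/files/<name>?download=1` style URL. The helper names `wanted_files` and `file_url` are hypothetical.

```python
WEIGHTS_SUFFIX = "_fullmodel.h5"  # suffix seen in the example BEST_MODEL.txt

# Assumed URL pattern for direct file downloads; not confirmed in this thread.
URL_TEMPLATE = "https://zenodo.org/record/{record_id}/files/{filename}?download=1"


def wanted_files(best_model_text):
    """Return [weights_file, config_file] named by the contents of BEST_MODEL.txt."""
    weights = best_model_text.strip()
    if not weights.endswith(WEIGHTS_SUFFIX):
        raise ValueError(f"unexpected weights file name: {weights!r}")
    # Assumption: the config shares the weights file's stem and ends in .json
    config = weights[: -len(WEIGHTS_SUFFIX)] + ".json"
    return [weights, config]


def file_url(record_id, filename):
    """Build a download URL for one file from the assumed template."""
    return URL_TEMPLATE.format(record_id=record_id, filename=filename)
```

Each URL built this way could then be passed to requests (or similar). If either assumption turns out to be wrong for a given release, only the suffix handling or the template would need to change.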

But if we only want to download certain files, what options do we have?

pip install pyzenodo3 (https://pypi.org/project/pyzenodo3/) requires tokens - is this a good direction?

pip install zenodo-get (https://github.com/dvolgyes/zenodo_get) - I think this could be a good option, but it is designed to run directly from a CLI rather than from a library import (?)

Perhaps the easiest thing to do is upload a manifest file to each zenodo release containing lists of model weights. The manifest file would be downloaded first, and any combination of the files listed in it could then be downloaded using requests (or similar)

For example https://zenodo.org/api/records/6950472 contains names and URLs and checksums of every file in the aforementioned data release. So, borrowing from zenodo-get we would access this JSON record like this

url = 'https://zenodo.org/api/records/'  # recordID is a string such as '6950472'
r = requests.get(url + recordID)

then a list of files like this

js = json.loads(r.text)
files = js['files']

the first entry in that list is

{'bucket': '70633e17-73c6-4f5e-adb1-062725e6a7c1',
 'checksum': 'md5:d495fa82a5389c322b69977de0362ed1',
 'key': 'BEST_MODEL.txt',
 'links': {'self': 'https://zenodo.org/api/files/70633e17-73c6-4f5e-adb1-062725e6a7c1/BEST_MODEL.txt'},
 'size': 33,
 'type': 'txt'}
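Given entries of this shape, picking out only the files we want could look something like the sketch below. It relies only on the `key` field shown above; the function name `select_entries` is hypothetical.

```python
def select_entries(files, wanted_keys):
    """Return the entries of a record's 'files' list whose 'key' is wanted.

    Entries come back in the order of wanted_keys; a KeyError is raised if
    any requested file name is absent from the record.
    """
    by_key = {entry["key"]: entry for entry in files}
    missing = [key for key in wanted_keys if key not in by_key]
    if missing:
        raise KeyError(f"not in record: {missing}")
    return [by_key[key] for key in wanted_keys]
```

For example, `select_entries(files, ['BEST_MODEL.txt'])` would return a one-element list holding the entry printed above, whose `links['self']` value is the URL to fetch.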

I think this is a good direction to go in for programmatic access to any individual or collection of files
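As a rough sketch of how a single file entry could be downloaded and checked against its listed checksum: this uses the standard library's `urllib` in place of requests so that it has no third-party dependency, and it assumes the `checksum` field always has the `md5:<hex>` form shown above. The function names are hypothetical, and this is not necessarily what the eventual fix does.

```python
import hashlib
import urllib.request


def md5_matches(data, checksum):
    """Check bytes against a checksum string of the form 'md5:<hexdigest>'."""
    algorithm, _, expected = checksum.partition(":")
    if algorithm != "md5":
        raise ValueError(f"unsupported checksum type: {algorithm!r}")
    return hashlib.md5(data).hexdigest() == expected


def download_entry(entry, timeout=60):
    """Download one file entry to the working directory and verify it."""
    with urllib.request.urlopen(entry["links"]["self"], timeout=timeout) as resp:
        data = resp.read()  # reads the whole file into memory
    if not md5_matches(data, entry["checksum"]):
        raise IOError(f"checksum mismatch for {entry['key']}")
    with open(entry["key"], "wb") as fh:
        fh.write(data)
    return entry["key"]
```

Because `download_entry` holds the whole file in memory, large h5 weights files would probably be better served by a streamed download that updates the hash chunk by chunk.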

dbuscombe-usgs commented 1 year ago

fixed and implemented in https://github.com/Doodleverse/segmentation_zoo/commit/32380ead0f61c58daac29d4608ce40545cffd382