switch to using `intake` catalogs for data sources

alan-turing-institute / scivision

scivision: a framework for scientific image analysis

https://sci.vision/

BSD 3-Clause "New" or "Revised" License

switch to using `intake` catalogs for data sources #37

Closed quantumjot closed 2 years ago

ots22 commented 2 years ago

data sources point to github repos that contain intake drivers
example for plankton ~(where?)~
make test catalogue with a single image (🐨) and use this in the notebook example

acocac commented 2 years ago

Example for Plankton

Note that `urlpath` should be replaced by the full path of the directory in GDrive. The following catalog consists of two entries: (i) a single image and (ii) a stack, i.e. multiple images coerced to a common image shape (e.g. 256 x 256 pixels) and concatenated:

%%writefile catalog.yaml
sources:
  plankton_single:
      description: Load a single labeled image from CEFAS zooplankton dataset
      origin: 
      driver: intake_xarray.image.ImageSource
      parameters:
        species:
          description: which species to collect
          type: str
          default: Bivalvia-Larvae
        id:
          description: which filename
          type: str
          default: Pia1.2017-10-03.1726+N00296780_hc
      args:
        urlpath: '/content/gdrive/.../{{species}}/{{id}}.tif'
        storage_options: {'anon': True}
  plankton_all:
      description: Labeled images from CEFAS zooplankton dataset
      origin: 
      driver: intake_xarray.image.ImageSource
      args:
        urlpath: '/content/gdrive/.../{species}/{id}.tif'
        storage_options: {'anon': True}
        concat_dim: [id, species]
        coerce_shape: [256, 256]
      metadata:
        shape: images_shape_all
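
To show how the catalog above might be consumed from a notebook, here is a minimal sketch. It assumes `intake` (the pre-2.0 catalog API) and `intake-xarray` are installed, that the catalog was written to `catalog.yaml` as in the cell above, and that the elided `urlpath` has been replaced by a real directory. The helper name `load_plankton_single` is hypothetical and only for illustration.

```python
def load_plankton_single(
    catalog_path="catalog.yaml",
    species="Bivalvia-Larvae",
    image_id="Pia1.2017-10-03.1726+N00296780_hc",
):
    """Open the catalog and lazily load one image from the plankton_single entry.

    Hypothetical helper: the defaults mirror the parameter defaults in the catalog.
    """
    # Imported lazily so the function can be defined without intake installed.
    import intake  # assumption: intake < 2 with intake-xarray available

    cat = intake.open_catalog(catalog_path)
    # Calling the entry fills the {{species}} and {{id}} placeholders in urlpath.
    source = cat.plankton_single(species=species, id=image_id)
    # to_dask() returns a lazily loaded xarray object for the image.
    return source.to_dask()
```

Calling `load_plankton_single()` with no arguments would use the catalog defaults; for the stacked entry, `intake.open_catalog("catalog.yaml").plankton_all.to_dask()` should work the same way, with `species` and `id` parsed from the file paths rather than passed as parameters.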

acocac commented 2 years ago

In addition to the plankton example, here are three examples from the Environmental AI Book contributors that use intake to catalogue files in different formats:

Fetching images in H5 format for wildfire analysis. The demonstrator includes a cell defining a customised intake driver.
Fetching images in GeoTIFF format for tree canopy delineation. The example uses an existing intake driver to fetch files from Zenodo repositories.
Fetching tabular data in CSV format for analysing ground sensor records. The example fetches tables from an Amazon bucket.

I hope the above examples help show how intake could benefit scivision by cataloguing and handling different data formats.

edwardchalstrey1 commented 2 years ago

@acocac can this issue be closed?

acocac commented 2 years ago

yep, let me close it.