fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Could we handle geotiff? #78

Closed rsignell-usgs closed 1 year ago

rsignell-usgs commented 3 years ago

Was chatting with @jkellndorfer and currently a lot of folks who work with geotiff use VRT to create virtual datasets (mosaics, time stacks, etc).

Could we do the same with fsspec-reference-maker, thereby allowing folks to read from large virtual collections and avoiding GDAL?

Seems that would be very cool.

martindurant commented 3 years ago

Basic TIFFs are already handled by TiffFile - so we would just need to analyse the coordinates correctly. Just!

When mosaicing, are the pieces guaranteed to form a contiguous block with no overlaps?

jkellndorfer commented 3 years ago

@martindurant: The data set is based on 1x1 degree tiles with geotiffs that are not overlapping, covering all land and ice masses, but not water areas, i.e. there are gaps from non-existing files. The VRTs do work globally though.

martindurant commented 3 years ago

gaps from non-exisiting files

This is not a problem for a zarr-based interface, missing files just become missing values like nan.

jkellndorfer commented 3 years ago

Great. So it would just (!) take a vrt2zarr tool that is just (!) adding a relevant metadata description using the existing geotiffs, right? Just of course emphasized ...

martindurant commented 3 years ago

Caveat that I don't know how VRTs work; but inasmuch as they are pointers to file (pieces) in other places, yes. The trick is to figure out the zarr key for each data chunk.

jkellndorfer commented 3 years ago

Maybe going via VRTs is more of a hurdle than helping. The basic structure is a grouping of geotiffs that all have the same organization (datatype, compression, xy-coordinate system) and only differ in their spatial extent individually, yet merge seamlessly. Take a look at http://sentinel-1-global-coherence-earthbigdata.s3-website-us-west-2.amazonaws.com how tiles are organized in the data set on s3.

martindurant commented 3 years ago

OK, so what we actually need, is to scan each file, presumably with TIFFFile, find the data portion as normal, and be sure to extract the correct coordinate information (position and time). Then we combine everything into one big aggregate dataset.

jkellndorfer commented 3 years ago

Seems to be straightforward. Caveat is the assignment of a time coordinate. There are four seasons (northern hemisphere winter, spring, summer, fall) that could be assigned a time coordinate. In each season we would have a set of dataset variables, some of which are only covered in certain parts of the globe or at certain times of the year. But I guess that could be all captured in the metadata.

martindurant commented 3 years ago

We would end up with a time coordinate that is the union of all the available times, and accessing variable/time combinations that are not covered would result in NaNs.

jkellndorfer commented 3 years ago

there are only 4 available times, given by the filenames as winter, spring, summer, fall. I guess one would pick a mid-date or just not make it a time coordinate, and leave it as ordered labeled bands? That's what is done via the VRTs now.

martindurant commented 3 years ago

As the data curator who understands the original data better than me, the choice would be yours! I would tend towards changing the original labels and assumptions as little as possible.

jkellndorfer commented 3 years ago

I would agree. Would you have any examples on how to go about this, or might you be interested in tackling this together? I think we really don't need to go via VRTs but can work from the tiffs themselves. The total number of geotiffs in the data set is 1,034,038, in ~25,000 1x1 degree tiles, for a total volume of 2.1 TB. Sure would be super cool to just open one "zarr" file and use xarray and dask magic to analyze the data.

martindurant commented 3 years ago

So the first thing we need is the byte ranges for a file using tifffile, and to see whether that reference output contains all the information we need (bandpass, time, location coordinates).
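
For illustration, a minimal sketch of pulling those pieces out of a single tile with tifffile (the local file name is just illustrative, and this is plain tag access rather than any kerchunk machinery):

import tifffile

# assumes a local copy of one tile; file name illustrative
with tifffile.TiffFile("N42W090_fall_vv_COH12.tif") as tif:
    page = tif.pages[0]
    print(page.dataoffsets[:3], page.databytecounts[:3])  # byte ranges of the data strips
    print(page.tags["ModelPixelScaleTag"].value)          # pixel spacing in degrees
    print(page.tags["ModelTiepointTag"].value)            # upper-left corner mapping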

jkellndorfer commented 3 years ago

ok. I am installing it. Here is an example file I am using: https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/N42W090/N42W090_fall_vv_COH12.tif

jkellndorfer commented 3 years ago

Seems like a good output:

jkellndorfer commented 3 years ago
tifffile N42W090_fall_vv_COH12.tif

Reading TIFF header: 0.002332 s
Reading image data:  0.013988 s
Generating report:   0.069484 s

TiffFile 'N42W090_fall_vv_COH12.tif'  899.52 KiB  geotiff

TiffPageSeries 0  1200x1200  uint8  YX  Generic  1 Pages

TiffPage 0 @8  1200x1200  uint8  minisblack lzw geotiff

TiffTag 256 ImageWidth @10 SHORT @18 1200
TiffTag 257 ImageLength @22 SHORT @30 1200
TiffTag 258 BitsPerSample @34 SHORT @42 8
TiffTag 259 Compression @46 SHORT @54 LZW
TiffTag 262 PhotometricInterpretation @58 SHORT @66 MINISBLACK
TiffTag 273 StripOffsets @70 LONG[200] @1030 (1990, 6467, 10995, 15556, 20128,
TiffTag 277 SamplesPerPixel @82 SHORT @90 1
TiffTag 278 RowsPerStrip @94 SHORT @102 6
TiffTag 279 StripByteCounts @106 LONG[200] @230 (4477, 4528, 4561, 4572, 4473,
TiffTag 284 PlanarConfiguration @118 SHORT @126 CONTIG
TiffTag 317 Predictor @130 SHORT @138 NONE
TiffTag 339 SampleFormat @142 SHORT @150 UINT
TiffTag 33550 ModelPixelScaleTag @154 DOUBLE[3] @1830 (0.00083333333, 0.0008333
TiffTag 33922 ModelTiepointTag @166 DOUBLE[6] @1854 (0.0, 0.0, 0.0, -90.0, 42.0
TiffTag 34735 GeoKeyDirectoryTag @178 SHORT[32] @1902 (1, 1, 0, 7, 1024, 0, 1,
TiffTag 34736 GeoDoubleParamsTag @190 DOUBLE[2] @1966 (298.257223563, 6378137.0
TiffTag 34737 GeoAsciiParamsTag @202 ASCII[8] @1982 WGS 84|
TiffTag 42113 GDAL_NODATA @214 ASCII[2] @222 0

StripOffsets
(1990, 6467, 10995, 15556, 20128, 24601, 28997, 33372, 37927, 42281, 46529,
 50958, 55479, 59944, 64505, 68906, 73437, 77971, 82459, 87104, 91871, 96363,
 100870, 105256, 109544, 113898, 118374, 122791, 127370, 131910, 136443,
 140998, 145696, 150448, 155111, 159657, 164259, 168891, 173718, 178453,
 183307, 188071, 192799, 197584, 202241, 206883, 211590, 216310, 221048,
 225633, 229962, 234406, 238930, 243581, 248090, 252536, 257180, 261840,
 266341, 270829, 275333, 279866, 284370, 288756, 293301, 297760, 302201,
 306732, 311280, 315925, 320654, 325313, 330012, 334624, 339191, 343863,
 348405, 352962, 357547, 362147, 366711, 371311, 375928, 380635, 385238,
 389747, 394263, 398962, 403793, 408743, 413549, 418221, 422862, 427495,
 432058, 436550, 441183, 445641, 450069, 454557, 459184, 463675, 468041,
 472380, 476880, 481366, 485863, 490363, 494807, 499332, 503928, 508408,
 512905, 517493, 522036, 526593, 531154, 535624, 540178, 544762, 549404,
 554042, 558815, 563600, 568347, 573184, 578128, 583201, 588167, 593070,
 597966, 602817, 607612, 612506, 617519, 622713, 627673, 632096, 636533,
 640934, 645497, 650114, 654936, 659542, 663928, 668422, 672848, 677331,
 681927, 686604, 691246, 695939, 700526, 705032, 709512, 714010, 718579,
 723044, 727524, 732030, 736530, 741006, 745644, 750321, 754825, 759433,
 763937, 768528, 773160, 777819, 782481, 787117, 791830, 796496, 801272,
 806015, 810704, 815397, 820134, 824916, 829401, 834007, 838816, 843476,
 847961, 852609, 857182, 861620, 866219, 870723, 875230, 879697, 884230,
 888899, 893595, 898158, 902629, 907180, 911743, 916379)

StripByteCounts
(4477, 4528, 4561, 4572, 4473, 4396, 4375, 4555, 4354, 4248, 4429, 4521, 4465,
 4561, 4401, 4531, 4534, 4488, 4645, 4767, 4492, 4507, 4386, 4288, 4354, 4476,
 4417, 4579, 4540, 4533, 4555, 4698, 4752, 4663, 4546, 4602, 4632, 4827, 4735,
 4854, 4764, 4728, 4785, 4657, 4642, 4707, 4720, 4738, 4585, 4329, 4444, 4524,
 4651, 4509, 4446, 4644, 4660, 4501, 4488, 4504, 4533, 4504, 4386, 4545, 4459,
 4441, 4531, 4548, 4645, 4729, 4659, 4699, 4612, 4567, 4672, 4542, 4557, 4585,
 4600, 4564, 4600, 4617, 4707, 4603, 4509, 4516, 4699, 4831, 4950, 4806, 4672,
 4641, 4633, 4563, 4492, 4633, 4458, 4428, 4488, 4627, 4491, 4366, 4339, 4500,
 4486, 4497, 4500, 4444, 4525, 4596, 4480, 4497, 4588, 4543, 4557, 4561, 4470,
 4554, 4584, 4642, 4638, 4773, 4785, 4747, 4837, 4944, 5073, 4966, 4903, 4896,
 4851, 4795, 4894, 5013, 5194, 4960, 4423, 4437, 4401, 4563, 4617, 4822, 4606,
 4386, 4494, 4426, 4483, 4596, 4677, 4642, 4693, 4587, 4506, 4480, 4498, 4569,
 4465, 4480, 4506, 4500, 4476, 4638, 4677, 4504, 4608, 4504, 4591, 4632, 4659,
 4662, 4636, 4713, 4666, 4776, 4743, 4689, 4693, 4737, 4782, 4485, 4606, 4809,
 4660, 4485, 4648, 4573, 4438, 4599, 4504, 4507, 4467, 4533, 4669, 4696, 4563,
 4471, 4551, 4563, 4636, 4729)

ModelPixelScaleTag
(0.00083333333, 0.00083333333, 0.0)

ModelTiepointTag
(0.0, 0.0, 0.0, -90.0, 42.0, 0.0)

GeoKeyDirectoryTag
(1, 1, 0, 7, 1024, 0, 1, 2, 1025, 0, 1, 1, 2048, 0, 1, 4326, 2049, 34737, 7, 0,
 2054, 0, 1, 9102, 2057, 34736, 1, 1, 2059, 34736, 1, 0)

GeoDoubleParamsTag
(298.257223563, 6378137.0)

GEOTIFF_METADATA
{'GTModelTypeGeoKey': <ModelType.Geographic: 2>,
 'GTRasterTypeGeoKey': <RasterPixel.IsArea: 1>,
 'GeogAngularUnitsGeoKey': <Angular.Degree: 9102>,
 'GeogCitationGeoKey': 'WGS 84',
 'GeogInvFlatteningGeoKey': 298.257223563,
 'GeogSemiMajorAxisGeoKey': 6378137.0,
 'GeographicTypeGeoKey': <GCS.WGS_84: 4326>,
 'KeyDirectoryVersion': 1,
 'KeyRevision': 1,
 'KeyRevisionMinor': 0,
 'ModelPixelScale': [0.00083333333, 0.00083333333, 0.0],
 'ModelTiepoint': [0.0, 0.0, 0.0, -90.0, 42.0, 0.0]}
martindurant commented 3 years ago

I get a structure like

{'.zattrs': '{}',
 '.zarray': '{\n "chunks": [\n  6,\n  1200\n ],\n "compressor": {\n  "id": "imagecodecs_lzw"\n },\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  1200,\n  1200\n ],\n "zarr_format": 2\n}',
 '0.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 1990, 4477],
 '1.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 6467, 4528],
 '2.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 10995, 4561],
...

200 small chunks. I see that the filename likely encodes the time and variable name, but where are the location coordinates?

jkellndorfer commented 3 years ago

ModelPixelScaleTag (0.00083333333, 0.00083333333, 0.0)

ModelTiepointTag (0.0, 0.0, 0.0, -90.0, 42.0, 0.0)

The ModelPixelScaleTag has the pixel spacing in x and y: 0.00083333333. The ModelTiepointTag has the upper left corner x coordinate: -90 (= West 90 deg) and y coordinate: 42 (= North 42 deg).

With TiffTag 256 ImageWidth @10 SHORT @18 1200 and TiffTag 257 ImageLength @22 SHORT @30 1200 we know the image size and can calculate the bounding box,

e.g. lower right is -90 + 1200 x 0.00083333333 = -89 and 42 - 1200 x 0.00083333333 = 41
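
As a tiny illustration of that arithmetic in plain Python (values taken from the tags above):

pixel_size = 0.00083333333       # ModelPixelScaleTag x/y spacing
ulx, uly = -90.0, 42.0           # ModelTiepointTag upper-left corner
width = height = 1200            # ImageWidth / ImageLength

lrx = ulx + width * pixel_size   # ~ -89.0
lry = uly - height * pixel_size  # ~ 41.0
print(ulx, uly, lrx, lry)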

jkellndorfer commented 3 years ago

But we can also easily derive the bounding box from the file name: N42W090 refers to the upper left coordinate of the tile, and each tile is exactly 1x1 degree.

martindurant commented 3 years ago

OK, so that's one thing. For this particular dataset, with the files being so small and made up of so many even smaller chunks, we may decide that addressing each of the sub-chunks is going to be prohibitively slow.

I have not yet written a numcodecs codec which does "use function x to load this whole file", but it would be trivial to do, and this case might require it.

Something like

import io
from numcodecs.abc import Codec

class WholeFileDecoder(Codec):
    def __init__(self, function_loc, kwargs):
        # import_name: some helper resolving a dotted path
        # (e.g. "tifffile.imread") to the actual callable
        self.func = import_name(function_loc)
        self.kwargs = kwargs

    def decode(self, buf, out=None):
        # hand the whole file's bytes to the loader function
        return self.func(io.BytesIO(buf), **self.kwargs)

where for TIFF you would want to wrap TiffFile().asarray(). The grib2 code does something like this.

cgohlke commented 3 years ago

Interesting. Let me know if anything could be added to tifffile to make this easier.

IIUC, it might be more efficient not to index the strips/tiles in each TIFF file. Instead, index each file as a single chunk and use a TIFF codec (e.g. imagecodecs_tiff based on libtiff).

Tifffile can parse a sequence of file names to higher dimensions using a regular expression pattern and export a fsspec reference (see the test at https://github.com/cgohlke/tifffile/blob/1c8e75bf29d591058311aee7856fc2c73dea5a83/tests/test_tifffile.py#L12744-L12779). This works with any kind of format supported by imagecodecs.

It's currently not possible to parse categories (e.g. "north"/"south", "summer"/"winter"), only numbers/indices. Also, the current implementation reads one file/chunk to determine the chunk shape and dtype but that could be changed.
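
For orientation, the rough shape of that export (file names hypothetical, and the exact arguments are in the linked test; a real run would also pass the regular-expression pattern mapping name parts to dimension indices):

import tifffile

# hypothetical subset of tiles, all sharing the same shape/dtype/compression
files = ["N42W090_fall_vv_COH12.tif", "N42W091_fall_vv_COH12.tif"]

with tifffile.TiffSequence(files) as seq:  # pattern=... would map names to indices
    with seq.aszarr() as store:            # one file is read to get chunk shape/dtype
        # write an fsspec reference JSON that points back at the original tiles
        store.write_fsspec("coh12.json", url="https://example.com/data/tiles")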

jkellndorfer commented 3 years ago

@martindurant I think you are on the right track that we treat the entire file as a chunk. When we did the processing I deliberately did not choose any blocksize tiling a la COG, as the data were so small to begin with, but I guess the gdal defaults created these very small sub-chunks, which I have to admit I was not aware of. @cgohlke thanks for chiming in with clarifications and the offer to possibly adapt tifffile! Parsing categories would be nice indeed. There are a plethora of geospatial data sets in particular that use this tile naming scheme with North/South and East/West designation of location boundaries.

martindurant commented 3 years ago

@cgohlke , I see we have been thinking along the same lines. Since you have a codec that already handles whole tiff files, it would certainly make sense to just use that.

As far as combining is concerned - finding the key for each input file - there are a number of obvious things that we should be doing. We need to know the complete set of coordinates along the concat dimension(s), possibly done in a first pass, and then, for each input chunk, determine its index along those dimensions.
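
A back-of-the-envelope version of that two-pass idea in plain Python (the file list and pattern here are hypothetical, not kerchunk code):

import re

# hypothetical file list; the pattern captures the concat dimensions
files = ["N42W090_winter_vv_COH12.tif", "N42W090_fall_vv_COH12.tif",
         "N42W089_fall_vv_COH12.tif"]
pattern = re.compile(r"(?P<lat>[NS]\d+)(?P<lon>[EW]\d+)_(?P<season>[a-z]+)_vv_COH12")

# pass 1: collect the full set of coordinate values along each dimension
# (sorted alphabetically here; a real combine would use the intended order)
coords = {"season": set(), "lat": set(), "lon": set()}
for f in files:
    groups = pattern.match(f).groupdict()
    for dim in coords:
        coords[dim].add(groups[dim])
index = {dim: {v: i for i, v in enumerate(sorted(vals))} for dim, vals in coords.items()}

# pass 2: the zarr key of each input chunk is its index along each dimension
for f in files:
    g = pattern.match(f).groupdict()
    key = f'{index["season"][g["season"]]}.{index["lat"][g["lat"]]}.{index["lon"][g["lon"]]}'
    print(f, "->", key)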

jkellndorfer commented 3 years ago

Might the gdal generated VRT file be useful here after all? Here is an example of a VRT of one of the 88 data variables: https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/Global_fall_vv_COH12.vrt

jkellndorfer commented 3 years ago

If you take a look at the top of the VRT XML, it contains info on the georeferencing, origin (upper left) and pixel spacing in the <GeoTransform> tag, and then for each source file it describes the pixel/line offset and size in the <SrcRect ...> and <DstRect ...> tags.

<VRTDataset rasterXSize="432000" rasterYSize="166800">
  <SRS dataAxisToSRSAxisMapping="2,1">GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]</SRS>
  <GeoTransform> -1.8000000000000000e+02,  8.3333333000010982e-04,  0.0000000000000000e+00,  8.1000000000000000e+01,  0.0000000000000000e+00, -8.3333333000010982e-04</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1">
    <NoDataValue>0</NoDataValue>
    <ColorInterp>Gray</ColorInterp>
    <ComplexSource>
      <SourceFilename relativeToVRT="1">N00E005/N00E005_fall_vv_COH12.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="1200" RasterYSize="1200" DataType="Byte" BlockXSize="1200" BlockYSize="6" />
      <SrcRect xOff="0" yOff="0" xSize="1200" ySize="1200" />
      <DstRect xOff="222000" yOff="97200" xSize="1200" ySize="1200" />
      <NODATA>0</NODATA>
    </ComplexSource>
martindurant commented 3 years ago

If we already have the coordinates information from some other source, that would, of course, be totally reasonable too. I didn't gather whether the VRT thing is always available, or if it also needs to be constructed with some effort.

jkellndorfer commented 3 years ago

gdalbuildvrt would have had to be called to generate the layers. I guess it's a question of whether we want the gdal dependency when generating the zarr store. gdalbuildvrt from the gdal suite just might do the work. Maybe the code they use for gdalbuildvrt has the pieces ready that you could use directly: https://github.com/TUW-GEO/OGRSpatialRef3D/blob/master/gdal-1.10.0/apps/gdalbuildvrt.cpp https://gdal.org/programs/gdalbuildvrt.html

martindurant commented 3 years ago

It would be OK to have the dependency for the sake of making the references - we don't need it to load the data later. However, if there's a simpler way... I intuit that getting the arguments to gdalbuildvrt right requires some expertise, compared to writing a little lambda function to extract numbers from the filename.

jkellndorfer commented 3 years ago

yes, that makes sense to me when all the info can be gathered easily from the filenames. gdalbuildvrt scans each geotiff for the tifftags containing the georeferencing info and as such is more generic. tifffile seems to also offer the more generic solution. What gdalbuildvrt needs is actually just the list of geotiffs that should either be mosaicked into one layer or stacked (with the -separate switch).

jkellndorfer commented 3 years ago

One piece of info we don't have from the filenames is the resolution needed for the coordinate dimensions, I guess. But that could also easily be retrieved from the corresponding tifftag in one of the GeoTIFFs, maybe via tifffile.

cgohlke commented 3 years ago

Is it possible to get a list of all file names under https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/ ? I think tifffile might be able to generate fsspec reference files from the list of names.

martindurant commented 3 years ago

This being on s3, you can do

import fsspec

s3 = fsspec.filesystem("s3", anon=True)
allfiles = s3.ls("sentinel-1-global-coherence-earthbigdata/data/tiles")

(and you could get the equivalent HTTP URL for each of these 25528 files)
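
For example, a small sketch of mapping those listing paths onto the bucket's public HTTP endpoint (endpoint taken from the example links earlier in the thread):

prefix = "https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/"
urls = [prefix + path.split("/", 1)[1] for path in allfiles]  # strip the bucket name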

cgohlke commented 3 years ago

Thank you! Searching recursively finds 1,033,422 TIFF files. I'll give it a try.

jkellndorfer commented 3 years ago

That's the right amount of geotiffs. There should be 88 metrics, organized as explained in http://sentinel-1-global-coherence-earthbigdata.s3-website-us-west-2.amazonaws.com/

martindurant commented 3 years ago

Are metrics like xarray variables?

jkellndorfer commented 3 years ago

yes. Except for the lsmap and inc metrics, the temporal component is four seasons, i.e. northern hemisphere winter, spring, summer and fall (in that order, as data were processed from 1-12-2019 to 30-11-2020).

martindurant commented 3 years ago

So those would be one variable each, but with an extra dimension. Quite complex!

jkellndorfer commented 3 years ago

Oh yes, complex indeed!

Also, for extra complexity: is there a way we can encode the data conversions into the metadata? (as in the Data Types and Scaling Conventions section in the document link I sent).

martindurant commented 3 years ago

can encode the data conversions into the metadata?

To some extent: you could apply numcodecs Categorize for the enum-like fields, and FixedScaleOffset for linear transforms. Anything more complex, and you would have to write your own codec implementing the calculation, or expect the user to take the dataset and perform any manipulations needed after opening.
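
As a small illustration of the linear-transform case with numcodecs (the offset/scale values here are made up for the example, not the dataset's actual scaling conventions):

import numpy as np
import numcodecs

# decode maps the stored uint8 back to float as value / scale + offset
codec = numcodecs.FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")

values = np.array([0.12, 0.5, 0.87], dtype="f4")
packed = codec.encode(values)    # uint8 as stored on disk
restored = codec.decode(packed)  # float32 again
print(packed, restored)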

jkellndorfer commented 3 years ago

I think the latter approach, giving the user a recipe to convert the data is probably simplest for now. That's how I have done it in the notebooks thus far.

cgohlke commented 3 years ago

Got something working with the imagecodecs tiff decoder, e.g. to view the 444,797 COHxx files in napari via fsspec and zarr:

import imagecodecs.numcodecs
import zarr
import fsspec
import napari

imagecodecs.numcodecs.register_codecs()

name = 'sentinel-1-global-coherence-earthbigdata.COH.json'

mapper = fsspec.get_mapper(
    'reference://',
    fo=f'https://www.lfd.uci.edu/~gohlke/{name}',
    target_protocol='http',
)

za = zarr.open(mapper, mode='r')

viewer = napari.view_image(za, rgb=False, contrast_limits=[0, 255], colormap='gist_earth')
viewer.layers[0].name = name
viewer.dims.current_step = (112, 62, 1, 0, 0, 0, 0)
viewer.dims.axis_labels = (
    'latitude',
    'longitude',
    'season',
    'polarization',
    'coherence',
    'y',
    'x',
)
napari.run()

(screenshot: the coherence dataset displayed in napari)

Will try mosaicing/tiling next...

martindurant commented 3 years ago

If you can figure out how to concatenate on multiple dimensions, then I'd love to see how! I have been struggling with how to get this right!

cgohlke commented 3 years ago

Here's a tiled view of SoCal. I don't think napari can handle the whole dataset but one can slice the zarr/dask array:

import imagecodecs.numcodecs
import dask.array
import zarr
import fsspec
import napari

imagecodecs.numcodecs.register_codecs()

name = 'Global_COH.json'

mapper = fsspec.get_mapper(
    'reference://',
    fo=f'https://www.lfd.uci.edu/~gohlke/{name}',
    target_protocol='http',
)

data = zarr.open(mapper, mode='r')
data = dask.array.from_zarr(data)
roi = data[:, :, :, 46 * 1200 : 50 * 1200, 58 * 1200 : 66 * 1200]

viewer = napari.view_image(
    roi,
    rgb=False,
    contrast_limits=[0, 255],
    colormap='gist_earth',
    interpolation='nearest',
    name='SoCal',
)
viewer.dims.axis_labels = (
    'season',
    'polarization',
    'coherence',
    'latitude',
    'longitude',
)
viewer.dims.current_step = (0, 0, 1, 0, 0)
napari.run()

(screenshot: the tiled SoCal view in napari)

martindurant commented 3 years ago

Oh! It's a single 3.6TB 5-dimensional array! To make this readable by xarray, we'd have to add coordinate arrays too. I believe napari should be able to load data lazily, so if you start off with a field of view that's zoomed in to just a couple of tiles, it ought to work for the whole dataset. Or else, you can use the geo/holoviews stack (but they expect xarray, not raw zarr).

The zarr keys are derived purely from the filenames?

cgohlke commented 3 years ago

To make this readable by xarray, we'd have to add coordinate arrays too

I added _ARRAY_DIMENSIONS (not tested).

I believe napari should be able to load data lazily, so if you start off with a field of view that's zoomed in to just a couple of tiles

Setting viewer.camera.zoom to display one chunk, NAPARI_OCTREE and multiscale=False did not work for me. Napari loaded chunks until the system ran out of memory.

The zarr keys are derived purely from the filenames?

File names and chunk shape.

martindurant commented 3 years ago

I can add coordinates to the references, so that they look good using xarray - and you can select regions using the coordinate labels.

I gather that the longitude of the one segment could be described as

lon(x) -> -90 + x * 0.00083333333

and similar for lat - but what would be an expression covering the whole dataset for lon/lat?

cgohlke commented 3 years ago

These are the regex patterns and categories used to parse the filenames:

LATITUDE_LABEL = 'latitude'
LATITUDE_PATTERN = rf'(?P<{LATITUDE_LABEL}>[NS]\d+)'
LATITUDE_CATEGORY = {}
i = 0
for j in range(82, -1, -1):
    LATITUDE_CATEGORY[f'N{j:-02}'] = i
    i += 1
for j in range(1, 79):
    LATITUDE_CATEGORY[f'S{j:-02}'] = i
    i += 1

LONGITUDE_LABEL = 'longitude'
LONGITUDE_PATTERN = rf'(?P<{LONGITUDE_LABEL}>[EW]\d+)'
LONGITUDE_CATEGORY = {}
i = 0
for j in range(180, 0, -1):
    LONGITUDE_CATEGORY[f'W{j:-03}'] = i
    i += 1
for j in range(180):
    LONGITUDE_CATEGORY[f'E{j:-03}'] = i
    i += 1

SEASON_LABEL = 'season'
SEASON_CATEGORY = {'winter': 0, 'spring': 1, 'summer': 2, 'fall': 3}
SEASON_PATTERN = rf'(?P<{SEASON_LABEL}>{"|".join(SEASON_CATEGORY)})'

POLARIZATION_LABEL = 'polarization'
POLARIZATION_CATEGORY = {'vv': 0, 'hh': 1, 'vh': 2, 'hv': 3}
POLARIZATION_PATTERN = (
    rf'(?P<{POLARIZATION_LABEL}>{"|".join(POLARIZATION_CATEGORY)})'
)

COHERENCE_LABEL = 'coherence'
COHERENCE_CATEGORY = {
    'COH06': 0,
    'COH12': 1,
    'COH18': 2,
    'COH24': 3,
    'COH36': 4,
    'COH48': 5,
}
COHERENCE_PATTERN = rf'(?P<{COHERENCE_LABEL}>{"|".join(COHERENCE_CATEGORY)})'

PATTERN = (
    LATITUDE_PATTERN
    + LONGITUDE_PATTERN
    + '_'
    + SEASON_PATTERN
    + '_'
    + POLARIZATION_PATTERN
    + '_'
    + COHERENCE_PATTERN
)
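
A quick sanity check of the combined pattern against one of the example file names from earlier in the thread (illustrative only):

import re

m = re.match(PATTERN, "N42W090_fall_vv_COH12")
print(m.groupdict())
# {'latitude': 'N42', 'longitude': 'W090', 'season': 'fall',
#  'polarization': 'vv', 'coherence': 'COH12'}
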
martindurant commented 3 years ago

On lat/lon, is that specifically -82..+79 :

import numpy as np

lat = np.concatenate([np.arange(1200) * 0.00083333333 + i for i in range(-82, 79)])
lon = np.concatenate([np.arange(1200) * 0.00083333333 + i for i in range(-180, 180)])

(using concat here to ensure hitting the integer degree grid)

martindurant commented 3 years ago

The following little script makes your zarr references into a full xarray dataset, which we could view via geoviews. Before writing this back into a references file, I notice that there are four polarizations listed, but the array is only of length 2 in that dimension. Not sure what happened there!

import numpy as np
import numcodecs
import ujson
import xarray as xr
import zarr

u = ujson.load(open("Global_COH.json"))

out = {f"data/{k}": v for k, v in u.items()}
g = zarr.open_group(out)

lat = np.concatenate([np.arange(1200) * 0.00083333333 + i for i in range(-82, 79)])
lon = np.concatenate([np.arange(1200) * 0.00083333333 + i for i in range(-180, 180)])

arr = g.array("longitude", data=lon, dtype="float32")
arr.attrs["_ARRAY_DIMENSIONS"] = ["longitude"]
arr = g.array("latitude", data=lat, dtype="float32")
arr.attrs["_ARRAY_DIMENSIONS"] = ["latitude"]
seasons = ['winter', 'spring', 'summer', 'fall']
arr = g.array("season", data=seasons, dtype=object, object_codec=numcodecs.VLenUTF8(), compression=None)
arr.attrs["_ARRAY_DIMENSIONS"] = ["season"]
polarization = ['vv', 'hh']
arr = g.array("polarization", data=polarization, dtype=object, object_codec=numcodecs.VLenUTF8(), compression=None)
arr.attrs["_ARRAY_DIMENSIONS"] = ["polarization"]
coherence = [6, 12, 18, 24, 36, 48]
arr = g.array("coherence", data=coherence, dtype='uint8', compression=None)
arr.attrs["_ARRAY_DIMENSIONS"] = ["coherence"]

ds = xr.open_dataset(out, engine="zarr", backend_kwargs={"consolidated": False})
ds

produces

Dimensions:       (coherence: 6, season: 4, polarization: 2, latitude: 193200, longitude: 432000)
Coordinates:
  * coherence     (coherence) float32 6.0 12.0 18.0 24.0 36.0 48.0
  * latitude      (latitude) float32 -82.0 -82.0 -82.0 -82.0 ... 79.0 79.0 79.0
  * longitude     (longitude) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * polarization  (polarization) object 'vv' 'hh'
  * season        (season) object 'winter' 'spring' 'summer' 'fall'
Data variables:
    data          (season, polarization, coherence, latitude, longitude) float32 ...
cgohlke commented 3 years ago

I notice that there are four polarizations listed, but the array is only of length 2 in that dimension. Not sure what happened there!

'vv' and 'hh' are the only polarizations found in the coherence COHxx files.