
Integration with WATCH #7

Erotemic commented 2 years ago

@SuYe99 I'm looking at how to best integrate this code with watch, and I'm trying to connect the steps you're taking in prepare_ard.py with the data we have available.

Something prepare_ard does is make a lot of assumptions about filenames. This is something that kwcoco helps to abstract away. Filename assumptions should be fine for the packed data (although I'd recommend avoiding them, as in my opinion they are usually an anti-pattern), so the goal is just to get data from a kwcoco file into something pycold will accept.

The following snippets are what I've done so far. First, here is a bit of demo code that will build a kwcoco dataset with Landsat 8 data, assuming you have the watch repo and appropriate AWS credentials to pull data from the USGS STAC catalog.

(Note, it does depend on a few tweaks and fixes I haven't landed in main yet, so let me know if you actually need to run this and I'll land those).

        DATASET_SUFFIX=DemoKHQ-2022-06-10-V2
        DEMO_DPATH=$HOME/.cache/watch/demo/datasets
        REGION_FPATH="$HOME/.cache/watch/demo/annotations/KHQ_R001.geojson"
        SITE_GLOBSTR="$HOME/.cache/watch/demo/annotations/KHQ_R001_sites/*.geojson"
        START_DATE=$(jq -r '.features[] | select(.properties.type=="region") | .properties.start_date' "$REGION_FPATH")
        END_DATE=$(jq -r '.features[] | select(.properties.type=="region") | .properties.end_date' "$REGION_FPATH")
        REGION_ID=$(jq -r '.features[] | select(.properties.type=="region") | .properties.region_id' "$REGION_FPATH")
        SEARCH_FPATH=$DEMO_DPATH/stac_search.json
        RESULT_FPATH=$DEMO_DPATH/all_sensors_kit/${REGION_ID}.input

        mkdir -p "$DEMO_DPATH"

        # Create the search json wrt the sensors and processing level we want
        python -m watch.stac.stac_search_builder \
            --start_date="$START_DATE" \
            --end_date="$END_DATE" \
            --cloud_cover=40 \
            --sensors=landsat-c2ard-bt,landsat-c2ard-sr \
            --out_fpath "$SEARCH_FPATH"
        cat "$SEARCH_FPATH"

        # Delete this to prevent duplicates
        rm -f "$RESULT_FPATH"
        # Create the .input file
        python -m watch.cli.stac_search \
            -rf "$REGION_FPATH" \
            -sj "$SEARCH_FPATH" \
            -m area \
            --verbose 2 \
            -o "${RESULT_FPATH}"

        # Construct the TA2-ready dataset
        python -m watch.cli.prepare_ta2_dataset \
            --dataset_suffix=$DATASET_SUFFIX \
            --s3_fpath "${RESULT_FPATH}" \
            --collated False \
            --dvc_dpath="$DEMO_DPATH" \
            --aws_profile=iarpa \
            --region_globstr="$REGION_FPATH" \
            --site_globstr="$SITE_GLOBSTR" \
            --fields_workers=8 \
            --convert_workers=8 \
            --align_workers=26 \
            --cache=0 \
            --ignore_duplicates=0 \
            --visualize=True \
            --requester_pays=True \
            --api_key=None \
            --serial=True --run=0

This produces a small dataset covering the Kitware HQ building, along with an associated ~/.cache/watch/demo/datasets/Aligned-DemoKHQ-2022-06-10-V2/data.kwcoco.json kwcoco file, which registers the paths to all of the images along with metadata about which sensor they came from and when they were taken.

Given this, I wrote a small script that attempts to emulate what you are doing in prepare_ard and constructs blocked arrays for a single image:

    import ubelt as ub
    import kwcoco
    import numpy as np
    fpath = ub.Path('~/.cache/watch/demo/datasets/Aligned-DemoKHQ-2022-06-10-V2/data.kwcoco.json').expand()
    dset = kwcoco.CocoDataset(fpath)

    # available channels
    # sr_qa_aerosol,qa_pixel,qa_radsat
    # 'coastal|blue|green|red|nir|swir16|swir22|lwir11|lwir12'

    for vidid in dset.videos():
        images = dset.images(vidid=vidid)

        for coco_image in images.coco_images:
            print(f'coco_image.channels={coco_image.channels}')
            assert coco_image.img['sensor_coarse'] == 'L8'
            intensity_channels = 'blue|green|red|nir|swir16|swir22|lwir11'
            quality_channels = 'qa_pixel'
            delayed_intensity = coco_image.delay(channels=intensity_channels, space='video', nodata_method='ma')
            delayed_qa = coco_image.delay(
                channels=quality_channels, space='video', nodata_method='ma', antialias=False)

            # hacks: should be available in main API
            delayed_intensity._set_nested_params(interpolation='cubic')
            delayed_qa._set_nested_params(interpolation='nearest')

            delayed_intensity = delayed_intensity.optimize()
            delayed_qa = delayed_qa.optimize()

            intensity_data = delayed_intensity.finalize()
            qa_data = delayed_qa.finalize()
            # Use the masked-array variant so the nodata masks are preserved
            im_band = np.ma.concatenate([intensity_data, qa_data], axis=2)
            im_band_data = im_band.data
            # im_band_mask = im_band.mask

            h, w, c = im_band_data.shape
            n_block_x = 20
            n_block_y = 20

            padded_w = int(np.ceil(w / n_block_x) * n_block_x)
            padded_h = int(np.ceil(h / n_block_y) * n_block_y)

            # Pad data out to fit in block sizes
            pad_x = padded_w - w
            pad_y = padded_h - h
            padded_data = np.pad(im_band_data, [(0, pad_y), (0, pad_x), (0, 0)])

            bw = int(padded_w / n_block_x)  # width of a block
            bh = int(padded_h / n_block_y)  # height of a block

            import einops
            blocks = einops.rearrange(padded_data, '(nby bh) (nbx bw) c -> nbx nby bh bw c', bw=bw, bh=bh)

Note that in a prepared kwcoco file, using CocoImage.delay(space='video') returns an object that will resample any bands (which may exist at different scales on disk) on the fly. This resampling also takes care of aligning pixels across multiple images in a sequence according to their geo-metadata.

This means I can load all of the data fairly concisely. The next step is to construct the block structure you expect. I'm using einops to express the appropriate operation. It's probably not as efficient as your implementation, but writing it this way lets us generalize and reduces the number of assumptions we make. I think we can improve the efficiency if it turns out to be a bottleneck.
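
For reference, the blocking round-trips with a single inverse rearrange, which makes a quick sanity check easy (a sketch using the blocks and padded_data variables from the snippet above):

    # Undo the blocking and verify we recover the padded input exactly
    restored = einops.rearrange(blocks, 'nbx nby bh bw c -> (nby bh) (nbx bw) c')
    assert np.all(restored == padded_data)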

It looks like I'll also need to check the QA band to determine if I should write a block or not.

What would be helpful is a precise description of the file format that you would expect. It looks like there needs to be a starting_last_dates.txt with the times for the images in the sequence, and then the images are stored in "partition subfolders" where each image has shape (NY, NX, BH, BW, C) in npy format?
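
To make sure I'm reading that right, here is a sketch of how I would save the blocked array from the snippet above under that assumed layout; the (NY, NX, ...) axis ordering, the output folder, and the file naming are my guesses, not necessarily what prepare_ard actually writes:

    # Hypothetical writer for the assumed per-image block format
    blocks_nyx = einops.rearrange(blocks, 'nbx nby bh bw c -> nby nbx bh bw c')
    out_dpath = ub.Path('~/.cache/watch/demo/pycold_blocks').expand().ensuredir()
    # One npy of shape (NY, NX, BH, BW, C) per image (naming is a guess)
    np.save(out_dpath / (coco_image.img['name'] + '.npy'), blocks_nyx)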

I'm wondering if it would be inefficient to use hdf5 here instead, or if the npy format is just superior?

Also, how are you handling data whose shape isn't neatly divided by the block size? I'm just using some padding; there is probably a more efficient way to handle that part.

SuYe99 commented 2 years ago

Yes, we need to check QA. I have written several functions to deal with QA for the HLS and Landsat datasets (they are different!). It is very common that a block (I currently set 122 x 122 pixels for HLS) contains 0 valid pixels (valid = non-cloud/shadow/snow/water pixels), so we skip saving the block to reduce storage as well as compute.

'starting_last_dates.txt' is only useful for OB-COLD (another algorithm I developed) to acquire 'global' starting and ending dates for an image tile, as different blocks may have different starting and ending dates. I don't think it is useful for the WATCH project so far.

I did once consider hdf5 for the S-CCD outputs, to replace 'pickle' for saving large dictionaries. As for the outputs of prepare_ard, I chose npy just because I am more familiar with it. I guess that hdf5 is less efficient but more compact, so it would need less storage. If data storage is a big concern for WATCH, we should definitely consider HDF5 instead of npy.

I force the users to input a block size that evenly divides the total rows and cols. See here: https://github.com/SuYe99/pycold/blob/0748cde7317be697fafb8f9567e32a45ad60ecc1/src/python/pycold/imagetool/PycoldProcessingTile.py#L213 It is absolutely not an elegant approach; I set it up that way because it worked for all the satellite images we have touched so far. Yes, padding is a better way to go.

BTW, I can't access the WATCH repo (only the pycold repo). Will that affect my testing of your kwcoco dataset?
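
For illustration, a minimal sketch of the block-skipping logic described above (the helper and variable names here are stand-ins; the actual QA decoding differs between HLS and Landsat and is left out):

    def iter_nonempty_blocks(blocks, valid_mask_blocks):
        # blocks: (NY, NX, BH, BW, C) image blocks
        # valid_mask_blocks: (NY, NX, BH, BW) boolean, True = valid
        # (non-cloud/shadow/snow/water) pixel decoded from the QA band
        n_y, n_x = blocks.shape[0:2]
        for iy in range(n_y):
            for ix in range(n_x):
                if valid_mask_blocks[iy, ix].any():
                    yield (iy, ix), blocks[iy, ix]
                # blocks with 0 valid pixels are never written to disk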

Erotemic commented 2 years ago

I sent you an invite for the group: https://gitlab.kitware.com/smart

I think npy is going to be the fastest option, so let's continue using that. I've used hdf5 in the past and have always found it to be slower, but I do agree it would be nicer than using pickle for large dictionaries, especially if they are array-like in nature.
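
For the large-dictionary case, something like this h5py pattern is what I had in mind (a minimal sketch, assuming the values are plain numpy arrays):

    import h5py
    import numpy as np

    # Store each array under its key instead of pickling the whole dict
    big_dict = {'coefs': np.random.rand(100, 8), 'rmse': np.random.rand(100)}
    with h5py.File('sccd_outputs.h5', 'w') as f:
        for key, value in big_dict.items():
            f.create_dataset(key, data=value, compression='gzip')

    # Read back a single array without loading everything else
    with h5py.File('sccd_outputs.h5', 'r') as f:
        rmse = f['rmse'][:]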

Your link isn't working for me, but it's good to know I didn't miss the padding.

One tweak that I made to kwcoco to get the above example working for me was to add better support for numpy masked arrays. It still needs more work, but I think handling them efficiently will go a long way towards building a robust algorithm. It also helps with the padding issue: instead of padding with zeros, we can pad with masked values, which tells the algorithm that there isn't any data there.
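
Concretely, the padding step in my earlier snippet could become something like this (a sketch, assuming im_band is the masked array from before):

    # Pad with masked values instead of zeros so padding is
    # distinguishable from real data
    padded_data = np.pad(im_band.data, [(0, pad_y), (0, pad_x), (0, 0)])
    padded_mask = np.pad(np.ma.getmaskarray(im_band),
                         [(0, pad_y), (0, pad_x), (0, 0)],
                         constant_values=True)
    padded = np.ma.MaskedArray(padded_data, padded_mask)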

Masked values can be treated as nans. A while ago I was working with someone to add nan support to scikit-learn RandomForests (https://github.com/scikit-learn/scikit-learn/pull/5974), but it never got merged. Having that might be helpful here, as I see you are using RF classifiers.
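
For example, handing masked data to something that understands nans is a short conversion (assuming a float dtype):

    # Replace masked entries with nan for nan-aware downstream code
    with_nans = im_band.astype(np.float32).filled(np.nan)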

Another detail for QA is that the QA band for the SMART project might be different still. It probably makes sense to abstract how QA bands are interpreted, but there will still likely be per-sensor corner cases. I think kwcoco can help abstract a lot of that away, though. We can also make a flavor of kwcoco toydata that simulates a sensor, so we can use a doctest to work on developing a kwcoco-compatible API for pycold.
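
To sketch what I mean by abstracting QA interpretation, a per-sensor registry could work. The bit positions below follow my reading of the Landsat Collection 2 QA_PIXEL spec and should be double checked, and the sensor keys and helper names are hypothetical:

    import numpy as np

    def decode_landsat_c2_qa(qa):
        # QA_PIXEL bits (to verify): 0=fill, 3=cloud, 4=shadow, 5=snow
        fill = (qa & (1 << 0)) > 0
        cloud = (qa & (1 << 3)) > 0
        shadow = (qa & (1 << 4)) > 0
        snow = (qa & (1 << 5)) > 0
        return ~(fill | cloud | shadow | snow)

    # Hypothetical registry mapping kwcoco sensor codes to QA decoders;
    # other sensors (e.g. Sentinel-2) would register their own here
    QA_DECODERS = {
        'L8': decode_landsat_c2_qa,
    }

    def valid_observation_mask(qa, sensor):
        return QA_DECODERS[sensor](qa.astype(np.uint16))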