Bug when parallelizing CatalogImage().geotiff() requests

colllin commented 5 years ago

gbdxtools 0.16.6, python 3.7.2, ubuntu 16.04

Description

I'm attempting to parallelize the downloading of a large number of small images using multiprocessing.pool.ThreadPool (multiprocessing.Pool completely freezes when using CatalogImage().geotiff()).

For example:

import os
import multiprocessing
import multiprocessing.pool  # ThreadPool lives in this submodule
import numpy as np
import gbdxtools
import matplotlib.pyplot as plt
import tifffile

os.makedirs('temp', exist_ok=True)

# with multiprocessing.Pool(processes=10) as pool:
with multiprocessing.pool.ThreadPool(processes=10) as pool:
    def download_image(bbox, path):
        catid = np.random.choice([
            '105005002BDD7B00', '1050010011197200', '1030050076F94000', '1050050021DD5600'
        ])
        try:
            gbdxtools.CatalogImage(
                catid, proj='utm', 
                bbox=bbox, from_proj='EPSG:32632', 
                acomp=True, dra=True, band_type='MS', pansharpen=True
            ).geotiff(path=path, spec='rgb')
        except Exception as exc:
            # Report failures instead of silently swallowing them
            print(f'{path}: {exc!r}')

    orig_bbox = np.array([688671.0177253198, 5742425.414653837, 689175.0177253198, 5742929.414653837])
    offsets = list(range(0,1000,100))
    bboxes = [orig_bbox + d for d in offsets]
    paths = [f'temp/offset_{d}m.tif' for d in offsets]

    pool.starmap(download_image, zip(bboxes, paths))

for path in paths:
    plt.figure()
    plt.imshow(tifffile.imread(path))
    plt.suptitle(path)

Expected behavior:

I expect to see images over the requested location from among a few overlapping catalog IDs.

Actual behavior:

Sometimes the resulting geotiff is empty (all zeros/black — see offset_300m.tif below), and sometimes it contains image contents but not from the requested bbox (see offset_600m.tif below). The behavior appears to be non-deterministic. In a quick ablation experiment, I found that:

- the issue does not occur when I change processes=10 to processes=1, which performs the requests in series
- the issue does not occur when I restrict my images to a single catalog ID
- the issue does not occur when I first read() the image, e.g. ci = CatalogImage(); ci.read(); ci.geotiff(...)

This is the output from one test run of the above code example:

colllin commented 5 years ago

Potentially we could compute() the array in to_geotiff() before writing it to disk? Does that address the root cause, though, or could Rasterio writers in parallel threads still misbehave on rare occasions? This question and proposed solution on Stack Overflow might describe an issue similar to the one reported here.
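Until that is settled, the third ablation result (calling read() before geotiff()) can be applied at the call site. Below is a minimal sketch of that workaround, reusing the CatalogImage arguments from the example in the description. The image_factory parameter is a hypothetical addition that lets the function be exercised with a stand-in object; it is not part of gbdxtools. Whether read() fixes the root cause or only masks it is not established here.

```python
def download_image(catid, bbox, path, image_factory=None):
    """Fetch one chip eagerly, then write it as a GeoTIFF."""
    if image_factory is None:
        # Imported lazily so the sketch can be run with a stand-in factory.
        import gbdxtools
        image_factory = gbdxtools.CatalogImage

    ci = image_factory(
        catid, proj='utm',
        bbox=bbox, from_proj='EPSG:32632',
        acomp=True, dra=True, band_type='MS', pansharpen=True,
    )
    # Force the pixels to be fetched before writing; in the ablation above,
    # this ordering made the empty/misplaced chips go away.
    ci.read()
    ci.geotiff(path=path, spec='rgb')
    return path
```

This function takes the catalog ID as an argument instead of choosing one at random, so the pool.starmap call would need to zip catalog IDs in alongside bboxes and paths.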

drwelby commented 5 years ago

@colllin Do the chips have to be geotiffs?

I'm pretty sure this has something to do with GDAL not really being threadsafe. Usually the workaround would be to spawn out processes that each write a tile at a time.
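A rough sketch of that process-based layout follows, under some assumptions: the worker is defined at module level so child processes can find it, each job is a (catid, bbox, path) tuple, and image_factory is a hypothetical hook for testing with a stand-in (it is not a gbdxtools feature). The description reports that multiprocessing.Pool froze with geotiff(); this sketch has not been checked against that freeze.

```python
import multiprocessing


def write_chip(job, image_factory=None):
    """Worker run in a child process: fetch and write exactly one chip."""
    catid, bbox, path = job
    if image_factory is None:
        # Imported inside the worker, so each process loads its own copy.
        import gbdxtools
        image_factory = gbdxtools.CatalogImage
    image = image_factory(
        catid, proj='utm',
        bbox=bbox, from_proj='EPSG:32632',
        acomp=True, dra=True, band_type='MS', pansharpen=True,
    )
    image.geotiff(path=path, spec='rgb')
    return path


def write_chips(jobs, processes=4):
    """Spread jobs over worker processes; each process writes one chip at a time."""
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(write_chip, jobs)
```

Call write_chips(jobs) from under an `if __name__ == '__main__':` guard so the script also works on platforms that start workers by re-importing the main module.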

I'm out until next Tuesday - let's sync up then.

colllin commented 5 years ago

No, I don't think they really need to be geotiffs. Maybe I should just read() them and write them myself? Thanks for the idea @drwelby. I'll look into it.

drwelby commented 5 years ago

I would try https://scikit-image.org/docs/dev/api/skimage.io.html#skimage.io.imsave

colllin commented 5 years ago

The tifffile package has utilities for reading, writing, and plotting any number of bands:

$ pip install tifffile

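import numpy as np
import gbdxtools
import tifffile

im = gbdxtools.CatalogImage(...)
px = im[band_idxs,...].read()            # load the selected bands into memory (bands first)
px = np.moveaxis(px, 0, -1)              # move the band axis last: (rows, cols, bands)
tifffile.imshow(px)                      # plot
tifffile.imsave('path/to/my.tif', px)    # write (newer tifffile releases call this imwrite)
back_again = tifffile.imread('path/to/my.tif')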
