DigitalSlideArchive / HistomicsStream

A whole-slide image reader for TensorFlow
Apache License 2.0

ENH: Concatenate tensorflow.Datasets more performantly #106

Closed Leengit closed 1 year ago

Leengit commented 1 year ago

In effect, the implementation being replaced concatenates a list of tensorflow.data.Dataset objects via:

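def concatenate_datasets(dataset_list):  # wrapper name is illustrative
    # Fold from the left: the accumulated dataset is nested one level
    # deeper with every element of the list.
    response = None
    for dataset in dataset_list:
        if response is None:
            response = dataset
        else:
            response = response.concatenate(dataset)
    return response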

Each tensorflow.data.Dataset.concatenate call has response, a large accumulating dataset, on the left and dataset, a single element of the input list, on the right, so the result is a left-heavy chain whose nesting depth grows linearly with the length of dataset_list. At the time of model.predict(combined_dataset), this exhausts resources and kills the process on an example from @cooperlab, apparently because reaching the first dataset of the list requires descending through all of the non-eager (lazily evaluated) concatenations.

This pull request instead recursively splits dataset_list in half, so that each tensorflow.data.Dataset.concatenate call joins two combined datasets of nearly equal size and the nesting depth grows only logarithmically with the length of the list. We chose this balanced solution over a right-heavy one because the latter could still exhaust resources if shuffling causes a dataset from near the end of dataset_list to be processed early. See the code for additional details.
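
For readers who want to see the difference in nesting depth, here is a minimal sketch of the two strategies. It is not the code from this pull request. FakeDataset is a hypothetical stand-in that mimics only the concatenate method of tensorflow.data.Dataset and records how deeply the concatenations are nested, so the example runs without TensorFlow. The function names are likewise assumptions made for illustration.

```python
class FakeDataset:
    """Hypothetical stand-in for tensorflow.data.Dataset.

    It tracks only the element order and the nesting depth of concatenations.
    """

    def __init__(self, items, depth=0):
        self.items = list(items)
        self.depth = depth  # concatenate layers above the deepest leaf

    def concatenate(self, other):
        return FakeDataset(
            self.items + other.items, 1 + max(self.depth, other.depth)
        )


def concatenate_left_heavy(dataset_list):
    """Left fold, as in the replaced approach: one extra level per element."""
    response = None
    for dataset in dataset_list:
        if response is None:
            response = dataset
        else:
            response = response.concatenate(dataset)
    return response


def concatenate_balanced(dataset_list):
    """Recursively split the list in half and concatenate the two halves."""
    if not dataset_list:
        return None
    if len(dataset_list) == 1:
        return dataset_list[0]
    mid = len(dataset_list) // 2
    left = concatenate_balanced(dataset_list[:mid])
    right = concatenate_balanced(dataset_list[mid:])
    return left.concatenate(right)


datasets = [FakeDataset([i]) for i in range(1024)]
left_heavy = concatenate_left_heavy(datasets)
balanced = concatenate_balanced(datasets)
print(left_heavy.depth, balanced.depth)  # prints "1023 10"
```

Both strategies yield the elements in the same order; only the shape of the concatenation tree differs. Because concatenate_balanced relies only on a concatenate method, the same function should also accept real tensorflow.data.Dataset objects, though that is not exercised here.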