CNES / zcollection

Python library allowing to manipulate data split into a collection of groups stored in Zarr format.
https://zcollection.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
12 stars 3 forks source link

collection.insert() does not work if cluster is not instanciated #4

Closed robin-cls closed 1 year ago

robin-cls commented 1 year ago

Hello,

While using the tutorial, I stumbled over a strange behavior.

If we replay the tutorial, but removing the cluster instanciation, the zcollection stays empty after the insert() step

Code to reproduce

from __future__ import annotations

import datetime
import pprint

import dask.distributed
import fsspec
import numpy

import zcollection
import zcollection.tests.data

def create_dataset():
    """Create a dataset to record."""
    generator = zcollection.tests.data.create_test_dataset_with_fillvalue()
    return next(generator)

ds = create_dataset()
ds.to_xarray()

fs = fsspec.filesystem('memory')

# Here, contrary to the tutorial, we do not instanciate a client
# cluster = dask.distributed.LocalCluster(processes=False)
# client = dask.distributed.Client(cluster)

partition_handler = zcollection.partitioning.Date(('time', ), resolution='M')

collection = zcollection.create_collection('time',
                                           ds,
                                           partition_handler,
                                           '/my_collection',
                                           filesystem=fs)

collection.insert(ds)
print(collection.load())
>> <Nothing>

Expected behavior

I was expected the insert() step to be effective even if we do not instanciate a cluster.

...
collection.insert(ds)
print(collection.load())
>> <zcollection.dataset.Dataset>
>>  Dimensions: ('num_lines: 61', 'num_pixels: 25')
>>Data variables:
>>    time    (num_lines  datetime64[us]: dask.array<chunksize=(11,)>
>>    var1    (num_lines, num_pixels  float64: dask.array<chunksize=(11, 25)>
>>    var2    (num_lines, num_pixels  float64: dask.array<chunksize=(11, 25)>
>>  Attributes:
>>    attr   : 1

zcollection version 2023.3.2

fbriol commented 1 year ago

This is completely normal. If you use a memory file system, you have to use only one process, otherwise the Dask workers will write to the memory of the workers, but these are invisible for the main process.