dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure and more...
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

Azure : RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'LZ4', 'UNCOMPRESSED'] #177

Open arnabbiswas1 opened 3 years ago

arnabbiswas1 commented 3 years ago

Steps to reproduce:

I created a Dask cluster inside an AzureML environment using the following code:

amlcluster = AzureMLCluster(ws,
                            vm_size="STANDARD_D1",
                            environment_definition=ws.environments['AzureML-Dask-CPU'], 
                            initial_node_count=0, 
                            scheduler_idle_timeout=10800,
                            vnet='vnet',
                            subnet='subnet',
                            vnet_resource_group='resourcegroup',
                            ct_name="biswasdask",
)

Next, open JupyterLab using the link returned by amlcluster.jupyter_link.

As I understand it, I am now on the scheduler node of the cluster.

In the Jupyter notebook, run the following code (from the azureml-examples repository):

import dask.dataframe as dd
from adlfs import AzureBlobFileSystem

container_name = "isdweatherdatacontainer"
storage_options = {"account_name": "azureopendatastorage"}

fs = AzureBlobFileSystem(**storage_options)
files = fs.glob(f"{container_name}/ISDWeather/year=2020/month=2/part-00003-tid-695161346761253622-368439cf-81e6-43f1-be5d-49ba29e282c0-2567-2.c000.snappy.parquet")
ddf = dd.read_parquet(files, storage_options=storage_options, chunksize="20MB")

ddf.head()

It returns the following error:

RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'LZ4', 'UNCOMPRESSED']

This seems to be an old issue. But since I did not create this environment manually, I don't know what the problem is.
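One way to narrow this down is to check whether the Snappy bindings are importable in the environment at all. The sketch below assumes the Parquet engine in use relies on the python-snappy package (import name `snappy`) for Snappy decompression; the thread does not state which engine the AzureML-Dask-CPU environment uses, so treat that as an assumption. The helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "snappy" is the import name of the python-snappy package (assumed to be
# what the Parquet engine needs for Snappy decompression).
print("Missing locally:", missing_modules(["snappy"]))

# With a dask.distributed Client connected to the cluster (assumed here to be
# bound to the name `client`), the same check can be run remotely:
#   client.run(missing_modules, ["snappy"])               # on every worker
#   client.run_on_scheduler(missing_modules, ["snappy"])  # on the scheduler
```

If `snappy` shows up as missing on the scheduler or on any worker, installing python-snappy into that environment is the likely fix, since the file being read is Snappy-compressed.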

arnabbiswas1 commented 3 years ago

This is an open source project, so I really can't complain. But while trying to work with dask-cloudprovider (for Azure), I am running into issue after issue at different steps. That makes me quite concerned about the basic sanity and stability of the product.

Beyond that, I see this commit in the azureml-examples repository:

"remove dask-cloudprovider givne instability and lack of support"

Given this, I am not sure whether I should continue trying to use dask_cloudprovider within an Azure ML pipeline (as part of my day job).

I would appreciate it if anyone from the dask-cloudprovider team could give a brief update on the current status of the project.

jacobtomlinson commented 3 years ago

Thanks for taking the time to raise these issues @arnabbiswas1.

Dask Cloudprovider contains cluster managers for a variety of different cloud platforms. Currently the AzureMLCluster is maintained by the AzureML team.

We are working to add a new cluster manager for Azure in #175 which will use Azure VMs directly instead of the AzureML API. The AzureML folks have indicated that they want to remove the AzureMLCluster in favour of the new, more generic AzureVMCluster.

arnabbiswas1 commented 3 years ago

Thanks for your quick and detailed reply. That helps me prioritize my work.

I will wait for the new cluster manager for Azure and then pick this back up. I am eagerly looking forward to it.

Thanks for all the great work you are doing. :love_you_gesture: