fsspec / adlfs

fsspec-compatible Azure Data Lake and Azure Blob Storage access
BSD 3-Clause "New" or "Revised" License

AzureHttpError - unable to read file from blob #48

Closed: raybellwaves closed this issue 4 years ago

raybellwaves commented 4 years ago

Following up from a SO Q here: https://stackoverflow.com/questions/61220615/dask-read-parquet-from-azure-blob-azurehttperror/61229497#61229497

Unfortunately, I'm still getting an AzureHttpError, and it's persistent for me. Not sure if anyone here has encountered this?

hayesgb commented 4 years ago

This is the first report I've gotten of this error. I've noted mdurant's suggestion, which seems like the most likely explanation. Can I assume your filepath is formatted as: "abfs://{filesystem_name}/file.parquet"? Also, which Azure region are you in?
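
For clarity, a minimal sketch of the call shape I mean (the container name, account name, and key below are placeholders, not values from this issue):

import dask.dataframe as dd

storage_options = {'account_name': '<ACCOUNT_NAME>', 'account_key': '<ACCOUNT_KEY>'}
df = dd.read_parquet('abfs://{filesystem_name}/file.parquet',
                     storage_options=storage_options)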

raybellwaves commented 4 years ago

I was actually doing abfs://{filesystem_name}/file but I updated my code (and the SO Q's) to be abfs://{filesystem_name}/file.parquet. However, I still get the AzureHttpError.

I'm in East US 2.

hayesgb commented 4 years ago

For reference, I'm working in East US 2 daily without issue, so I would assume it's not an availability problem. Can you answer a few other questions?

raybellwaves commented 4 years ago

Thanks for the prompt for an MCVE.

  • What package versions are you running? (adlfs, fsspec, dask, and azure-storage-blob).

Windows 10: adlfs==0.2.0, fsspec==0.6.2, dask==2.10.1, azure-storage-blob==2.1.0

Further details below:

> conda list: _anaconda_depends 2019.03 py37_0 _ipyw_jlab_nb_ext_conf 0.1.0 py37_0 adal 1.2.2 pypi_0 pypi adlfs 0.2.0 pypi_0 pypi alabaster 0.7.12 py37_0 alembic 1.4.0 py_0 conda-forge anaconda custom py37_1 anaconda-client 1.7.2 py37_0 anaconda-navigator 1.9.7 py37_0 anaconda-project 0.8.4 py_0 appdirs 1.4.3 pypi_0 pypi argh 0.26.2 py37_0 arrow-cpp 0.13.0 py37h49ee12d_0 asn1crypto 1.3.0 py37_0 astroid 2.3.3 py37_0 astropy 4.0 py37he774522_0 atomicwrites 1.3.0 py37_1 attrs 19.3.0 py_0 autopep8 1.4.4 py_0 azure-common 1.1.25 pypi_0 pypi azure-core 1.2.2 pypi_0 pypi azure-datalake-store 0.0.48 pypi_0 pypi azure-storage-blob 2.1.0 pypi_0 pypi azure-storage-common 2.1.0 pypi_0 pypi babel 2.8.0 py_0 backcall 0.1.0 py37_0 backports 1.0 py_2 backports.functools_lru_cache 1.6.1 py_0 backports.os 0.1.1 py37_0 backports.shutil_get_terminal_size 1.0.0 py37_2 backports.tempfile 1.0 py_1 backports.weakref 1.0.post1 py_1 bcrypt 3.1.7 py37he774522_0 beautifulsoup4 4.8.2 py37_0 bitarray 1.2.1 py37he774522_0 bkcharts 0.2 py37_0 black 19.10b0 pypi_0 pypi blackcellmagic 0.0.2 pypi_0 pypi blas 1.0 mkl bleach 3.1.0 py37_0 blosc 1.16.3 h7bd577a_0 bokeh 1.4.0 py37_0 boost-cpp 1.67.0 hfa6e2cd_4 boto 2.49.0 py37_0 bottleneck 1.3.1 py37h8c2d366_0 brotli 1.0.7 h33f27b4_0 bzip2 1.0.8 he774522_0 ca-certificates 2020.4.5.1 hecc5488_0 conda-forge certifi 2020.4.5.1 py37hc8dfbb8_0 conda-forge cffi 1.14.0 py37h7a1dbc1_0 chardet 3.0.4 py37_1003 click 7.0 py37_0 cloudpickle 1.3.0 py_0 clyent 1.2.2 py37_1 colorama 0.4.3 py_0 colorcet 2.0.2 py_0 comtypes 1.1.7 py37_0 conda 4.8.3 py37hc8dfbb8_1 conda-forge conda-build 3.18.11 py37_0 conda-env 2.6.0 1 conda-package-handling 1.6.0 py37h62dcd97_0 conda-verify 3.4.2 py_1 configparser 3.7.3 py37_1 conda-forge console_shortcut 0.1.1 3 contextlib2 0.6.0.post1 py_0 cryptography 2.8 py37h7a1dbc1_0 cudatoolkit 10.1.243 h74a9793_0 curl 7.68.0 h2a8f88b_0 cx-oracle 7.3.0 pypi_0 pypi cycler 0.10.0 py37_0 cymem 2.0.2 py37h74a9793_0 cython 0.29.15 py37ha925a31_0 cython-blis 0.2.4 py37hfa6e2cd_1 fastai cytoolz 0.10.1 py37he774522_0 dask 2.10.1 py_0 dask-core 2.10.1 py_0 databricks-cli 0.9.1 py_0 conda-forge dataclasses 0.6 py_0 fastai decorator 4.4.1 py_0 defusedxml 0.6.0 py_0 diff-match-patch 20181111 py_0 distributed 2.10.0 py_0 doc8 0.8.0 pypi_0 pypi docker-py 4.1.0 py37_0 conda-forge docker-pycreds 0.4.0 py_0 conda-forge docutils 0.16 py37_0 double-conversion 3.1.5 ha925a31_1 entrypoints 0.3 py37_0 et_xmlfile 1.0.1 py37_0 fastai 1.0.60 1 fastai fastcache 1.1.0 py37he774522_0 fastparquet 0.3.3 py37hc8d92b1_0 conda-forge fastprogress 0.2.2 py_0 fastai filelock 3.0.12 py_0 flake8 3.7.9 py37_0 flask 1.1.1 py_0 freetype 2.9.1 ha9979f8_1 fsspec 0.6.2 py_0 future 0.18.2 py37_0 get_terminal_size 1.0.0 h38e98db_0 gevent 1.4.0 py37he774522_0 gflags 2.2.2 ha925a31_0 gitdb2 3.0.2 py_0 conda-forge gitpython 3.0.5 py_0 conda-forge glob2 0.7 py_0 glog 0.4.0 h33f27b4_0 gorilla 0.3.0 py_0 conda-forge greenlet 0.4.15 py37hfa6e2cd_0 h5py 2.10.0 py37h5e291fa_0 hdf5 1.10.4 h7ebc959_0 heapdict 1.0.1 py_0 holoviews 1.12.7 py_0 html5lib 1.0.1 py37_0 hvplot 0.5.2 py_0 conda-forge hypothesis 5.4.1 py_0 icc_rt 2019.0.0 h0cc432a_1 icu 58.2 ha66f8fd_1 idna 2.8 py37_0 imageio 2.6.1 py37_0 imagesize 1.2.0 py_0 importlib_metadata 1.5.0 py37_0 intel-openmp 2020.0 166 intervaltree 3.0.2 py_0 ipykernel 5.1.4 py37h39e3cac_0 ipython 7.12.0 py37h5ca1d4c_0 ipython_genutils 0.2.0 py37_0 ipywidgets 7.5.1 py_0 isodate 0.6.0 pypi_0 pypi isort 4.3.21 py37_0 itsdangerous 1.1.0 py37_0 jdcal 1.4.1 py_0 jedi 0.14.1 py37_0 jinja2 2.11.1 
py_0 joblib 0.14.1 py_0 jpeg 9b hb83a4c4_2 json5 0.9.1 py_0 jsonschema 3.2.0 py37_0 jupyter 1.0.0 py37_7 jupyter_client 5.3.4 py37_0 jupyter_console 6.1.0 py_0 jupyter_core 4.6.1 py37_0 jupyterlab 1.2.6 pyhf63ae98_0 jupyterlab_server 1.0.6 py_0 keyring 21.1.0 py37_0 kiwisolver 1.1.0 py37ha925a31_0 krb5 1.17.1 hc04afaa_0 lazy-object-proxy 1.4.3 py37he774522_0 libarchive 3.3.3 h0643e63_5 libboost 1.67.0 hfd51bdf_4 libcurl 7.68.0 h2a8f88b_0 libiconv 1.15 h1df5818_7 liblief 0.9.0 ha925a31_2 libpng 1.6.37 h2a8f88b_0 libprotobuf 3.6.0 h1a1b453_0 libsodium 1.0.16 h9d3ae62_0 libspatialindex 1.9.3 h33f27b4_0 libssh2 1.8.2 h7a1dbc1_0 libtiff 4.1.0 h56a325e_0 libxml2 2.9.9 h464c3ec_0 libxslt 1.1.33 h579f668_0 llvmlite 0.31.0 py37ha925a31_0 locket 0.2.0 py37_1 lxml 4.5.0 py37h1350720_0 lz4-c 1.8.1.2 h2fa13f4_0 lzo 2.10 h6df0209_2 m2w64-gcc-libgfortran 5.3.0 6 m2w64-gcc-libs 5.3.0 7 m2w64-gcc-libs-core 5.3.0 7 m2w64-gmp 6.1.0 2 m2w64-libwinpthread-git 5.0.0.4634.697f757 2 mako 1.1.0 py_0 conda-forge markupsafe 1.1.1 py37he774522_0 matplotlib 3.1.3 py37_0 matplotlib-base 3.1.3 py37h64f37c6_0 mccabe 0.6.1 py37_1 menuinst 1.4.16 py37he774522_0 mistune 0.8.4 py37he774522_0 mkl 2020.0 166 mkl-service 2.3.0 py37hb782905_0 mkl_fft 1.0.15 py37h14836fe_0 mkl_random 1.1.0 py37h675688f_0 mlflow 1.6.0 pypi_0 pypi mock 4.0.1 py_0 more-itertools 8.2.0 py_0 mpmath 1.1.0 py37_0 msgpack-python 0.6.1 py37h74a9793_1 msrest 0.6.11 pypi_0 pypi msys2-conda-epoch 20160418 1 multipledispatch 0.6.0 py37_0 murmurhash 1.0.2 py37h33f27b4_0 navigator-updater 0.2.1 py37_0 nbconvert 5.6.1 py37_0 nbformat 5.0.4 py_0 networkx 2.4 py_0 ninja 1.9.0 py37h74a9793_0 nltk 3.4.5 py37_0 nose 1.3.7 py37_2 notebook 6.0.3 py37_0 numba 0.48.0 py37h47e9c7a_0 numexpr 2.7.1 py37h25d0782_0 numpy 1.18.1 py37h93ca92e_0 numpy-base 1.18.1 py37hc3f5095_1 numpydoc 0.9.2 py_0 nvidia-ml-py3 7.352.0 py_0 fastai oauthlib 3.1.0 pypi_0 pypi olefile 0.46 py37_0 openpyxl 3.0.3 py_0 openssl 1.1.1f hfa6e2cd_0 conda-forge packaging 20.1 py_0 pandas 1.0.1 py37h47e9c7a_0 pandoc 2.2.3.2 0 pandocfilters 1.4.2 py37_1 param 1.9.3 py_0 paramiko 2.6.0 py37_0 parso 0.5.2 py_0 partd 1.1.0 py_0 path 13.1.0 py37_0 path.py 12.4.0 0 pathlib2 2.3.5 py37_0 pathspec 0.7.0 pypi_0 pypi pathtools 0.1.2 py_1 patsy 0.5.1 py37_0 pbr 5.4.4 pypi_0 pypi pep8 1.7.1 py37_0 pexpect 4.8.0 py37_0 pickleshare 0.7.5 py37_0 pillow 7.0.0 py37hcc1f983_0 pip 20.0.2 py37_1 pkginfo 1.5.0.1 py37_0 plac 0.9.6 py37_0 pluggy 0.13.1 py37_0 ply 3.11 py37_0 powershell_shortcut 0.0.1 2 preshed 2.0.1 py37h33f27b4_0 prometheus_client 0.7.1 py_0 prometheus_flask_exporter 0.12.2 py_0 conda-forge prompt_toolkit 3.0.3 py_0 properscoring 0.1 py_0 conda-forge protobuf 3.6.0 py37he025d50_1 conda-forge psutil 5.6.7 py37he774522_0 py 1.8.1 py_0 py-lief 0.9.0 py37ha925a31_2 pyarrow 0.13.0 py37ha925a31_0 pycodestyle 2.5.0 py37_0 pycosat 0.6.3 py37he774522_0 pycparser 2.19 py37_0 pycrypto 2.6.1 py37hfa6e2cd_9 pyct 0.4.6 py37_0 pycurl 7.43.0.5 py37h7a1dbc1_0 pydocstyle 4.0.1 py_0 pyflakes 2.1.1 py37_0 pygments 2.5.2 py_0 pyjwt 1.7.1 pypi_0 pypi pylint 2.4.4 py37_0 pynacl 1.3.0 py37h62dcd97_0 pyodbc 4.0.30 py37ha925a31_0 pyopenssl 19.1.0 py37_0 pyparsing 2.4.6 py_0 pypiwin32 223 pypi_0 pypi pyqt 5.9.2 py37h6538335_2 pyreadline 2.1 py37_1 pyrsistent 0.15.7 py37he774522_0 pysocks 1.7.1 py37_0 pytables 3.6.1 py37h1da0976_0 pytest 5.3.5 py37_0 pytest-arraydiff 0.3 py37h39e3cac_0 pytest-astropy 0.8.0 py_0 pytest-astropy-header 0.1.2 py_0 pytest-doctestplus 0.5.0 py_0 pytest-openfiles 0.4.0 py_0 pytest-remotedata 0.3.2 py37_0 python 
3.7.6 h60c2a47_2 python-dateutil 2.8.1 py_0 python-editor 1.0.4 py_0 conda-forge python-jsonrpc-server 0.3.4 py_0 python-language-server 0.31.7 py37_0 python-libarchive-c 2.8 py37_13 python-snappy 0.5.4 py37hd25c944_1 conda-forge python_abi 3.7 1_cp37m conda-forge pytorch 1.4.0 py3.7_cuda101_cudnn7_0 pytorch pytz 2019.3 py_0 pyviz_comms 0.7.3 py_0 pywavelets 1.1.1 py37he774522_0 pywin32 227 py37he774522_1 pywin32-ctypes 0.2.0 py37_1000 pywinpty 0.5.7 py37_0 pyyaml 5.3 py37he774522_0 pyzmq 18.1.1 py37ha925a31_0 qdarkstyle 2.8 py_0 qt 5.9.7 vc14h73c81de_0 qtawesome 0.6.1 py_0 qtconsole 4.6.0 py_1 qtpy 1.9.0 py_0 querystring_parser 1.2.4 py_0 conda-forge re2 2019.08.01 vc14ha925a31_0 regex 2020.1.8 pypi_0 pypi requests 2.22.0 py37_1 requests-oauthlib 1.3.0 pypi_0 pypi restructuredtext-lint 1.3.0 pypi_0 pypi rope 0.16.0 py_0 rtree 0.9.3 py37h21ff451_0 ruamel_yaml 0.15.87 py37he774522_0 scikit-image 0.16.2 py37h47e9c7a_0 scikit-learn 0.22.1 py37h6288b17_0 scipy 1.4.1 py37h9439919_0 seaborn 0.10.0 py_0 send2trash 1.5.0 py37_0 setuptools 45.2.0 py37_0 simplegeneric 0.8.1 py37_2 simplejson 3.17.0 py37hfa6e2cd_0 conda-forge singledispatch 3.4.0.3 py37_0 sip 4.19.8 py37h6538335_0 six 1.14.0 py37_0 smmap2 2.0.5 py_0 conda-forge snappy 1.1.7 h777316e_3 snowballstemmer 2.0.0 py_0 sortedcollections 1.1.2 py37_0 sortedcontainers 2.1.0 py37_0 soupsieve 1.9.5 py37_0 spacy 2.1.8 py37he980bc4_0 fastai sphinx 2.4.0 py_0 sphinxcontrib 1.0 py37_1 sphinxcontrib-applehelp 1.0.1 py_0 sphinxcontrib-devhelp 1.0.1 py_0 sphinxcontrib-htmlhelp 1.0.2 py_0 sphinxcontrib-jsmath 1.0.1 py_0 sphinxcontrib-qthelp 1.0.2 py_0 sphinxcontrib-serializinghtml 1.1.3 py_0 sphinxcontrib-websupport 1.2.0 py_0 spyder 4.0.1 py37_0 spyder-kernels 1.8.1 py37_0 sqlalchemy 1.3.13 py37he774522_0 sqlite 3.31.1 he774522_0 sqlparse 0.3.0 py_0 conda-forge srsly 0.1.0 py37h6538335_0 fastai statsmodels 0.11.0 py37he774522_0 stevedore 1.32.0 pypi_0 pypi sympy 1.5.1 py37_0 tabulate 0.8.6 py_0 conda-forge tbb 2020.0 h74a9793_0 tblib 1.6.0 py_0 terminado 0.8.3 py37_0 testpath 0.4.4 py_0 thinc 7.0.8 py37he980bc4_0 fastai thrift 0.11.0 py37h6538335_1001 conda-forge thrift-cpp 0.11.0 h1ebf3fd_3 tk 8.6.8 hfa6e2cd_0 toml 0.10.0 pypi_0 pypi toolz 0.10.0 py_0 torchvision 0.5.0 py37_cu101 pytorch tornado 6.0.3 py37he774522_3 tqdm 4.42.1 py_0 traitlets 4.3.3 py37_0 typed-ast 1.4.1 pypi_0 pypi ujson 1.35 py37hfa6e2cd_0 unicodecsv 0.14.1 py37_0 urllib3 1.25.8 py37_0 vc 14.1 h0510ff6_4 vs2015_runtime 14.16.27012 hf0eaf9b_1 waitress 1.4.3 py_0 conda-forge wasabi 0.2.2 py_0 fastai watchdog 0.10.2 py37_0 wcwidth 0.1.8 py_0 webencodings 0.5.1 py37_1 websocket-client 0.57.0 py37_0 conda-forge werkzeug 1.0.0 py_0 wheel 0.34.2 py37_0 widgetsnbextension 3.5.1 py37_0 win_inet_pton 1.1.0 py37_0 win_unicode_console 0.5 py37_0 wincertstore 0.2 py37_0 winpty 0.4.3 4 wrapt 1.11.2 py37he774522_0 xarray 0.15.0 py_0 conda-forge xlrd 1.2.0 py37_0 xlsxwriter 1.2.7 py_0 xlwings 0.17.1 py37_0 xlwt 1.3.0 py37_0 xskillscore 0.0.15 py_0 conda-forge xz 5.2.4 h2fa13f4_4 yaml 0.1.7 hc54c509_2 yapf 0.28.0 py_0 zeromq 4.3.1 h33f27b4_3 zict 1.0.0 py_0 zipp 2.2.0 py_0 zlib 1.2.11 h62dcd97_3 zstd 1.3.7 h508b16e_0
  • Are you running Dask locally or distributed? If distributed, what version.

distributed (2.10.1) using a LocalCluster.

from dask.distributed import Client
client = Client()
  • Is this parquet file one that was written to abfs with Dask? If no, does a simple read-write operation with another file work, and how was the existing parquet file created? If yes, does a read-write operation to-from CSV work successfully?
  • Have you recreated the problem with a minimal working example (small example dummy dataframe)? If so can you share that example so I can try to re-create your issue?

Good questions. I tackle them both in the MCVE code below.

I get EmptyDataError: No columns to parse from file with the csv files and AzureHttpError: Server encountered an internal error with the parquet file.


import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
client = Client()

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)

STORAGE_OPTIONS={'account_name': 'ACCOUNT_NAME',
                 'account_key': 'ACCOUNT_KEY'}
# This works fine and I see the files in Microsoft Azure Storage Explorer
dd.to_csv(df=ddf,
          filename='abfs://BLOB/FILE/*.csv',
          storage_options=STORAGE_OPTIONS)

df = dd.read_csv('abfs://tmp/tmp2/*.csv', storage_options=STORAGE_OPTIONS)
---------------------------------------------------------------------------
EmptyDataError                            Traceback (most recent call last)
<ipython-input-33-4ef0af5e9369> in <module>
----> 1 df = dd.read_csv('abfs://tmp/tmp2/*.csv', storage_options=STORAGE_OPTIONS)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    576             storage_options=storage_options,
    577             include_path_column=include_path_column,
--> 578             **kwargs
    579         )
    580 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    442 
    443     # Use sample to infer dtypes and check for presence of include_path_column
--> 444     head = reader(BytesIO(b_sample), **kwargs)
    445     if include_path_column and (include_path_column in head.columns):
    446         raise ValueError(

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    446 
    447     # Create the parser.
--> 448     parser = TextFileReader(fp_or_buf, **kwds)
    449 
    450     if chunksize or iterator:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    878             self.options["has_index_names"] = kwds["has_index_names"]
    879 
--> 880         self._make_engine(self.engine)
    881 
    882     def close(self):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
   1112     def _make_engine(self, engine="c"):
   1113         if engine == "c":
-> 1114             self._engine = CParserWrapper(self.f, **self.options)
   1115         else:
   1116             if engine == "python":

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

EmptyDataError: No columns to parse from file

# This works and I see it in Microsoft Azure Storage Explorer
dd.to_parquet(df=df,
              path='abfs://BLOB/FILE.parquet',
              storage_options=STORAGE_OPTIONS)

df = dd.read_parquet('abfs://tmp/tmp.parquet',
                     storage_options=STORAGE_OPTIONS)
ERROR:azure.storage.common.storageclient:Client-Request-ID=fe8a8c36-8120-11ea-a33c-a0afbd853445 Retry policy did not allow for a retry: Server-Timestamp=Sat, 18 Apr 2020 03:03:08 GMT, Server-Request-ID=a5160140-d01e-006b-642d-1518c8000000, HTTP status code=500, Exception=Server encountered an internal error. Please try again after some time. ErrorCode: InternalError<?xml version="1.0" encoding="utf-8"?><Error><Code>InternalError</Code><Message>Server encountered an internal error. Please try again after some time.RequestId:a5160140-d01e-006b-642d-1518c8000000Time:2020-04-18T03:03:09.2047334Z</Message></Error>.
AzureHttpError                            Traceback (most recent call last)
<ipython-input-35-0b3e24138208> in <module>
      1 df = dd.read_parquet('abfs://tmp/tmp.parquet',
----> 2                      storage_options=STORAGE_OPTIONS)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, chunksize, **kwargs)
    231         filters=filters,
    232         split_row_groups=split_row_groups,
--> 233         **kwargs
    234     )
    235     if meta.index.name is not None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py in read_metadata(fs, paths, categories, index, gather_statistics, filters, **kwargs)
    176         # correspond to a row group (populated below).
    177         parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
--> 178             fs, paths, gather_statistics, **kwargs
    179         )
    180 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py in _determine_pf_parts(fs, paths, gather_statistics, **kwargs)
    127                 open_with=fs.open,
    128                 sep=fs.sep,
--> 129                 **kwargs.get("file", {})
    130             )
    131             if gather_statistics is None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fastparquet\api.py in __init__(self, fn, verify, open_with, root, sep)
    109                 fn2 = join_path(fn, '_metadata')
    110                 self.fn = fn2
--> 111                 with open_with(fn2, 'rb') as f:
    112                     self._parse_header(f, verify)
    113                 fn = fn2

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in open(self, path, mode, block_size, cache_options, **kwargs)
    722                 autocommit=ac,
    723                 cache_options=cache_options,
--> 724                 **kwargs
    725             )
    726             if not ac:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in _open(self, path, mode, block_size, autocommit, cache_options, **kwargs)
    552             autocommit=autocommit,
    553             cache_options=cache_options,
--> 554             **kwargs,
    555         )
    556 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in __init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, **kwargs)
    582             cache_type=cache_type,
    583             cache_options=cache_options,
--> 584             **kwargs,
    585         )
    586 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in __init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, **kwargs)
    954         if mode == "rb":
    955             if not hasattr(self, "details"):
--> 956                 self.details = fs.info(path)
    957             self.size = self.details["size"]
    958             self.cache = caches[cache_type](

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in info(self, path, **kwargs)
    499         if out:
    500             return out[0]
--> 501         out = self.ls(path, detail=True, **kwargs)
    502         path = path.rstrip("/")
    503         out1 = [o for o in out if o["name"].rstrip("/") == path]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in ls(self, path, detail, invalidate_cache, delimiter, **kwargs)
    446             # then return the contents
    447             elif self._matches(
--> 448                 container_name, path, as_directory=True, delimiter=delimiter
    449             ):
    450                 logging.debug(f"{path} appears to be a directory")

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in _matches(self, container_name, path, as_directory, delimiter)
    386             prefix=path,
    387             delimiter=delimiter,
--> 388             num_results=None,
    389         )
    390 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\blob\baseblobservice.py in list_blob_names(self, container_name, prefix, num_results, include, delimiter, marker, timeout)
   1360                   '_context': operation_context,
   1361                   '_converter': _convert_xml_to_blob_name_list}
-> 1362         resp = self._list_blobs(*args, **kwargs)
   1363 
   1364         return ListGenerator(resp, self._list_blobs, args, kwargs)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\blob\baseblobservice.py in _list_blobs(self, container_name, prefix, marker, max_results, include, delimiter, timeout, _context, _converter)
   1435         }
   1436 
-> 1437         return self._perform_request(request, _converter, operation_context=_context)
   1438 
   1439     def get_blob_account_information(self, container_name=None, blob_name=None, timeout=None):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    444                                  status_code,
    445                                  exception_str_in_one_line)
--> 446                     raise ex
    447             finally:
    448                 # If this is a location locked operation and the location is not set,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    372                 except AzureException as ex:
    373                     retry_context.exception = ex
--> 374                     raise ex
    375                 except Exception as ex:
    376                     retry_context.exception = ex

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    358                         # and raised as an azure http exception
    359                         _http_error_handler(
--> 360                             HTTPError(response.status, response.message, response.headers, response.body))
    361 
    362                     # Parse the response

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\_error.py in _http_error_handler(http_error)
    113     ex.error_code = error_code
    114 
--> 115     raise ex
    116 
    117 

AzureHttpError: Server encountered an internal error. Please try again after some time. ErrorCode: InternalError
<?xml version="1.0" encoding="utf-8"?><Error><Code>InternalError</Code><Message>Server encountered an internal error. Please try again after some time.
RequestId:a5160140-d01e-006b-642d-1518c8000000
Time:2020-04-18T03:03:09.2047334Z</Message></Error>
hayesgb commented 4 years ago

I've just attempted to reproduce your example, and it worked on my end. Below is my code and results:

import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()

storage_options = <DEFINED>

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)
dd.to_csv(df=ddf,
          filename='abfs://<container>/test_csvfile/*.csv',
          storage_options=storage_options)
df2 = dd.read_csv("abfs://datascience-dev/test_csvfile/*.csv", storage_options=storage_options)
df2.head() <returns successfully in Jupyter Notebook>

dd.to_parquet(ddf,
          'abfs://datascience-dev/testfile.parquet',
          storage_options=storage_options)

df3 = dd.read_parquet("abfs://datascience-dev/testfile.parquet",
                     storage_options=storage_options)
df3.head() <returns successfully in Jupyter Notebook>

This was run on Linux with Anaconda Python (v3.6.7). I confirmed it works on my Windows 10 machine as well.

Package versions on my end: adlfs, fsspec, azure-storage-blob==2.1.0, azure-common==1.1.24, and azure-datalake-store==0.0.48. I see that you have azure-core installed, which I do not have and which is not a dependency; you may want to try removing it. Looking through other packages that are logical suspects, I also have requests 2.23 rather than 2.22.
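
If it helps with comparing environments, here is one quick way to dump the relevant versions (a sketch using importlib.metadata, so Python 3.8+; not something run as part of this thread):

from importlib.metadata import version, PackageNotFoundError

for pkg in ['adlfs', 'fsspec', 'dask', 'azure-storage-blob',
            'azure-common', 'azure-core', 'requests']:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, 'not installed')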

I will investigate further later today.

raybellwaves commented 4 years ago

Thanks a lot for running that. Regarding the packages, I'll try a new env.

raybellwaves commented 4 years ago

Here's a new environment. The error message is slightly different, but it's the same sort of thing: file(s) not found.

Create new env:

> conda create -n adlfs python=3.8
> conda activate adlfs
> pip install adlfs
> conda install -c conda-forge dask fastparquet ipython

Check packages:

> conda list: adal 1.2.2 pypi_0 pypi adlfs 0.2.0 pypi_0 pypi azure-common 1.1.25 pypi_0 pypi azure-datalake-store 0.0.48 pypi_0 pypi azure-storage-blob 2.1.0 pypi_0 pypi azure-storage-common 2.1.0 pypi_0 pypi backcall 0.1.0 py_0 conda-forge bokeh 2.0.1 py38h32f6830_0 conda-forge ca-certificates 2020.4.5.1 hecc5488_0 conda-forge certifi 2020.4.5.1 py38h32f6830_0 conda-forge cffi 1.14.0 pypi_0 pypi chardet 3.0.4 pypi_0 pypi click 7.1.1 pyh8c360ce_0 conda-forge cloudpickle 1.3.0 py_0 conda-forge colorama 0.4.3 py_0 conda-forge cryptography 2.9 pypi_0 pypi cytoolz 0.10.1 py38hfa6e2cd_0 conda-forge dask 2.14.0 py_0 conda-forge dask-core 2.14.0 py_0 conda-forge decorator 4.4.2 py_0 conda-forge distributed 2.14.0 py38h32f6830_0 conda-forge fastparquet 0.3.3 py38hc8d92b1_0 conda-forge freetype 2.10.1 ha9979f8_0 conda-forge fsspec 0.7.2 py_0 conda-forge heapdict 1.0.1 py_0 conda-forge idna 2.9 pypi_0 pypi intel-openmp 2020.0 166 ipython 7.13.0 py38h32f6830_2 conda-forge ipython_genutils 0.2.0 py_1 conda-forge jedi 0.17.0 py38h32f6830_0 conda-forge jinja2 2.11.2 pyh9f0ad1d_0 conda-forge jpeg 9c hfa6e2cd_1001 conda-forge libblas 3.8.0 15_mkl conda-forge libcblas 3.8.0 15_mkl conda-forge liblapack 3.8.0 15_mkl conda-forge libpng 1.6.37 hfe6a214_1 conda-forge libtiff 4.1.0 h885aae3_6 conda-forge llvmlite 0.31.0 py38h32f6830_1 conda-forge locket 0.2.0 py_2 conda-forge lz4-c 1.9.2 h33f27b4_0 conda-forge markupsafe 1.1.1 py38h9de7a3e_1 conda-forge mkl 2020.0 166 msgpack-python 1.0.0 py38heaebd3c_1 conda-forge numba 0.48.0 py38he350917_0 conda-forge numpy 1.18.1 py38ha749109_1 conda-forge olefile 0.46 py_0 conda-forge openssl 1.1.1f hfa6e2cd_0 conda-forge packaging 20.1 py_0 conda-forge pandas 1.0.3 py38he6e81aa_1 conda-forge parso 0.7.0 pyh9f0ad1d_0 conda-forge partd 1.1.0 py_0 conda-forge pickleshare 0.7.5 py38h32f6830_1001 conda-forge pillow 7.1.1 py38h8103267_0 conda-forge pip 20.0.2 py38_1 prompt-toolkit 3.0.5 py_0 conda-forge psutil 5.7.0 py38h9de7a3e_1 conda-forge pycparser 2.20 pypi_0 pypi pygments 2.6.1 py_0 conda-forge pyjwt 1.7.1 pypi_0 pypi pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge python 3.8.2 h5fd99cc_11 python-dateutil 2.8.1 py_0 conda-forge python_abi 3.8 1_cp38 conda-forge pytz 2019.3 py_0 conda-forge pyyaml 5.3.1 py38h9de7a3e_0 conda-forge requests 2.23.0 pypi_0 pypi setuptools 46.1.3 py38_0 six 1.14.0 py_1 conda-forge sortedcontainers 2.1.0 py_0 conda-forge sqlite 3.31.1 he774522_0 tblib 1.6.0 py_0 conda-forge thrift 0.11.0 py38h6538335_1001 conda-forge tk 8.6.10 hfa6e2cd_0 conda-forge toolz 0.10.0 py_0 conda-forge tornado 6.0.4 py38hfa6e2cd_0 conda-forge traitlets 4.3.3 py38h32f6830_1 conda-forge typing_extensions 3.7.4.1 py38h32f6830_3 conda-forge urllib3 1.25.9 pypi_0 pypi vc 14.1 h0510ff6_4 vs2015_runtime 14.16.27012 hf0eaf9b_1 wcwidth 0.1.9 pyh9f0ad1d_0 conda-forge wheel 0.34.2 py38_0 wincertstore 0.2 py38_0 xz 5.2.5 h2fa13f4_0 conda-forge yaml 0.2.3 he774522_0 conda-forge zict 2.0.0 py_0 conda-forge zlib 1.2.11 h2fa13f4_1006 conda-forge zstd 1.4.4 h9f78265_3 conda-forge

Setup code:

import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()

storage_options = <DEFINED>

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)

csv example:

dd.to_csv(df=ddf,
          filename='abfs://<container>/test_csvfile/*.csv',
          storage_options=storage_options)
df2 = dd.read_csv('abfs://<container>/test_csvfile/*.csv',
                  storage_options=storage_options)

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\csv.py", line 566, in read
    return read_pandas(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\csv.py", line 398, in read_pandas
    b_out = read_bytes(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\bytes\core.py", line 96, in read_bytes
    raise IOError("%s resolved to no files" % urlpath)
OSError: abfs://<container>/test_csvfile/*.csv resolved to no files

Print a few things using %debug:

ipdb> urlpath
'abfs://tmp/test_csvfile/*.csv'
ipdb> paths
[]
ipdb> b_lineterminator
b'\n'

parquet example:

dd.to_parquet(ddf,
             'abfs://<container>/testfile.parquet',
              storage_options=storage_options)

df3 = dd.read_parquet("abfs://<container>/testfile.parquet",
                      storage_options=storage_options)

Error message:

>>> df3 = dd.read_parquet("abfs://<container>/testfile.parquet", storage_options=storage_options)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\core.py", line 225, in read_parquet
    meta, statistics, parts = engine.read_metadata(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 202, in read_metadata
    parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 147, in _determine_pf_parts
    base, fns = _analyze_paths(paths, fs)
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\utils.py", line 405, in _analyze_paths
    basepath = path_parts_list[0][:-1]
IndexError: list index out of range

Print a few things using %debug:

ipdb> path_parts_list
[]
ipdb> file_list
[]
ipdb> paths
[]
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000019872422C70>
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\core.py(225)read_parquet()
    223         index = [index]
    224
--> 225     meta, statistics, parts = engine.read_metadata(
    226         fs,
    227         paths,

ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000019872422C70>
ipdb> paths
['tmp/testfile.parquet']
ipdb> gather_statistics
ipdb> 
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py(147)_determine_pf_parts()
    145         # This is a directory, check for _metadata, then _common_metadata
    146         paths = fs.glob(paths[0] + fs.sep + "*")
--> 147         base, fns = _analyze_paths(paths, fs)
    148         if "_metadata" in fns:
    149             # Using _metadata file (best-case scenario)

ipdb> paths
[]
ipdb> u
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py(202)read_metadata()
    200         # then each part will correspond to a file.  Otherwise, each part will
    201         # correspond to a row group (populated below).
--> 202         parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
    203             fs, paths, gather_statistics, **kwargs
    204         )

ipdb> paths
['tmp/testfile.parquet']
ipdb> paths[0]
'tmp/testfile.parquet'

It seems paths goes from ['tmp/testfile.parquet'] to [] at some point, I think around https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/fastparquet.py#L146
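
A minimal way to poke at that step directly (assuming the same account credentials as the storage_options above; this just mirrors the fs.glob call fastparquet makes, it is not code from the traceback):

from adlfs import AzureBlobFileSystem

storage_options = {'account_name': '<ACCOUNT_NAME>', 'account_key': '<ACCOUNT_KEY>'}
fs = AzureBlobFileSystem(**storage_options)
# _determine_pf_parts effectively does fs.glob(paths[0] + fs.sep + "*")
print(fs.glob('tmp/testfile.parquet' + fs.sep + '*'))
# On this Windows env the glob appears to come back as [] (see the %debug
# output above), which would explain the empty path_parts_list.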

I'll try pyarrow

raybellwaves commented 4 years ago

Create new env:

> conda create -n adlfs-pa python=3.8
> conda activate adlfs-pa
> pip install adlfs
> conda install -c conda-forge dask pyarrow ipython

Check packages:

> conda list: abseil-cpp 20200225.1 he025d50_2 conda-forge adal 1.2.2 pypi_0 pypi adlfs 0.2.0 pypi_0 pypi arrow-cpp 0.16.0 py38hd3bb158_3 conda-forge aws-sdk-cpp 1.7.164 vc14h867dc94_1 [vc14] conda-forge azure-common 1.1.25 pypi_0 pypi azure-datalake-store 0.0.48 pypi_0 pypi azure-storage-blob 2.1.0 pypi_0 pypi azure-storage-common 2.1.0 pypi_0 pypi backcall 0.1.0 py_0 conda-forge bokeh 2.0.1 py38h32f6830_0 conda-forge boost-cpp 1.72.0 h0caebb8_0 conda-forge brotli 1.0.7 he025d50_1001 conda-forge bzip2 1.0.8 hfa6e2cd_2 conda-forge c-ares 1.15.0 h2fa13f4_1001 conda-forge ca-certificates 2020.4.5.1 hecc5488_0 conda-forge certifi 2020.4.5.1 py38h32f6830_0 conda-forge cffi 1.14.0 pypi_0 pypi chardet 3.0.4 pypi_0 pypi click 7.1.1 pyh8c360ce_0 conda-forge cloudpickle 1.3.0 py_0 conda-forge colorama 0.4.3 py_0 conda-forge cryptography 2.9 pypi_0 pypi curl 7.69.1 h1dcc11c_0 conda-forge cytoolz 0.10.1 py38hfa6e2cd_0 conda-forge dask 2.14.0 py_0 conda-forge dask-core 2.14.0 py_0 conda-forge decorator 4.4.2 py_0 conda-forge distributed 2.14.0 py38h32f6830_0 conda-forge freetype 2.10.1 ha9979f8_0 conda-forge fsspec 0.7.2 py_0 conda-forge gflags 2.2.2 he025d50_1002 conda-forge glog 0.4.0 h0174b99_3 conda-forge grpc-cpp 1.28.1 hb1a2610_1 conda-forge heapdict 1.0.1 py_0 conda-forge idna 2.9 pypi_0 pypi intel-openmp 2020.0 166 ipython 7.13.0 py38h32f6830_2 conda-forge ipython_genutils 0.2.0 py_1 conda-forge jedi 0.17.0 py38h32f6830_0 conda-forge jinja2 2.11.2 pyh9f0ad1d_0 conda-forge jpeg 9c hfa6e2cd_1001 conda-forge krb5 1.17.1 hdd46e55_0 conda-forge libblas 3.8.0 15_mkl conda-forge libcblas 3.8.0 15_mkl conda-forge libcurl 7.69.1 h1dcc11c_0 conda-forge liblapack 3.8.0 15_mkl conda-forge libpng 1.6.37 hfe6a214_1 conda-forge libprotobuf 3.11.4 h1a1b453_0 conda-forge libssh2 1.8.2 h642c060_2 conda-forge libtiff 4.1.0 h885aae3_6 conda-forge locket 0.2.0 py_2 conda-forge lz4-c 1.9.2 h33f27b4_0 conda-forge markupsafe 1.1.1 py38h9de7a3e_1 conda-forge mkl 2020.0 166 msgpack-python 1.0.0 py38heaebd3c_1 conda-forge numpy 1.18.1 py38ha749109_1 conda-forge olefile 0.46 py_0 conda-forge openssl 1.1.1f hfa6e2cd_0 conda-forge packaging 20.1 py_0 conda-forge pandas 1.0.3 py38he6e81aa_1 conda-forge parquet-cpp 1.5.1 2 conda-forge parso 0.7.0 pyh9f0ad1d_0 conda-forge partd 1.1.0 py_0 conda-forge pickleshare 0.7.5 py38h32f6830_1001 conda-forge pillow 7.1.1 py38h8103267_0 conda-forge pip 20.0.2 py38_1 prompt-toolkit 3.0.5 py_0 conda-forge psutil 5.7.0 py38h9de7a3e_1 conda-forge pyarrow 0.16.0 py38h57df961_2 conda-forge pycparser 2.20 pypi_0 pypi pygments 2.6.1 py_0 conda-forge pyjwt 1.7.1 pypi_0 pypi pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge python 3.8.2 h5fd99cc_11 python-dateutil 2.8.1 py_0 conda-forge python_abi 3.8 1_cp38 conda-forge pytz 2019.3 py_0 conda-forge pyyaml 5.3.1 py38h9de7a3e_0 conda-forge re2 2020.04.01 vc14h6538335_0 [vc14] conda-forge requests 2.23.0 pypi_0 pypi setuptools 46.1.3 py38_0 six 1.14.0 py_1 conda-forge snappy 1.1.8 he025d50_1 conda-forge sortedcontainers 2.1.0 py_0 conda-forge sqlite 3.31.1 he774522_0 tblib 1.6.0 py_0 conda-forge thrift-cpp 0.13.0 h1907cbf_2 conda-forge tk 8.6.10 hfa6e2cd_0 conda-forge toolz 0.10.0 py_0 conda-forge tornado 6.0.4 py38hfa6e2cd_0 conda-forge traitlets 4.3.3 py38h32f6830_1 conda-forge typing_extensions 3.7.4.1 py38h32f6830_3 conda-forge urllib3 1.25.9 pypi_0 pypi vc 14.1 h0510ff6_4 vs2015_runtime 14.16.27012 hf0eaf9b_1 wcwidth 0.1.9 pyh9f0ad1d_0 conda-forge wheel 0.34.2 py38_0 wincertstore 0.2 py38_0 xz 5.2.5 h2fa13f4_0 conda-forge yaml 0.2.3 he774522_0 
conda-forge zict 2.0.0 py_0 conda-forge zlib 1.2.11 h2fa13f4_1006 conda-forge zstd 1.4.4 h9f78265_3 conda-forge

Setup code:

import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()

storage_options = <DEFINED>

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)

csv example:

dd.to_csv(df=ddf,
          filename='abfs://tmp/test_csvfile/*.csv',
          storage_options=storage_options)
df2 = dd.read_csv('abfs://tmp/test_csvfile/*.csv',
                  storage_options=storage_options)

Same error as above

parquet example:

dd.to_parquet(ddf,
             'abfs://tmp/testfile.parquet',
              storage_options=storage_options)

df3 = dd.read_parquet("abfs://tmp/testfile.parquet",
                      storage_options=storage_options)

Same error as above

Some output of %debug:

> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\utils.py(405)_analyze_paths()
    403     path_parts_list = [_join_path(fn).split("/") for fn in file_list]
    404     if root is False:
--> 405         basepath = path_parts_list[0][:-1]
    406         for i, path_parts in enumerate(path_parts_list):
    407             j = len(path_parts) - 1
ipdb> path_parts_list
[]
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\arrow.py(129)_determine_dataset_parts()
    127         # This is a directory, check for _metadata, then _common_metadata
    128         allpaths = fs.glob(paths[0] + fs.sep + "*")
--> 129         base, fns = _analyze_paths(allpaths, fs)
    130         if "_metadata" in fns and "validate_schema" not in dataset_kwargs:
    131             dataset_kwargs["validate_schema"] = False
ipdb> allpaths
[]
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\arrow.py(220)read_metadata()
    218         # then each part will correspond to a file.  Otherwise, each part will
    219         # correspond to a row group (populated below)
--> 220         parts, dataset = _determine_dataset_parts(
    221             fs, paths, gather_statistics, filters, kwargs.get("dataset", {})
    222         )
ipdb> paths
['tmp/testfile.parquet']
ipdb> parts
*** NameError: name 'parts' is not defined
ipdb> dataset
*** NameError: name 'dataset' is not defined
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000020136448D60>
ipdb> gather_statistics
ipdb> filters
ipdb>  

Using pyarrow instead of fastparquet doesn't seem to matter.
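
For completeness, the engine can also be forced explicitly instead of via separate environments (a sketch only, with the same path as earlier and placeholder credentials); given the identical debug output, both calls would be expected to fail the same way here:

import dask.dataframe as dd

storage_options = {'account_name': '<ACCOUNT_NAME>', 'account_key': '<ACCOUNT_KEY>'}
df_fp = dd.read_parquet('abfs://tmp/testfile.parquet',
                        engine='fastparquet', storage_options=storage_options)
df_pa = dd.read_parquet('abfs://tmp/testfile.parquet',
                        engine='pyarrow', storage_options=storage_options)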

raybellwaves commented 4 years ago

Just tested reading the csv file and it worked on my Linux machine, although I got the AzureHttpError for the parquet file. I was also curious about path:

> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/fsspec/spec.py(542)info()
    540         if out:
    541             return out[0]
--> 542         out = self.ls(path, detail=True, **kwargs)
    543         path = path.rstrip("/")
    544         out1 = [o for o in out if o["name"].rstrip("/") == path]

ipdb> path                                                                                                                    
'tmp/testfile.parquet/_metadata/_metadata'
> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/adlfs/core.py(576)__init__()
    574         self.blob = blob
    575 
--> 576         super().__init__(
    577             fs=fs,
    578             path=path,

ipdb> fs                                                                                                                      
<adlfs.core.AzureBlobFileSystem object at 0x7efdfca6fe80>
ipdb> path                                                                                                                    
'tmp/testfile.parquet/_metadata/_metadata'
ipdb>     
> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py(202)read_metadata()
    200         # then each part will correspond to a file.  Otherwise, each part will
    201         # correspond to a row group (populated below).
--> 202         parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
    203             fs, paths, gather_statistics, **kwargs
    204         )

ipdb> paths                                                                                                                   
['tmp/testfile.parquet']
raybellwaves commented 4 years ago

Reading the csv file worked fine on my Mac. Same AzureHttpError on the parquet file.

I see there are two things here: the csv read resolving to no files on Windows, and the AzureHttpError when reading the parquet file.

hayesgb commented 4 years ago

I've spent some time on this today. I can replicate your issue on my Windows machine, but it works as expected on Ubuntu and my Mac. I've found one compatibility issue with the 0.7.2 release of fsspec, which I will work on fixing tomorrow. Currently comparing package dependencies between Windows and Linux.

hayesgb commented 4 years ago

I just uploaded v0.2.2. Give it a shot and let me know if it works for you. It seems there was an issue with parsing container names on Windows, which should now be fixed. I also found a change in fsspec v0.6.3 that is causing adlfs to fail one of its unit tests; I need to verify everything is OK before allowing fsspec >= 0.6.3, so fsspec is pinned to 0.6.0 through 0.6.2 for now. Let me know if it solves your issue.

raybellwaves commented 4 years ago

Thanks. I'll try tomorrow

raybellwaves commented 4 years ago

Thanks @hayesgb! I was able to read in the csv file on my Windows machine.

Going to move the parquet file read to a separate issue