Azure / azure-data-lake-store-python

Microsoft Azure Data Lake Store Filesystem Library for Python
MIT License

Unsuitable behaviour for reading parts of files #198

Closed xhochy closed 6 years ago

xhochy commented 7 years ago

Description

When using the package to access Azure Data Lake Store and read Apache Parquet files with pyarrow, we retrieve significantly more data than needed on selective reads. Using columnar file formats like Apache Parquet is essential for analytic access, as the engine reading these files can push parts of the query down into the storage layer; data that the engine would prune is then not even retrieved from storage.

As an example, for a 50 MB Parquet file we read the last 65536 bytes to load the metadata. This is then cached in RAM in an AzureDLFile object. As a follow-up, pyarrow requests the relevant columns, starting with the first in the file. These columns come in roughly 8 MB chunks, which would be suitable for loading in independent requests. But instead of loading only the relevant data, the code in https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/core.py#L768-L773 loads the data from the beginning of the first relevant column until the start of the previous request. In most cases this request spans 95% of the file, and subsequent requests are then served from the data already residing in memory.
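For clarity, here is a sketch of the seek/read pattern a Parquet reader issues against the file object (not actual pyarrow internals; the footer size, offsets and lengths are illustrative, and the adl object is the one from the reproduction steps below):

import os

# Illustrative values only; real offsets/lengths come from the Parquet footer.
FOOTER_BYTES = 65536
COLUMN_CHUNK_OFFSET = 0          # start of the first selected column chunk
COLUMN_CHUNK_LENGTH = 8 * 2**20  # roughly 8 MB per column chunk

with adl.open('test.parquet', 'rb') as f:
    # 1. Read the footer/metadata from the end of the file.
    f.seek(-FOOTER_BYTES, os.SEEK_END)
    footer = f.read(FOOTER_BYTES)

    # 2. Jump back to a selected column chunk; ideally only this range is
    #    downloaded, but the current code fetches everything between the
    #    chunk start and the start of the previous (footer) request.
    f.seek(COLUMN_CHUNK_OFFSET)
    data = f.read(COLUMN_CHUNK_LENGTH)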

For columnar reads, this situation is unfortunate: we have transferred nearly all the data, although a successful read of the file may only need 10% of it. In the general case such selective reads are not wanted, since they heavily increase the number of requests, but for formats like Parquet they are essential to take full advantage of the format. For comparison, the S3 implementation in the newest Hadoop version lets the user select different input policies: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A_Experimental_fadvise_input_policy_support In most cases one would choose sequential here, but for analytic reads/queries, random gives much better performance.
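As an illustration of what such a policy switch could look like in the read path (the helper function and the policy parameter below are hypothetical, not part of the current API), a "random" policy would clamp the fetched range to the requested bytes instead of extending it up to the previously cached request:

def plan_fetch(requested_start, requested_end, cache_start, blocksize,
               policy='sequential'):
    # Hypothetical helper, not in the library: choose the byte range to
    # download when a read starts before the currently cached region.
    if policy == 'random':
        # Fetch only the requested range, rounded up to one blocksize.
        length = max(requested_end - requested_start, blocksize)
        return requested_start, min(requested_start + length, cache_start)
    # 'sequential' (roughly the current behaviour): fill the whole gap up to
    # the start of the previous request so the cache stays contiguous, which
    # for a footer-first Parquet read means downloading most of the file.
    return requested_start, cache_start

With a random-style policy, the first column request in the scenario above would download roughly one ~8 MB chunk instead of most of the 50 MB file.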


Reproduction Steps

from azure.datalake.store import lib
from azure.datalake.store.core import AzureDLFileSystem

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Add secrets and store_name …
token = lib.auth(tenant_id, client_id=client_id, client_secret=client_secret)
adl = AzureDLFileSystem(token, store_name=store_name)

size = 1000000

df = pd.DataFrame({"col1": np.random.randint(0, 100, size=size),
                   "col2": np.random.randint(0, 100, size=size),
                   "col3": np.random.randn(size),
                   "col4": np.random.randn(size),
                   "col5": np.random.randn(size),
                   "col6": np.random.randn(size),
                   "col7": np.random.randn(size),
                   "col8": np.random.randn(size),
                   "col9": np.random.randn(size),
                   "col10": np.random.randn(size)})

table = pa.Table.from_pandas(df)
filename = 'test.parquet'
with adl.open(filename, 'wb') as f:
    pq.write_table(table, f)

# This should only read a small selection of the file; currently it reads about 90%
with adl.open(filename, 'rb') as f:
    table2 = pq.read_table(f, columns=['col2', 'col7'])
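A rough way to confirm how much data was pulled in (assuming the AzureDLFile keeps its buffered bytes in a cache attribute, as described above) is to inspect the cache right after the selective read:

with adl.open(filename, 'rb') as f:
    table2 = pq.read_table(f, columns=['col2', 'col7'])
    # A selective read of two columns out of ten should buffer far less
    # than the full ~50 MB file; currently it ends up with nearly all of it.
    print("bytes cached in RAM:", len(f.cache) if f.cache else 0)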

Environment summary

SDK Version: 0.0.17

Python Version: 64bit, Python 3.6

OS Version: macOS Sierra

Shell Type: zsh

asikaria-msft commented 7 years ago

Ack on the issue, this is an incorrect assumption in the reading code. Thanks for the detailed report and investigation!

milanchandna commented 6 years ago

Hi, this issue has been fixed in the new version 0.0.18, available here. Hence, closing this issue. Please try it out and add your feedback here. Thanks, Azure Team.