Ack on the issue, this is an incorrect assumption in the reading code. Thanks for the detailed report and investigation!
Hi, this issue has been fixed in the new version 0.0.18, available here. Hence closing this issue. Please try it out and add your feedback here. Thanks, Azure Team.
Description
When using the package to access Azure Data Lake Store to read Apache Parquet files with `pyarrow`, we retrieve significantly more data than needed on selective reads. Using columnar file formats like Apache Parquet is essential for analytic access, as the engine reading these files can push parts of the query down into the storage layer; data that the engine would prune is then never retrieved from storage.

As an example, for a 50MB Parquet file we first read the last 65536 bytes to load the metadata. This is then cached in RAM in an `AzureDLFile` object. As a follow-up, `pyarrow` requests the relevant columns, starting with the first one in the file. These columns come in roughly 8MB chunks that could be loaded in independent requests. But instead of loading only the relevant data, the code in https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/core.py#L768-L773 loads everything from the beginning of the first relevant column up to the start of the previous request. In most cases this single request spans 95% of the file, and subsequent requests are then served from memory as the data already resides there (see the sketch below).

For columnar reads this situation is unfortunate: we have transferred nearly all of the data, while a successful read of the file may only need 10% of it. In the usual case such selective reads are not wanted, as they heavily increase the number of requests, but for formats like Parquet they are essential to take full benefit of the format. For comparison, the S3A implementation in the newest Hadoop version lets the user select different input policies: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A_Experimental_fadvise_input_policy_support In most cases one would choose `sequential` here, but for analytic reads/queries `random` gives much better performance.
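To make the access pattern concrete, here is a simplified model of the read-ahead behaviour described above. This is a sketch of the effect, not the actual `core.py` code, and the numbers just mirror the 50MB example:

```python
# Simplified model of the 0.0.17 read path: the cache is kept contiguous,
# so a read that seeks back before the cached block downloads everything
# between the new offset and the start of what was already fetched.
FOOTER_BYTES = 65536
FILE_SIZE = 50 * 2**20                  # the 50MB example file

cache_start = FILE_SIZE - FOOTER_BYTES  # Parquet footer already cached


def fetched_range(offset, length):
    """Return the byte range actually downloaded for a logical read."""
    global cache_start
    if offset >= cache_start:
        return None                      # served entirely from the cache
    fetched = (offset, cache_start)      # fill the gap up to the old cache
    cache_start = offset
    return fetched


# pyarrow now seeks to the first relevant ~8MB column chunk:
print(fetched_range(4 * 2**20, 8 * 2**20))
# (4194304, 52363264) -> ~46MB transferred for an 8MB logical read
```

A `random`-style policy in the sense of the Hadoop fadvise option would instead fetch only `offset` to `offset + length` (plus perhaps a bounded read-ahead), keeping the transfer proportional to the data actually requested.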
Reproduction Steps
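A minimal sketch that should trigger the behaviour; the store name, file path, and column name below are hypothetical placeholders:

```python
from azure.datalake.store import core, lib
import pyarrow.parquet as pq

# Interactive login; any of the other auth flows works as well.
token = lib.auth()
adl = core.AzureDLFileSystem(token, store_name='mystore')  # hypothetical store

# Open a ~50MB Parquet file and read a single column. With 0.0.17 the
# AzureDLFile cache fetches nearly the whole file for this read instead
# of just the footer and the one column chunk.
with adl.open('/data/example.parquet', 'rb') as f:  # hypothetical path
    pf = pq.ParquetFile(f)
    table = pf.read(columns=['one_column'])  # hypothetical column name
```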
Environment summary
SDK Version: 0.0.17
Python Version: 64bit, Python 3.6
OS Version: macOS Sierra
Shell Type: zsh