[Documentation][Parquet] Reading parquet and memory mapping

apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

https://arrow.apache.org/

Apache License 2.0


[Documentation][Parquet] Reading parquet and memory mapping #39005

Open LucaMaurelliC opened 9 months ago

LucaMaurelliC commented 9 months ago

Describe the enhancement requested

I'm learning how to use the Arrow library in Python to read .parquet files concurrently from a local file system or a cloud file system. Reading about memory mapping here, I didn't understand whether I should enable it or not, and why. What are the pros and cons?

Would you mind elaborating on these points so that users can evaluate and decide whether it is important?

Component(s)

Documentation

mapleFU commented 9 months ago

Memory mapping is used for the local filesystem: Arrow will use a "named" mmap instead of calling ReadAt directly on the file.

For files in the cloud, enabling PreBuffer might help. It will merge collapsed IO and send the IO in large chunks.

LucaMaurelliC commented 9 months ago

What do you mean by collapsed IO? Also, do fsspec and its Azure Blob Storage filesystem implementation already make use of pre-buffering? From the Arrow docs: "Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. This option is only supported for use_legacy_dataset=False. If using a filesystem layer that itself performs readahead (e.g. fsspec’s S3FS), disable readahead for best results."

mapleFU commented 9 months ago

Yeah, Arrow has an internal "pre_buffer" config. Enabling it makes the Parquet reader issue all necessary IO up front and buffer the results in memory.

When reading a local file, Arrow might just call ReadAt to read each small Parquet page, because it regards a local read as a lightweight operation. The same strategy might cause lots of Get calls for cloud storage. So it will try to "collapse" the read requests: it merges adjacent requests together to avoid fragmented read calls.
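The idea of "collapsing" reads can be sketched in plain Python. This is only an illustration of the general technique, not Arrow's actual implementation: the function name `coalesce_ranges`, the `hole_size_limit` parameter as used here, and the sample offsets are all assumptions made for the example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_ranges(ranges: List[Range], hole_size_limit: int = 0) -> List[Range]:
    """Merge byte ranges whose gap is at most hole_size_limit bytes."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            # Adjacent, overlapping, or separated by a small enough hole:
            # extend the previous request instead of issuing a new one.
            if offset - prev_end <= hole_size_limit:
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Four small page reads (hypothetical offsets) become two larger requests.
page_reads = [(0, 100), (100, 50), (160, 40), (1000, 200)]
print(coalesce_ranges(page_reads, hole_size_limit=16))
```

Allowing a small hole trades a few wasted bytes for fewer round trips, which tends to pay off when each request has high latency.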