LucaMaurelliC opened 9 months ago
Memory mapping is used for the local filesystem: it will use a "named" mmap
instead of directly calling ReadAt
on the file
For files in the cloud, enabling PreBuffer
could help. It will merge ("collapse") adjacent IO requests and issue the IO in large chunks
What do you mean by collapsed IO? Also, do fsspec and its Azure Blob Storage filesystem implementation exploit the PreBuffering already? From the arrow doc: "Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. This option is only supported for use_legacy_dataset=False. If using a filesystem layer that itself performs readahead (e.g. fsspec’s S3FS), disable readahead for best results."
Yeah, arrow has an internal `pre_buffer` config; enabling it makes read-parquet issue all necessary IO up front and buffer it in memory
When reading a local file, arrow might just call ReadAt
to read each small Parquet page, because it regards local reads as lightweight operations. The same strategy could cause lots of Get
calls against cloud storage, so there it tries to "collapse" the read requests: it merges adjacent reads together to avoid fragmented read calls.
Describe the enhancement requested
I'm learning how to use the arrow library in Python to read .parquet files concurrently from a local file system or a cloud file system. Reading about memory mapping here, I didn't understand whether I should enable it or not, and why. What are the pros and cons?
Would you mind elaborating more on these points so that users are able to evaluate and decide whether it is important?
Component(s)
Documentation