Reading a large Pickle file from S3
S3 File size: 12.5 GB
Approach 1:
Reading the file using the pandas read_pickle function and passing the S3 URI as input. Pandas internally uses s3fs to read from S3.
pd.read_pickle(s3_uri)
Time taken ~16.5min (990 sec)
Approach 2:
Getting the file using boto3 and passing it directly to the pandas read_pickle function:
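The exact snippet isn't reproduced here, but it was roughly along these lines (bucket and key are placeholders; the whole object is read into memory before unpickling):

```python
import io

import boto3
import pandas as pd

# Placeholder bucket/key
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="large_file.pkl")

# Read the object into an in-memory buffer and unpickle it with pandas
buffer = io.BytesIO(obj["Body"].read())
df = pd.read_pickle(buffer)
```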
Time taken ~3min (180 sec)

Why the HUGE difference?
I ran a few experiments, changing default_block_size and default_cache_type:
pd.read_pickle(s3_uri, storage_options={"default_block_size": block_size, "default_cache_type": cache_type})
S3fs defines default_block_size as 5 MB and default_cache_type as "bytes".
The experiments suggest that changing default_cache_type to "readahead" would give a good read-performance improvement. Let me know your thoughts. I'd also like to know why "bytes" was chosen as the default cache_type for s3fs.
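For reference, the fastest readahead configuration in the table below (20 MB blocks) would be passed like this; the URI is a placeholder:

```python
import pandas as pd

# Placeholder URI; replace with your own bucket/key
s3_uri = "s3://my-bucket/large_file.pkl"

# 20 MB blocks + readahead caching (the best readahead result in my runs)
df = pd.read_pickle(
    s3_uri,
    storage_options={
        "default_block_size": 4 * 5 * 2**20,  # 20 MB (4x the s3fs default)
        "default_cache_type": "readahead",
    },
)
```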
The following table outlines the experiments:
Note: Most configurations were run only once, so the read times can vary by a few seconds.
| S3 File Size | Block Size | Cache Type | Total Read Time | RAM on instance |
| -- | -- | -- | -- | -- |
| 12.5 GB | 5 MB (s3fs default) | bytes | 993 sec | 32 GB |
| 12.5 GB | 10 MB (2× default) | bytes | 696 sec | 32 GB |
| 12.5 GB | 20 MB (4× default) | bytes | 540 sec | 32 GB |
| 12.5 GB | 40 MB (8× default) | bytes | 536 sec | 32 GB |
| 12.5 GB | 5 MB (s3fs default) | readahead | 586 sec | 32 GB |
| 12.5 GB | 10 MB (2× default) | readahead | 430 sec | 32 GB |
| 12.5 GB | 20 MB (4× default) | readahead | 349 sec | 32 GB |
| 12.5 GB | 40 MB (8× default) | readahead | 466 sec | 32 GB |
| 12.5 GB | 5 MB (s3fs default) | all | Out of Memory | 32 GB |
| 12.5 GB | 5 MB (s3fs default) | all | 185 sec | 64 GB |
| 12.5 GB | 5 MB (s3fs default) | mmap | 656 sec | 32 GB |
| 12.5 GB | 40 MB (8× default) | mmap | 269 sec | 32 GB |
| 12.5 GB | 5 MB (s3fs default) | block | 617 sec | 32 GB |
| 12.5 GB | 40 MB (8× default) | block | 272 sec | 32 GB |
| 12.5 GB | 5 MB (s3fs default) | first | too long, stopped experiment | 32 GB |
| 12.5 GB | 5 MB (s3fs default) / 40 MB | parts | too long, stopped experiment | 32 GB |

Thanks!