samjeckert opened this issue 3 years ago
Hi @samjeckert, is there any chance something is concurrently modifying the object?
There is no chance, unfortunately. I temporarily got around this by downloading the file and then using pyarrow's `iter_batches` directly:
```python
import os
import tempfile

import awswrangler as wr
import pyarrow.parquet as pq

# Inside a generator: download the object to a local temp file,
# then let pyarrow handle the batching.
fh, temp_file = tempfile.mkstemp()
os.close(fh)
wr.s3.download(
    key,  # s3:// URI of the parquet object
    temp_file,
    use_threads=True,
    boto3_session=boto3_session,
)
pfile = pq.ParquetFile(temp_file)
for batch in pfile.iter_batches():
    yield batch.to_pandas()
```
Why is this closed without a fix? I don't want to download the data first; I want to be able to iterate over a list of parquet files. awswrangler gives the mentioned error when used with `chunked=True`.
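For reference, a minimal sketch of the pattern being described, with placeholder bucket and keys:

```python
import awswrangler as wr

# Placeholder paths; on affected versions the loop below raises
# "ValueError: Length mismatch: ..." instead of yielding chunks.
paths = [
    "s3://my-bucket/part-0.parquet",
    "s3://my-bucket/part-1.parquet",
]
for df in wr.s3.read_parquet(path=paths, chunked=True):
    print(len(df))
```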
Hi @JahnKhan, I am unable to reproduce the issue - in all my tests with various sizes and chunksize values, everything works fine, which suggests this is a data issue or a concurrent modification of the object in S3.
Is this intermittent or consistent in your case? Does it happen only on specific data? Can you provide steps to reproduce?
I know this is closed from last year, but I'm getting this issue too. I have some parquet files with 1M records in S3, and when trying to `read_parquet` with `chunked=True`, it throws the error, as if it were trying to read the entire parquet file instead of reading by chunks.
@juan-yunis can you provide an example to reproduce this?
@kukushking I created these parquet files with `chunk_size=1000000`, and after that I was reading them with `chunked=True`, but I get the `Length mismatch` error. It seems it's trying to load the entire file into memory instead of just the default chunksize (65,536).
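For illustration, what the chunked reader should do on an unaffected install (hypothetical 1M-row file; 65,536 is the default chunk size):

```python
import awswrangler as wr

# Each yielded frame should hold at most one chunk
# (65,536 rows by default), never the entire file.
for df in wr.s3.read_parquet("s3://my-bucket/million-rows.parquet", chunked=True):
    assert len(df) <= 65_536
```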
@kukushking Check this out; also, this same code works fine in version 1.8.1.
Yes, this is happening for me in AWS Wrangler. I have a parquet file with 5000 rows and want to read it in chunks of 1000. It throws:

`Length mismatch: Expected axis has 1000 elements, new values have 5000 elements`

where 1000 is my chunk size and 5000 is the total number of rows in the parquet file.
The script looks like:
```python
dfs = awswrangler.s3.read_parquet(
    path=f"s3://{self.bucket}/{self.object_key}",
    chunked=self.chunk_size,
)
for data in dfs:
    print(data)
```
When looping over the chunked generator, this error is thrown.
Note: I have ensured the file path is correct, and this works fine in AWS Wrangler version 1.8.1.
@Ritish-Madan the fact that it works in 1.8.1 does give me additional info to be able to investigate this -- I'll have a look at what's changed, thanks!
This issue was also happening for me on the latest `awswrangler` version (2.16.1). After some debugging, I discovered that the "length mismatch" error only happens with newer versions of `pyarrow`. I was running `pyarrow` 7.0.0, and the error occurs with `pyarrow >= 3.0` (I believe; I don't have the exact breaking version).
The reason `awswrangler` version 1.8.1 works is that it lists `pyarrow~=1.0.0` as a requirement. When I combined `awswrangler==2.16.1` with `pyarrow==2.0.0`, the `chunked` parameter worked as intended with integer arguments.
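For anyone else hitting this, a minimal sketch of the combination that worked for me (versions as above; the path is a placeholder, not an official fix):

```python
# Reported-working pin from the comment above:
#   pip install "awswrangler==2.16.1" "pyarrow==2.0.0"
import awswrangler as wr

# With the pin in place, integer chunked yields frames of at most 1000 rows.
for df in wr.s3.read_parquet("s3://my-bucket/data.parquet", chunked=1000):
    print(len(df))
```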
@kukushking is there any plan to address this?
Describe the bug
When reading a large parquet file from S3 using `read_parquet`, I get errors like `ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements`. The expected axis value matches the integer value of `chunked` (or 65_536 if `chunked=True`).

Traceback:
Environment
Provide your `pip list` output, particularly the version of the AWS Data Wrangler library you used. Providing this information may significantly improve resolution times.

To Reproduce
Steps to reproduce the behavior.
Failing code:
P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.