aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Length mismatch when using read_parquet chunked #769

Open · samjeckert opened this issue 3 years ago

samjeckert commented 3 years ago

Describe the bug

When reading a large parquet file from S3 using read_parquet, I get errors like ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements. The expected-axis count matches the integer value passed to chunked (or 65_536 if chunked=True).

Traceback:

Traceback (most recent call last):
  File "refresh.py", line 3, in <module>
    scores.refresh_score_partitions()
  File "/Users/sameckert/aw/project_explorer/app/scores.py", line 152, in refresh_score_partitions
    for df in dfs:
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 400, in _read_parquet_chunked
    path_root=path_root,
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 295, in _arrowtable2df
    df = _apply_index(df=df, metadata=metadata)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 224, in _apply_index
    df.index = pd.RangeIndex(start=col["start"], stop=col["stop"], step=col["step"])
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 227, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements
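
A minimal sketch of the failure mode the traceback points at: _apply_index assigns a RangeIndex built from the parquet file's pandas metadata (which describes the whole file) to a DataFrame that only holds one chunk, and pandas rejects the length mismatch. The numbers are taken from the error above; the snippet illustrates only the pandas error, not awswrangler internals.

    import pandas as pd

    # One chunk of rows, as yielded by the chunked reader.
    chunk = pd.DataFrame({"a": range(75_536)})

    # Index reconstructed from file-level metadata covers all 6,741,043 rows.
    full_file_index = pd.RangeIndex(start=0, stop=6_741_043, step=1)

    # Raises ValueError: Length mismatch: Expected axis has 75536 elements,
    # new values have 6741043 elements
    chunk.index = full_file_index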

Environment

Provide your pip list output, particularly the version of the AWS Data Wrangler library you used. Providing this information may significantly improve resolution times.

asn1crypto==1.4.0; python_version >= "3.6" and python_version < "3.10"
awswrangler==2.9.0; python_version >= "3.6" and python_version < "3.10"
beautifulsoup4==4.9.3; python_version >= "3.6" and python_version < "3.10"
boto3==1.17.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
botocore==1.20.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0")
certifi==2021.5.30; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
click==7.1.2; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
dataclasses==0.8; python_version >= "3.6" and python_version < "3.7" and python_full_version >= "3.6.1"
et-xmlfile==1.1.0; python_version >= "3.6" and python_version < "3.10"
fastapi==0.63.0; python_version >= "3.6"
future==0.18.2; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.3.0"
h11==0.12.0; python_version >= "3.6"
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
jmespath==0.10.0; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
llvmlite==0.36.0; python_version >= "3.6" and python_version < "3.10"
lmdb==1.2.1
lxml==4.6.3; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
mysqlclient==2.0.3; python_version >= "3.5"
nmslib==2.1.1
numba==0.53.1; python_version >= "3.6" and python_version < "3.10"
numpy==1.19.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
openpyxl==3.0.7; python_version >= "3.6" and python_version < "3.10"
pandas==1.1.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pg8000==1.19.5; python_version >= "3.6" and python_version < "3.10"
psutil==5.8.0; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
pyarrow==4.0.1; python_version >= "3.6" and python_version < "3.10"
pyathena==2.3.0; python_full_version >= "3.6.1" and python_full_version < "4.0.0"
pybind11==2.6.1; python_version >= "2.7" and python_version < "3.0" or python_version > "3.0" and python_version < "3.1" or python_version > "3.1" and python_version < "3.2" or python_version > "3.2" and python_version < "3.3" or python_version > "3.3" and python_version < "3.4" or python_version > "3.4"
pydantic==1.8.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pymysql==1.0.2; python_version >= "3.6" and python_version < "3.10"
python-dateutil==2.8.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pytz==2021.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
redis==3.5.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
redshift-connector==2.0.882; python_version >= "3.6" and python_version < "3.10"
requests==2.25.1; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
retrying==1.3.3; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
s3transfer==0.4.2; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
scramp==1.4.0; python_version >= "3.6" and python_version < "3.10"
six==1.16.0; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
soupsieve==2.2.1; python_version >= "3.6" and python_version < "3.10"
standardiser==0.1.12
starlette==0.13.6; python_version >= "3.6"
tenacity==6.3.1; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
typing-extensions==3.10.0.0; python_full_version >= "3.6.1" and python_version >= "3.6" and python_version < "3.8"
urllib3==1.26.6; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_full_version >= "3.6.0" and python_version < "3.10" and python_version >= "3.6"
uvicorn==0.13.4

To Reproduce

Steps to reproduce the behavior.

Failing code:

import awswrangler as wr
import boto3

boto3_session = boto3.Session()
# Path is a placeholder from the report.
dfs = wr.s3.read_parquet("s3://large_file.parquet.gz", chunked=75_536, ignore_index=True, boto3_session=boto3_session)
for df in dfs:
    print(len(df.index))

P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.

kukushking commented 3 years ago

Hi @samjeckert , is there any chance there is something concurrently modifying the object?

samjeckert commented 3 years ago

There is no chance unfortunately. I temporarily got around this by downloading the file and then just using pyarrow iter_batches directly.

    import os
    import tempfile

    import awswrangler as wr
    import pyarrow.parquet as pq

    # This runs inside a generator function; `key` and `boto3_session` are
    # defined by the surrounding code.
    fh, temp_file = tempfile.mkstemp()
    os.close(fh)
    wr.s3.download(
        key,
        temp_file,
        use_threads=True,
        boto3_session=boto3_session,
    )

    pfile = pq.ParquetFile(temp_file)
    for batch in pfile.iter_batches():  # each batch is a pyarrow RecordBatch
        yield batch.to_pandas()
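
Presumably this works because the batches never pass through awswrangler's _apply_index step shown in the traceback above.
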
JahnKhan commented 3 years ago

Why was this closed without a fix? I don't want to download the data first; I want to be able to iterate over a list of parquet files. awswrangler gives the mentioned error when used with chunked=True.

kukushking commented 3 years ago

Hi @JahnKhan, I am unable to reproduce the issue - during all my tests with various sizes and chunksize values everything works fine, which suggests this is a data issue or a concurrent modification of the object in S3.

Is this intermittent or consistent in your case? Does it happen only on specific data? Can you provide steps to reproduce?

juan-yunis commented 2 years ago

I know this was closed last year, but I'm getting this issue too. I have some parquet files with 1M records in S3, and when trying to read_parquet with chunked=True, it throws the error, as if it were trying to read the entire parquet file instead of reading in chunks.

kukushking commented 2 years ago

@juan-yunis can you provide an example to reproduce this?

juan-yunis commented 2 years ago

@kukushking I created these parquet files with chunk_size=1000000, and after that I was reading them with chunked=True, but I get the Length mismatch error. It seems to be trying to load the entire file into memory instead of just the default chunk size (65_536).
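
For reference, a hedged sketch of the steps described, with a placeholder bucket/prefix and assuming wr.s3.to_parquet in dataset mode for the write:

    import awswrangler as wr
    import numpy as np
    import pandas as pd

    # Write ~1M rows to S3 (placeholder path).
    df = pd.DataFrame({"x": np.arange(1_000_000)})
    wr.s3.to_parquet(df=df, path="s3://my-bucket/repro/", dataset=True)

    # Read back in chunks; per the report above this raises
    # "ValueError: Length mismatch" instead of yielding chunks.
    for chunk in wr.s3.read_parquet("s3://my-bucket/repro/", chunked=True):
        print(len(chunk))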

Ritish-Madan commented 2 years ago

@kukushking Check this out; the same code works fine in version 1.8.1.

Yes, this is happening for me in AWS Wrangler.

I have a Parquet file with 5000 rows and want to read it in chunks of 1000. It throws:

Length mismatch: Expected axis has 1000 elements, new values have 5000 elements, where 1000 is my chunk size and 5000 is the total number of rows in the parquet file.

The script looks like:

# Inside a class method; self.bucket, self.object_key and self.chunk_size
# are defined elsewhere.
dfs = awswrangler.s3.read_parquet(
    path=f"s3://{self.bucket}/{self.object_key}",
    chunked=self.chunk_size,
)

for data in dfs:
    print(data)

The error is thrown when iterating over the chunked generator.

Note: I have ensured the file path is correct and this works fine in AWS Wrangler version 1.8.1

kukushking commented 2 years ago

@Ritish-Madan the fact that it works in 1.8.1 does give me additional info to be able to investigate this -- I'll have a look at what's changed, thanks!

NoelSAI commented 2 years ago

This issue was also happening for me on the latest awswrangler version (2.16.1). After some debugging I discovered that the "length mismatch" error only happens with newer versions of pyarrow. I was running pyarrow 7.0.0, and the error occurs with pyarrow >= 3.0 (I believe; I haven't pinned down the exact breaking version).

The reason awswrangler 1.8.1 works is that it lists pyarrow~=1.0.0 as a requirement. When I combined awswrangler==2.16.1 with pyarrow==2.0.0, the chunked parameter worked as intended with integer arguments.
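
Based on that finding, one stopgap is to pin the combination reported to work above (pip may need to be forced past awswrangler's declared pyarrow range, so treat this as an untested sketch):

    pip install "awswrangler==2.16.1" "pyarrow==2.0.0"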

rupshac commented 1 year ago

@kukushking is there any plan to address this?