aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.9k stars 693 forks

Calling wr.s3.read_parquet_metadata with a path that doesn't exist throws IndexError #2842

Closed lucasmo closed 3 months ago

lucasmo commented 3 months ago

Describe the bug

Calling wr.s3.read_parquet_metadata with an S3 path that doesn't exist raises an IndexError instead of a meaningful exception:

[ERROR] IndexError: list index out of range
Traceback (most recent call last):
  File "/var/task/obscured/obscured.py", line 123, in do_a_thing
    column_types, _ = wr.s3.read_parquet_metadata(
  File "/opt/python/awswrangler/_config.py", line 715, in wrapper
    return function(**args)
  File "/opt/python/awswrangler/_utils.py", line 178, in inner
    return func(*args, **kwargs)
  File "/opt/python/awswrangler/s3/_read_parquet.py", line 846, in read_parquet_metadata
    columns_types, partitions_types, _ = _read_parquet_metadata(
  File "/opt/python/awswrangler/s3/_read_parquet.py", line 140, in _read_parquet_metadata
    return reader.read_table_metadata(
  File "/opt/python/awswrangler/s3/_read.py", line 280, in read_table_metadata
    merged_schemas = _validate_schemas(schemas=schemas, validate_schema=False)
  File "/opt/python/awswrangler/s3/_read.py", line 304, in _validate_schemas
    first: pa.schema = schemas[0]
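
The last frame of the traceback suggests that `schemas` is an empty list by the time `_validate_schemas` indexes it, which is what you would expect when no objects match the path. Below is a minimal sketch of the kind of guard that would turn this into a descriptive error. It is not the library's actual code: `first_schema` is a hypothetical helper, and the `NoFilesFound` class is a local stand-in for `awswrangler.exceptions.NoFilesFound` so the snippet runs on its own.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


class NoFilesFound(Exception):
    """Local stand-in for awswrangler.exceptions.NoFilesFound (assumption)."""


def first_schema(schemas: Sequence[T], path: str) -> T:
    """Return the first schema, or raise a descriptive error if there is none.

    Hypothetical helper: it only illustrates checking for an empty list
    before indexing, instead of letting schemas[0] raise IndexError.
    """
    if len(schemas) == 0:
        raise NoFilesFound(f"No files found on: {path}")
    return schemas[0]
```

With a check like this, an empty listing would surface as `NoFilesFound` with the offending path in the message rather than as `list index out of range`.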

How to Reproduce

import awswrangler
awswrangler.s3.read_parquet_metadata(path='s3://bucket-you-can-read/file-that-doesnt-exist')

Expected behavior

An exception such as exceptions.NoFilesFound is raised, or perhaps some kind of empty result is returned. It's unclear what the correct behavior should be, but it shouldn't be an IndexError :)
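
Until the behavior changes, a caller-side workaround could look roughly like the sketch below. It wraps the call and converts the IndexError into a clearer exception. `read_metadata_or_raise` and the local `NoFilesFound` class are hypothetical names for illustration. The reader is passed in as a parameter so the snippet runs without AWS access. One caveat: this treats any IndexError from the call as "no files found", which is an assumption and may hide unrelated bugs.

```python
from typing import Any, Callable


class NoFilesFound(Exception):
    """Local stand-in for awswrangler.exceptions.NoFilesFound (assumption)."""


def read_metadata_or_raise(reader: Callable[..., Any], path: str, **kwargs: Any) -> Any:
    """Call reader(path=..., **kwargs) and translate IndexError.

    Hypothetical wrapper: `reader` is expected to be something like
    wr.s3.read_parquet_metadata. Any IndexError it raises is assumed
    to mean that no objects matched `path`.
    """
    try:
        return reader(path=path, **kwargs)
    except IndexError as exc:
        # Keep the original error as the cause for debugging.
        raise NoFilesFound(f"No files found on: {path}") from exc
```

Usage would be along the lines of `column_types, _ = read_metadata_or_raise(wr.s3.read_parquet_metadata, "s3://bucket-you-can-read/file-that-doesnt-exist")`, with the caller catching `NoFilesFound`.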

OS

Linux

Python version

3.11

AWS SDK for pandas version

3.7.3
