aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Augment dataframes with metadata from the origin file #2865

Open Samreay opened 1 week ago

Samreay commented 1 week ago

Right now, a third-party process saves JSON files into an S3 bucket for us. Each filename looks like <prefix>-<iso_datetime>.json.gz, and each file is the output of an endpoint that returns time-value pairs. For example, a file might look like:

[{"time": "2024-06-21T00:00:00Z", "value": 1.0}, ...]

We're loading these files via wr.s3.read_json, and ideally we want to take the most recently updated value for any particular time. Or, in pandas terms:

we_want = df_dataset.sort_values("fileModifiedTime").groupby("time").last()

I don't think this is possible right now, because there is no way to get information from either the filename or the file metadata when using wr.s3.read_json.

If we move away from awswrangler, we can do something like this with pyarrow.dataset: the dataset has a files attribute, and you can call pyarrow_dataset.filesystem.get_file_info(pyarrow_dataset.files) to get each file's metadata.
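
Roughly, that workaround looks like the sketch below. The bucket and prefix are placeholders, format="json" assumes a pyarrow release with JSON dataset support, and handling of the .gz compression may also depend on the version:

import pandas as pd
import pyarrow.dataset as ds

# Placeholder location; substitute the real bucket/prefix.
dataset = ds.dataset("s3://my-bucket/my-prefix/", format="json")

# Per-file metadata (including the modification time) for every file the dataset discovered.
infos = dataset.filesystem.get_file_info(dataset.files)
mtimes = {info.path: info.mtime for info in infos}

# Read each fragment on its own so its rows can be tagged with the
# modification time of the file they came from.
frames = []
for fragment in dataset.get_fragments():
    frame = fragment.to_table().to_pandas()
    frame["fileModifiedTime"] = mtimes[fragment.path]
    frames.append(frame)
df_dataset = pd.concat(frames, ignore_index=True)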

Describe the solution you'd like

It would be great if there were some way to augment the returned dataframe with metadata from the files that were loaded, such as the file's last modified time.

jaidisido commented 4 days ago

Under the hood, wr.s3.read_json simply leverages pd.read_json. We are limited by what that API exposes, and it does not look like it allows retrieving file metadata. In any case, given how specific your ask is, I would recommend using pyarrow.dataset instead, as you suggested.
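
For anyone who prefers to stay within awswrangler, one possible workaround (sketched here with a placeholder prefix, not a built-in feature) is to read each object individually and attach its LastModified from wr.s3.describe_objects:

import awswrangler as wr
import pandas as pd

# Placeholder prefix; substitute the real bucket/prefix of the JSON files.
path = "s3://my-bucket/my-prefix/"

# describe_objects returns per-object metadata keyed by object path,
# including the LastModified timestamp.
descriptions = wr.s3.describe_objects(path)

# Read each object on its own so its rows can be tagged with that object's LastModified.
frames = []
for obj_path, desc in descriptions.items():
    frame = wr.s3.read_json(obj_path)
    frame["fileModifiedTime"] = desc["LastModified"]
    frames.append(frame)
df_dataset = pd.concat(frames, ignore_index=True)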