aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Augment dataframes with metadata from the origin file #2865

Open Samreay opened 1 week ago

Samreay commented 1 week ago

Right now, a third-party process saves JSON files into an S3 bucket for us. Each filename looks like <prefix>-<iso_datetime>.json.gz, and each file is the output of an endpoint that returns time-value pairs. For example, a file might look like:

[{"time": "2024-06-21T00:00:00Z", "value": 1.0}, ...]

We're loading these files via wr.s3.read_json, and ideally we want to take the most recently updated value for any particular time. Or, in pandas terms:

we_want = df_dataset.sort_values("fileModifiedTime").groupby("time").last()

I don't think this is possible right now, because there is no way to get information from either the filename or the file metadata when using wr.s3.read_json.

If we move away from awswrangler, we can do something like this with pyarrow.dataset: the dataset has a files attribute, and you can call pyarrow_dataset.filesystem.get_file_info(pyarrow_dataset.files) to get each file's metadata.
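
Roughly, that workaround looks like the sketch below. The bucket and prefix are placeholders, format="json" assumes a pyarrow release with JSON dataset support, and handling of the .gz compression may also depend on the version:

import pandas as pd
import pyarrow.dataset as ds

# Placeholder location; substitute the real bucket/prefix.
dataset = ds.dataset("s3://my-bucket/my-prefix/", format="json")

# Per-file metadata (including the modification time) for every file the dataset discovered.
infos = dataset.filesystem.get_file_info(dataset.files)
mtimes = {info.path: info.mtime for info in infos}

# Read each fragment on its own so its rows can be tagged with the
# modification time of the file they came from.
frames = []
for fragment in dataset.get_fragments():
    frame = fragment.to_table().to_pandas()
    frame["fileModifiedTime"] = mtimes[fragment.path]
    frames.append(frame)
df_dataset = pd.concat(frames, ignore_index=True)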

Describe the solution you'd like

It would be great if there were some way to augment the returned dataframe with metadata from the files that were loaded, such as the file's last modified time.

jaidisido commented 4 days ago

Under the hood, wr.s3.read_json simply leverages pd.read_json. We are limited by what that API exposes, and it does not look like it allows retrieving file metadata. In any case, given how specific your ask is, I would recommend using pyarrow.dataset instead, as you suggested.
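
For anyone who prefers to stay within awswrangler, one possible workaround (sketched here with a placeholder prefix, not a built-in feature) is to read each object individually and attach its LastModified from wr.s3.describe_objects:

import awswrangler as wr
import pandas as pd

# Placeholder prefix; substitute the real bucket/prefix of the JSON files.
path = "s3://my-bucket/my-prefix/"

# describe_objects returns per-object metadata keyed by object path,
# including the LastModified timestamp.
descriptions = wr.s3.describe_objects(path)

# Read each object on its own so its rows can be tagged with that object's LastModified.
frames = []
for obj_path, desc in descriptions.items():
    frame = wr.s3.read_json(obj_path)
    frame["fileModifiedTime"] = desc["LastModified"]
    frames.append(frame)
df_dataset = pd.concat(frames, ignore_index=True)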