Open Samreay opened 1 week ago
Under the hood wr.s3.read_json
is simply leveraging pd.read_json. We are limited by what this API exposes and it does not look like it allows to retrieve metadata. In any case, given how specific your ask is I would recommend to use pyarrow.dataset
instead as you suggested
Right now, a third party process saves out json files into an S3 bucket for us. The filename looks like
<prefix>-<iso_datetime>.json.gz
, and each file is the output of an endpoint that specifies a time-value pair. For example, the file might look like:We're loading these files in via
wr.s3.read_json
and ideally we want to be able to take the latest updated value for any particular time. Or in pandas terminology:I don't think this is possible right now, because there is no way to get information from either filename or file metadata using the
wr.s3.read_json
method.If we move away from awswrangler, we can do something like this using
pyarrow.dataset
, as the dataset has afiles
attribute and you can callpyarrow_dataset.filesystem.get_file_info(pyarrow_dataset.files)
Describe the solution you'd like
It would be great if there was some way to augment the returned dataframe with metadata coming from the files that were loaded, such as the LastModifiedTime.