Data Commons only returning first 1000 file names in `get_data_commons_index()`

USEPA / esupy

A library supporting Python-based tools in USEPA's tool ecosystem

Data Commons only returning first 1000 file names in `get_data_commons_index()` #38

Closed bl-young closed 2 years ago

bl-young commented 2 years ago

The flowsa/FlowByActivity prefix holds more than 1000 objects, so the bucket listing comes back truncated:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>edap-ord-data-commons</Name>
<Prefix>flowsa/FlowByActivity/</Prefix>
<Marker/>
<MaxKeys>1000</MaxKeys>
<IsTruncated>true</IsTruncated>

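Note `<IsTruncated>true</IsTruncated>`: the response stops at `MaxKeys` (1000) and the remaining file names are not returned.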
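For reference, a truncated listing like the one above can be followed page by page without any new dependency. This is only a rough sketch, not what esupy does: `fetch_page`, `list_all_keys`, and the `base_url` argument are hypothetical names, and it assumes the bucket's listing endpoint accepts the standard S3 `prefix` and `marker` query parameters.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace taken from the ListBucketResult response shown above.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def fetch_page(base_url, prefix, marker=""):
    """Fetch one ListBucketResult page (at most 1000 keys) as XML text."""
    query = urllib.parse.urlencode({"prefix": prefix, "marker": marker})
    with urllib.request.urlopen(f"{base_url}?{query}") as resp:
        return resp.read().decode("utf-8")


def list_all_keys(base_url, prefix, fetch=fetch_page):
    """Collect every key under prefix by following truncated listings."""
    keys, marker = [], ""
    while True:
        root = ET.fromstring(fetch(base_url, prefix, marker))
        page = [el.text for el in root.iter(f"{S3_NS}Key")]
        keys.extend(page)
        truncated = root.findtext(f"{S3_NS}IsTruncated") == "true"
        if not truncated or not page:
            return keys
        # Without a delimiter S3 omits NextMarker, so resume after the last key.
        marker = page[-1]
```

The `fetch` argument is injectable so the loop can be checked against canned XML instead of the live bucket.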

bl-young commented 2 years ago

@WesIngwersen I'm not sure if this is something that we will be able to work around given how we look for parquet files. cc @catherinebirney

bl-young commented 2 years ago

Looks like we could use boto3, the AWS Python package.

It looks like there is a way to access, filter, and read the object metadata from public buckets.
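A minimal sketch of what that could look like, assuming anonymous (unsigned) access to the bucket is allowed. `list_keys` is a hypothetical helper name, not an existing esupy function; the boto3 paginator handles the 1000-key pages internally.

```python
def list_keys(bucket, prefix, client=None):
    """Return every object key under prefix, across all listing pages."""
    if client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        # Unsigned requests: no AWS credentials needed for a public bucket.
        client = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

Something like `list_keys("edap-ord-data-commons", "flowsa/FlowByActivity/")` would then return the full set of names rather than the first 1000. Each entry in `Contents` also carries `LastModified` and `Size` if we need to filter on metadata.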

WesIngwersen commented 2 years ago

We could archive older files, right?

bl-young commented 2 years ago

Yes, at the risk of making old models unbuildable (perhaps less concerning for FBAs where the current issue is). We currently have 2267 files in flowsa/FlowByActivity.

I believe https://github.com/USEPA/esupy/commit/44e5420206c56d18c90bf600d7563eb239618d7f solves the issue, but it makes us dependent on a new package.

bl-young commented 2 years ago

I created a new test (#39) to access and download a file from Data Commons (notably, I picked USDA as it's toward the end alphabetically). It's currently failing on main, but passes on #40.