Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

s3.list_recursive() lists files from folders that have the same prefix #569

Open BartBaddeley opened 3 years ago

BartBaddeley commented 3 years ago

If one folder on S3 has another folder's name as the start of its own name, then s3.list_recursive() on the shorter-named folder also lists the files in the longer-named one:

's3://elevate-analytics-etl-output/analytics_data/analytics_data_21_05_28_1625_dedup/'
's3://elevate-analytics-etl-output/analytics_data/analytics_data_21_05_28_1625_dedup_version_2/'

The following returns all of the files in both folders:

with S3(s3root='s3://elevate-analytics-etl-output/analytics_data/analytics_data_21_05_28_1625_dedup/') as s3:
    for key in s3.list_recursive():
        ...
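A possible workaround in the meantime is to filter the listed objects against the exact prefix, trailing slash included. This is only a sketch, assuming the S3Object entries returned by list_recursive() expose a url attribute:

```python
from metaflow import S3

PREFIX = 's3://elevate-analytics-etl-output/analytics_data/analytics_data_21_05_28_1625_dedup/'

with S3(s3root=PREFIX) as s3:
    for obj in s3.list_recursive():
        # Keep only keys that really live under the trailing-slash-terminated
        # folder; objects under '..._dedup_version_2/' share the shorter prefix
        # but their full URLs do not start with PREFIX.
        if obj.url.startswith(PREFIX):
            print(obj.url)
```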

Also, the docstring at https://github.com/Netflix/metaflow/blob/bd585a470468741e0a74f8d285e5560dd4d1e75a/metaflow/datatools/s3.py#L427-L457 says "keys: (required) a list of suffixes for paths to list."

I think this should be prefixes, not suffixes?

romain-intel commented 3 years ago

Thank you for this. I think this is caused by https://github.com/Netflix/metaflow/blob/master/metaflow/datatools/s3.py#L303. I will see if the fix is simply to remove this (I need to make sure I am not missing something).
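For reference, the over-broad listing matches plain S3 prefix semantics: if the trailing '/' were dropped from the prefix before the underlying list call, both folders would match. A minimal boto3 sketch of that behaviour (assuming read access to the bucket; this only illustrates S3 prefix matching, not Metaflow's internal call):

```python
import boto3

s3 = boto3.client("s3")
bucket = "elevate-analytics-etl-output"

# With the trailing slash, only objects under the intended folder match.
exact = s3.list_objects_v2(
    Bucket=bucket,
    Prefix="analytics_data/analytics_data_21_05_28_1625_dedup/",
)

# Without it, S3 prefix matching also picks up
# 'analytics_data/analytics_data_21_05_28_1625_dedup_version_2/'.
broad = s3.list_objects_v2(
    Bucket=bucket,
    Prefix="analytics_data/analytics_data_21_05_28_1625_dedup",
)

print(len(exact.get("Contents", [])), len(broad.get("Contents", [])))
```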