datahubio / datahub-v2-pm

Project management (issues only)

Some files on S3 are not publicly accessible while they should be #205

Closed zelima closed 6 years ago

zelima commented 6 years ago

Originally coming from https://github.com/datahq/datahub-qa/issues/235

While processing some datapackages, some of the data ends up public and some not, which is very strange behavior, since a single processor, change_acl, is applied to all keys prefixed {owner}/{dataset} in the packagestore.

Acceptance Criteria

Tasks

Analysis

Reason

Looking at the logs of the processing flow for finance-vix, I see that when the permission-changing processor runs at the end (at the beginning of the process all keys are private), it lists all the keys in the bucket with the prefix finance-vix -> meaning it re-changes the ACL each time a new revision comes in.
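
For reference, a minimal sketch of what such a step looks like with boto3; the bucket name, prefix and ACL value here are illustrative assumptions, not the actual pipeline code:

```python
import boto3

s3 = boto3.client('s3')

# Hypothetical values for illustration; the real processor gets these from config.
BUCKET = 'packagestore'
PREFIX = 'finance-vix/'

# A single list_objects call returns at most 1000 keys, so any keys beyond
# the first 1000 under the prefix never get their ACL changed.
response = s3.list_objects(Bucket=BUCKET, Prefix=PREFIX)
for obj in response.get('Contents', []):
    s3.put_object_acl(Bucket=BUCKET, Key=obj['Key'], ACL='public-read')
```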

The Client.list_objects() method in Boto3, which we use to list the objects in a specific folder on S3, returns at most 1000 items per request. While the method accepts a MaxKeys parameter, it only has an effect for values below 1000 and still returns 1000 items if set to e.g. 2000. To get the rest of them we have to set the Marker parameter and request another chunk. From their docs:

Solution

So the quick fix for this would be to keep requesting items from S3, using the IsTruncated flag in the response to know when there are more keys to fetch.
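
A minimal sketch of such a pagination loop, assuming boto3 and the same illustrative bucket/prefix as above (not the actual pipeline code):

```python
import boto3

s3 = boto3.client('s3')

# Hypothetical values for illustration.
BUCKET = 'packagestore'
PREFIX = 'finance-vix/'

def list_all_keys(bucket, prefix):
    """Collect every key under the prefix, following IsTruncated/Marker pagination."""
    keys = []
    marker = ''
    while True:
        response = s3.list_objects(Bucket=bucket, Prefix=prefix, Marker=marker)
        contents = response.get('Contents', [])
        keys.extend(obj['Key'] for obj in contents)
        if not response.get('IsTruncated') or not contents:
            break
        # Without a Delimiter the response carries no NextMarker, so the last
        # key of the current page is used as the Marker for the next request.
        marker = contents[-1]['Key']
    return keys

for key in list_all_keys(BUCKET, PREFIX):
    s3.put_object_acl(Bucket=BUCKET, Key=key, ACL='public-read')
```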

Possible problem

On the other hand, as the number of items grows, processing the dataset will take longer and longer (even though change_acl itself does not take much time).

zelima commented 6 years ago

FIXED