While processing some datapackages, some of the data ends up public and some private, which is very strange behavior, since a single processor, change_acl, is applied to all keys prefixed {owner}/{dataset} in the packagestore.
Acceptance Criteria
[x] All files have expected permissions
Tasks
[x] Do an analysis
[x] Fix the bug
Analysis
Reason
Looking at the logs of the processing flow of finance-vix, I see that when the permission-changing processor runs at the end (at the beginning of the process all files are private), it lists all the keys in the bucket with the prefix finance-vix -> meaning it re-changes the ACL every time a new revision comes in.
The Boto3 Client.list_objects() method, which we use to list the objects under a specific prefix on S3, returns at most 1000 items per request. The method does accept a MaxKeys parameter, but it only has an effect for values below 1000; setting it to e.g. 2000 still returns 1000 items. To get the rest, we have to set the Marker parameter and request another chunk. From their docs:
Solution
So the quick fix for this would be to keep requesting pages from S3 until the IsTruncated flag in the response is no longer set, so that we collect all the items rather than only the first 1000.
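A minimal sketch of that loop, assuming a boto3 S3 client is passed in (the function name `list_all_keys` and the way the result feeds into change_acl are illustrative, not the actual processor code). With list_objects (the v1 API) and no Delimiter, the Marker for the next page is the last key of the current page:

```python
def list_all_keys(s3_client, bucket, prefix):
    """Collect every key under `prefix`, following S3's 1000-item pages.

    `s3_client` is assumed to be a boto3 S3 client (or any object with a
    compatible `list_objects` method).
    """
    keys = []
    marker = ""
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if marker:
            kwargs["Marker"] = marker
        response = s3_client.list_objects(**kwargs)
        contents = response.get("Contents", [])
        keys.extend(obj["Key"] for obj in contents)
        # IsTruncated tells us whether S3 has more pages for this prefix.
        if not response.get("IsTruncated") or not contents:
            break
        # Without a Delimiter, the next page starts after the last key
        # returned in this one.
        marker = contents[-1]["Key"]
    return keys
```

Boto3 also ships built-in paginators (`client.get_paginator("list_objects")`) that implement the same Marker/IsTruncated handling, which may be a cleaner long-term option than a hand-rolled loop.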
Possible problem
On the other hand, as the number of items grows, processing the dataset will take longer and longer (even though change_acl itself does not take much time per key).
Originally coming from https://github.com/datahq/datahub-qa/issues/235