frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

Scalability issues #247

Closed paulgirard closed 4 years ago

paulgirard commented 5 years ago

See the related PR in tableschema-py https://github.com/frictionlessdata/tableschema-py/pull/254

I have scalability issues when trying to validate a data package which contains one resource with 397,201 lines containing foreign keys. I needed to split my 397k-line resource into many files to organize them by source. I finally chose to use a shared schema and the group notion to split it into many resources. Trying to validate all those lines revealed many scalability issues in both the tableschema and datapackage libraries.

The first issue is about memory management. Checking relations in a resource makes the object not only read the related resource but also hold its data in memory, since it is kept as an object attribute. As a consequence, memory grows when checking relations across a large number of resources. I don't know why the relations data is kept on the object, so in this PR I proposed a new method, drop_relations, to clean it up.
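To illustrate the memory issue and the proposed fix, here is a minimal sketch using a simplified stand-in Resource class. This is not the real datapackage-py implementation; only the method names mirror the proposal, and the data/validation helpers are placeholders.

```python
# Simplified stand-in for the described behavior (NOT the real library code).
class Resource:
    def __init__(self):
        self._relations = None  # cache of the related resources' data

    def check_relations(self):
        # Reading related resources caches their rows on the object...
        if self._relations is None:
            self._relations = self._read_relations()
        # ...so memory grows as relations are checked across many resources.
        return self._validate(self._relations)

    def drop_relations(self):
        # Proposed cleanup: release the cached data so it can be
        # garbage-collected before moving on to the next resource.
        self._relations = None
        return self._relations is None

    def _read_relations(self):
        return {"people": [{"id": 1}, {"id": 2}]}  # stand-in data

    def _validate(self, relations):
        return bool(relations)  # stand-in validation
```

Calling drop_relations after each resource's check keeps peak memory bounded by a single resource's relations instead of the whole package's.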

Once the memory issue was solved, I hit a performance one. Checking the relations of my 1000+ resources holding 397k lines of data took 98m49.895s. That's very long.

So I thought about two optimizations. The first one is to avoid loading the relations for each resource that belongs to a group. To enable this optimization I propose adding a check_relations method to the Group object. That way we load the relations data once and then use this in-memory data to validate all the resources belonging to that group.
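The group-level idea can be sketched as follows. The class and method names follow the proposal and the stand-in loading/validation logic is illustrative, not the released datapackage-py API.

```python
# Hedged sketch: load relations once per group instead of once per resource.
class Resource:
    loads = 0  # count how many times the relations data is read

    def read_relations(self):
        Resource.loads += 1  # each call simulates an expensive load
        return {"people": {1, 2, 3}}  # stand-in related data

    def validate(self, relations):
        return 1 in relations["people"]  # stand-in row check


class Group:
    def __init__(self, resources):
        self.resources = resources

    def check_relations(self):
        # Load the relations data a single time for the whole group...
        relations = self.resources[0].read_relations()
        # ...then reuse it for every resource (N loads become 1).
        return all(res.validate(relations) for res in self.resources)
```

For a group of 1000 resources sharing a schema, this turns 1000 loads of the same related data into one.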

Then a second optimization has been proposed for tableschema-py in the PR https://github.com/frictionlessdata/tableschema-py/pull/254. The idea is to pre-index the relations data by the values of the foreign keys. This index is called foreign_keys_values. It is then used to test whether a row references one of the existing values (a simple hash-map lookup).

To speed things up further, I also propose exposing the get_foreign_keys_values() method so that the Group object can pre-compute the index only once before using it to validate all the resources.
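Both ideas together can be sketched like this: build the foreign_keys_values index once, then validate every resource in the group against it with O(1) membership tests. The function names echo the proposal but are written here as free-standing helpers, not the library's actual signatures.

```python
# Hedged sketch of pre-indexing foreign-key values (illustrative names).
def get_foreign_keys_values(related_rows, field):
    # Index the related resource by the referenced field: set membership
    # turns each per-row check into a hash lookup instead of a scan.
    return {row[field] for row in related_rows}


def check_relations(rows, field, foreign_keys_values):
    # Validate every row of a resource against the precomputed index.
    return all(row[field] in foreign_keys_values for row in rows)


# The group computes the index once, then shares it across resources.
related = [{"id": i} for i in range(100)]
index = get_foreign_keys_values(related, "id")
group_resources = [[{"id": 5}], [{"id": 42}, {"id": 7}]]
all_valid = all(check_relations(rows, "id", index) for rows in group_resources)
```

Without the index, every row check scans the related rows (O(rows × related)); with it, validation is roughly linear in the number of rows, which matches the order-of-magnitude speedup reported below.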

Using those optimizations made the validation process drop from 98m49.895s to 1m3.609s.

I've just realized that this PR misses updates to the documentation. Let's see if reviewers agree with the principles before updating the docs.

@roll what do you think ?

paulgirard commented 5 years ago

Oh yeah, of course: since this PR is based on an update of the tableschema-py dependency, CI fails. This PR can't be valid before the tableschema one is accepted and released...

paulgirard commented 4 years ago

@roll I just removed my try/catch to let exceptions flow upstream. Maybe there is a policy to raise datapackage exceptions rather than tableschema ones?