Closed paulgirard closed 5 years ago
@paulgirard Thanks! I'm going to review it this week. It looks like a promising optimization.
@paulgirard Very smart. Thanks!
Do you mind adding the methods/options documentation? It would be great if the algorithm were described somewhere in the readme (e.g. as a note paragraph for the new table.index_* method).
Also, please ensure that datapackage-py's tests pass locally with this update (I guess you already did =)
@roll Yes, I'll write the documentation now that you have validated it.
And yes, I ran the datapackage-py tests both with and without the version proposed in the datapackage PR.
@roll I edited the README. Should be good now.
@paulgirard: Really nice! That basically saved my day: validating a CSV file with ~1.5 million lines containing foreign keys is now easily possible. Before, processing didn't even finish overnight. So thanks, also to @roll for releasing this so quickly. Looking forward to the related PR in datapackage :-)
Very nice to hear, @hjoukl! Glad it's useful. As for the datapackage integration, I have to add tests and clean up some mess before it can be merged. I am on it.
I have scalability issues when trying to validate a datapackage which contains one resource with 397,201 lines containing foreign keys. I needed to split my 397k-line resource into many files to organize the lines by source. I finally chose to use a shared schema and the notion of groups to split it into many resources. Trying to validate all those lines brought up many scalability issues in both the tableschema and datapackage libraries.
In this PR I propose a new way to do check_relations which is far more efficient while keeping exactly the same behaviour. The first idea is to pre-index the relations data by the values of the foreign keys. This index is called foreign_keys_values. It is then used to test whether a row references one of the existing values (a simple hash map lookup).
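To illustrate the pre-indexing idea, here is a minimal sketch in plain Python. It is not the code from this PR and does not use the tableschema API: the helper names `build_foreign_key_index` and `references_existing_value` and the sample data are made up for the example, and it assumes rows are plain dicts.

```python
def build_foreign_key_index(referenced_rows, referenced_fields):
    """Collect every tuple of key values found in the referenced table.

    The result is a set, so a membership test is a hash lookup instead of
    a scan over all referenced rows.
    """
    return {
        tuple(row[name] for name in referenced_fields)
        for row in referenced_rows
    }


def references_existing_value(row, fields, index):
    """Return True if the row's foreign key values appear in the index."""
    return tuple(row[name] for name in fields) in index


# Hypothetical referenced table and two rows pointing at it.
authors = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]
index = build_foreign_key_index(authors, ["id"])

ok = references_existing_value(
    {"title": "Notes", "author_id": 2}, ["author_id"], index
)
broken = references_existing_value(
    {"title": "Ghost", "author_id": 9}, ["author_id"], index
)
```

The index is built once in a single pass over the referenced rows, and each row is then checked with one lookup, rather than being compared against every referenced row.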
There is another optimization. One can pre-compute the foreign_keys_values and pass it to the read/iter methods, so that the same index can be reused for many resources sharing the same schema. In my personal case, my 397k lines are split into more than 1000 resources. This optimization is enabled by exposing the new method index_foreign_keys_values(self, relations) and by adding one optional argument to the read and iter methods.
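The reuse described here can be sketched the same way. Again this is only an illustration under assumptions, not the API added by this PR: `index_key_values`, `find_broken_rows` and the sample resources are hypothetical names, and rows are assumed to be plain dicts.

```python
def index_key_values(referenced_rows, referenced_fields):
    """Build the lookup set once from the referenced table."""
    return {
        tuple(row[name] for name in referenced_fields)
        for row in referenced_rows
    }


def find_broken_rows(rows, fields, index):
    """Return the rows whose foreign key values are missing from the index."""
    return [
        row for row in rows
        if tuple(row[name] for name in fields) not in index
    ]


# Hypothetical referenced table, plus several resources sharing one schema.
cities = [{"id": 1, "name": "Paris"}, {"id": 2, "name": "Lyon"}]
resources = {
    "source_a": [{"city_id": 1}, {"city_id": 2}],
    "source_b": [{"city_id": 3}],
}

# The index is computed a single time, outside the loop over resources...
shared_index = index_key_values(cities, ["id"])

# ...and the same object is passed to the check for every resource.
report = {
    name: find_broken_rows(rows, ["city_id"], shared_index)
    for name, rows in resources.items()
}
```

With more than 1000 resources pointing at the same referenced table, building the index inside the loop would repeat the same work for each resource, which is the cost the optional argument is meant to avoid.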
I've just realized that this PR is missing the documentation updates. Let's see whether reviewers agree with the principles before updating the docs.
This PR is linked to another one I am about to create in datapackage-py.
@roll what do you think?