Crunch-io / scrunch

Pythonic scripting library for cleaning data in Crunch
GNU Lesser General Public License v3.0

Improvements to join speed #414

Closed: jamesrkg closed this issue 2 years ago

jamesrkg commented 3 years ago

This code is terribly inefficient because every variable to be joined is instantiated only to get its URL:

https://github.com/Crunch-io/scrunch/blob/7361a858c89fc12f7a15bd06d920c981798ad28d/scrunch/mutable_dataset.py#L102-L105

This change takes a join over a list of >2000 columns from tens of minutes down to about 4 seconds:

        # add the individual variable columns to the payload
        _self = right_ds.resource.self
        _variables = right_ds.resource.variables.by('alias')
        var_urls = [
            '{}variables/{}/'.format(_self, _variables[alias]['id'])
            for alias in columns
        ]
        for var_url in var_urls:
            payload['body']['args'][0]['map'][var_url] = {'variable': var_url}
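
The same idea can be tried outside scrunch with plain dictionaries. This is only a minimal sketch: `build_variable_map`, the example dataset URL, and the stand-in `variables_by_alias` data are hypothetical, and it assumes (as the snippet above implies) that the alias index maps each alias to an entry with an `'id'` key. The point is that each URL is built by string formatting from an index fetched once, so no variable object is instantiated per column.

```python
def build_variable_map(dataset_url, variables_by_alias, columns):
    """Build the {url: {'variable': url}} map for the given column aliases.

    dataset_url is assumed to end with a trailing slash, as in the snippet above.
    """
    var_map = {}
    for alias in columns:
        # Dictionary lookup only; nothing is fetched or instantiated per variable.
        var_id = variables_by_alias[alias]['id']
        var_url = '{}variables/{}/'.format(dataset_url, var_id)
        var_map[var_url] = {'variable': var_url}
    return var_map


# Hypothetical stand-ins for right_ds.resource.self and
# right_ds.resource.variables.by('alias').
dataset_url = 'https://example.invalid/api/datasets/abc123/'
variables_by_alias = {
    'age': {'id': '0001'},
    'gender': {'id': '0002'},
    'region': {'id': '0003'},
}

var_map = build_variable_map(dataset_url, variables_by_alias, ['age', 'gender'])
```

An alias missing from the index raises `KeyError` here, which surfaces a bad column name before any request is sent.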

jjdelc commented 2 years ago

We have a PR for this fix as well: https://github.com/Crunch-io/scrunch/pull/420. It will go out in the next release as soon as possible.

jjdelc commented 2 years ago

Released in v0.10.0