data_length makes less sense when data is a nested dictionary rather than a json string

getredash / redash

data_length makes less sense when data is a nested dictionary rather than a json string #7030

Open zachliu opened 3 weeks ago

zachliu commented 3 weeks ago

Issue Summary

Before PR https://github.com/getredash/redash/pull/6687, query runners returned their data as JSON strings, so the data_length computed with len(data) was meaningful:

https://github.com/getredash/redash/blob/60a12e906efb8f7948fdbe5e013249b8b0c0089a/redash/tasks/queries/execution.py#L194-L200

After https://github.com/getredash/redash/pull/6687, however, data is a nested dictionary, and len(data) only returns the number of top-level keys. In most cases there are only two keys, "columns" and "rows", so data_length no longer tells us anything useful about the size of the result.
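A minimal illustration of the difference, assuming a made-up result in the "columns"/"rows" shape described above (the column name and row values are hypothetical, not taken from Redash):

```python
import json

# Hypothetical query result: one column and 1000 rows.
data = {
    "columns": [{"name": "id", "type": "integer"}],
    "rows": [{"id": i} for i in range(1000)],
}

# Old behavior: data was a JSON string, so len() grew with the result size.
json_length = len(json.dumps(data))

# New behavior: data is a dict, so len() counts only the top-level keys.
dict_length = len(data)

print(json_length, dict_length)
```

However many rows the result holds, dict_length stays at the number of top-level keys, while json_length scales with the content.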

Steps to Reproduce

Search for data_length= in your logs.

Technical details:

Redash Version: 24.06.0-dev

zachliu commented 2 weeks ago

I replaced len(data) with

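import sys
from collections import deque

def _get_size_iterative(dict_obj):
    """Iteratively finds size of objects in bytes"""
    seen = set()  # ids of objects already counted, so shared objects count once
    size = 0
    objects = deque([dict_obj])

    # Breadth-first walk over everything reachable from dict_obj
    while objects:
        current = objects.popleft()
        if id(current) in seen:
            continue
        seen.add(id(current))
        # getsizeof is shallow: it excludes the objects a container refers to
        size += sys.getsizeof(current)

        if isinstance(current, dict):
            objects.extend(current.keys())
            objects.extend(current.values())
        elif hasattr(current, '__dict__'):
            objects.append(current.__dict__)
        elif hasattr(current, '__iter__') and not isinstance(current, (str, bytes, bytearray)):
            # Other containers (list, tuple, set, ...); strings are not descended into
            objects.extend(current)

    return size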

It works fine. The in-memory size of the dictionary is usually much larger than the on-disk size of the same data (for example, as a CSV file) because of Python's per-object memory overhead, but it at least gives a relative value. That is especially informative for me because I use data_length in a DataDog dashboard to monitor the sizes of users' query results.
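To give a rough feel for that overhead, here is a sketch that compares a deep in-memory size with the size of the same rows written as CSV. The helper deep_size is a simplified stand-in for the function above (it only descends into dicts, lists, tuples and sets), and the sample rows are hypothetical:

```python
import csv
import io
import sys
from collections import deque


def deep_size(obj):
    """Sum sys.getsizeof over obj and everything reachable from it."""
    seen = set()
    total = 0
    pending = deque([obj])
    while pending:
        current = pending.popleft()
        if id(current) in seen:
            continue  # count shared objects only once
        seen.add(id(current))
        total += sys.getsizeof(current)
        if isinstance(current, dict):
            pending.extend(current.keys())
            pending.extend(current.values())
        elif isinstance(current, (list, tuple, set, frozenset)):
            pending.extend(current)
    return total


# Hypothetical result set in the "columns"/"rows" shape.
rows = [{"id": i, "name": f"user{i}"} for i in range(1000)]
data = {
    "columns": [{"name": "id"}, {"name": "name"}],
    "rows": rows,
}

# Serialize the same rows as CSV in memory to measure the on-disk size.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = len(buffer.getvalue().encode("utf-8"))

memory_bytes = deep_size(data)
print(memory_bytes, csv_bytes)
```

The exact numbers depend on the Python version and platform, so the in-memory figure is best read as a relative indicator rather than as a storage size.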