ImSoErgodic / py-upset

A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al.

Severe performance degradation when working with large datasets #7

Open mproffitt opened 8 years ago

mproffitt commented 8 years ago

When plotting data sets with 2m or more rows, the performance of PyUpset is severely degraded.

During testing, each data set has 2 columns: the first is a text field containing a name, used as the joining column; the second is a numeric value.

dataset 1: 2813027 rows
dataset 2: 2814442 rows
dataset 3: 2814467 rows
dataset 4: 2816085 rows

These datasets were generated from a full outer join on 4 other datasets, which took approximately 1m 30s to create. By contrast, pyupset took more than 26 hours counting indexes before being killed.

The issue seems to come down to the method extract_intersection_data (visualisation.py lines 564 - 574), which currently uses pd.Index to check for combinations.

Unless I've misunderstood the comments at the bottom of the file, these lines are already marked for refactoring to use a merge? Do you have any details of how you were planning this merge (left, right, inner, or outer), and when such an amendment might be available?
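
For what it's worth, here is a toy sketch of how the different merge types map onto intersection extraction (the frames and column names are illustrative, not pyupset's internals):

import pandas as pd

# Two toy frames standing in for the datasets above; 'name' is the joining column.
df_a = pd.DataFrame({'name': ['g1', 'g2', 'g3'], 'value_a': [1.0, 2.0, 3.0]})
df_b = pd.DataFrame({'name': ['g2', 'g3', 'g4'], 'value_b': [5.0, 6.0, 7.0]})

# An inner merge keeps only names present in both frames: the plain intersection.
inner = pd.merge(df_a, df_b, on='name', how='inner')
print(len(inner))  # 2

# An outer merge keeps every name; exclusive membership can then be read off
# from which value columns are NaN in each row.
outer = pd.merge(df_a, df_b, on='name', how='outer')
only_a = outer[outer['value_a'].notna() & outer['value_b'].isna()]
print(len(only_a))  # 1 -> names in df_a but not in df_b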

Memory warning

It should be noted that even for medium-sized datasets a merge will likely run into memory issues. For example, the full outer join on the 4 datasets (details below) used to produce the row counts above peaked at more than 64GB of memory across 12 columns, with the resulting 5-column dataframe occupying 21GB.

In the counts below, each dataset had 3 columns, name | start | end

dataset 1: 48405 rows
dataset 2: 337848 rows
dataset 3: 50307 rows
dataset 4: 14260 rows

Resulting dataframe = name + (start * 4) columns x 2816085 rows == 22121.8MB
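
(For anyone reproducing these figures, the real footprint of a dataframe can be checked directly; deep=True is needed so the Python string objects in the name column are counted. The helper below is only a convenience sketch.)

import pandas as pd

def frame_size_gib(df: pd.DataFrame) -> float:
    """In-memory size of a dataframe in GiB, including object (string) columns."""
    return df.memory_usage(deep=True).sum() / 2**30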

With the caveat that memory is not an issue, one way to overcome the performance degradation is to create an outer merge for each set combination and then query for the rows where the non-unique columns are not NaN; a sketch of this follows below.
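
A rough sketch of that workaround, assuming every dataframe in a combination shares the joining column name and the value columns already carry distinct names per dataset (all identifiers here are illustrative):

from functools import reduce
import pandas as pd

def combination_size(frames, key='name'):
    """Outer-merge the frames belonging to one set combination on the joining
    column, then count the rows in which every value column is populated,
    i.e. the rows present in all of the combination's sets."""
    merged = reduce(lambda left, right: pd.merge(left, right, on=key, how='outer'), frames)
    value_cols = [col for col in merged.columns if col != key]
    return int(merged[value_cols].notna().all(axis=1).sum())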

Thoughts?

ImSoErgodic commented 7 years ago

Hi @mproffitt, thank you very much for pointing out the issue, and apologies for coming back to you after such a delay.

That performance problem has been known to me for a while, as you guessed, and it is only for lack of time that I haven't fixed it immediately. I didn't know, however, what the magnitude of the bottleneck was, and from what you're saying it is evident that I should start working on it ASAP. The problem there is that index creation is a costly operation. Off the top of my head I can't remember how pandas implements the merge, but it is quite possible that in most cases it will be significantly faster than creating a bunch of indices on all columns.

This also leads me to suggest, if you haven't already done so: specifying the join column explicitly to pyupset, or indeed passing it minimal dataframes with all and only the information to intersect, helps improve performance.
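
Something along these lines is what I have in mind; raw_frames, name and value below are placeholders for your own dict of full dataframes and column names:

import pyupset as pyu  # assuming the import alias used elsewhere in this thread

# Hand pyupset only the join column plus the values that actually need intersecting.
minimal = {label: df[['name', 'value']] for label, df in raw_frames.items()}
pyu.plot(minimal, unique_keys=['name'])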

As for the memory issues, using pandas and dataframes as the underlying representation makes it inevitable that some operations will be slow and hit memory limits. This prompted me to have a quick look at what alternatives exist for the fast computation of set intersections. Most of what I saw refers to the intersection of 2 sets, which in pyupset's case may be problematic, as we need to compute all intersections. It may well be, on the other hand, that even doing many set intersections in sequence ends up being faster than relying on indices or merges, in which case I'll change the underlying representation of the sets. I need to do some experiments though.
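
For concreteness, the set-based alternative I would experiment with looks roughly like this, where key_sets is assumed to be a plain dict mapping each dataset name to the set of its join-column values (a sketch, not what pyupset currently does):

from itertools import combinations

def all_intersection_sizes(key_sets):
    """key_sets: {dataset_name: set_of_join_values}.
    Returns the size of the common intersection for every combination of 2+ sets."""
    names = list(key_sets)
    sizes = {}
    for r in range(2, len(names) + 1):
        for combo in combinations(names, r):
            common = set.intersection(*(key_sets[n] for n in combo))
            sizes[combo] = len(common)
    return sizes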

I think one would still need to deal with the original dataframes throughout the whole process though, as things like subset-highlighting in graphs require accessing all the initial information jointly.

mproffitt commented 7 years ago

Hi @ImSoErgodic, thank you for getting back on this issue and no problem about the delay.

I think it's a case of swings and roundabouts as to whether a merge is faster than creating a bunch of indexes. On small datasets indexes may be faster, but as the datasets become larger, merging definitely offers a performance gain until you hit the memory ceiling.

The code below shows how I'm calling the library, with datasets being a dict of restricted dataframes (containing the gene_id and start columns only).

if self._graph is not None:
    frame = pyu.plot(
        datasets,
        unique_keys=[dataset.group_by],
        additional_plots=[
            {
                'kind':'scatter',
                'data_quantities':{'x': self._graph.xdata, 'y': self._graph.ydata},
                'graph_properties':{'alpha': .8, 'lw': .4, 'edgecolor': 'w', 's': 50}
            },
            {
                'kind':'hist',
                'data_quantities':{'x': self._graph.xdata},
                'graph_properties':{'bins': 50}
            }
        ],
        query=in_sets
    )
else:
    frame = pyu.plot(
        datasets,
        unique_keys=[dataset.group_by],
        query=in_sets
    )

The following details a bit more about what I'm doing prior to calling the plot, as I feel there is scope within this to drastically improve performance by offering a path around the bottleneck.

Each original dataframe starts out with 19 columns * n rows, out of which I initially select only 3 columns, one of which is unique.

Once selected, the 4 resulting datasets are merged into a single dataframe containing 5 columns:

gene_id start_file1 start_file2 start_file3 start_file4
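
Roughly, that merge step looks like the following in my pipeline (a trimmed-down sketch; selected stands for a dict of the four original dataframes keyed by file label and is not a py-upset variable):

from functools import reduce
import pandas as pd

# selected: {'file1': full_df1, ..., 'file4': full_df4} -- the 19-column originals
frames = [
    df[['gene_id', 'start']].rename(columns={'start': 'start_%s' % label})
    for label, df in selected.items()
]
merged = reduce(lambda left, right: pd.merge(left, right, on='gene_id', how='outer'), frames)
# merged columns: gene_id, start_file1, start_file2, start_file3, start_file4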

Within my sample set, this gives me the 22GiB dataframe I mentioned earlier. In this instance memory isn't too much of a problem: I'm building on a server with 128GiB of RAM and pandas completes the merge in 65GiB. However, before I can pass it in to py-upset, the single dataframe needs splitting again (fan in, fan out), only to be re-joined inside py-upset on gene_id as the unique key.

At this point it becomes very costly, and even attempting to carry out a second merge cost over 128GiB of RAM.

One possible approach to solving this (it wouldn't solve all of the memory issues, but it would solve some of the performance issues) would be to allow the previously merged dataframe to be passed in directly, in place of ExtractedData.df_dict, along with the column names to use as in_sets / out_sets. This would effectively provide the opportunity to move the problem earlier in the stack, where it may be easier to handle, or where the merged data can be re-used in other ways.

Performance can then be improved by working column-wise across a single dataframe rather than set-wise across multiple dataframes. This can be achieved by looping over the combinations of set names and selecting a row count from the merged dataframe, along the lines of:

# merged: the single pre-merged dataframe; in_sets / out_sets: lists of column names
mask = merged[in_sets].notna().all(axis=1) & merged[out_sets].isna().all(axis=1)
combination = (mask.sum(), in_sets)
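
Wrapped in a loop over every split of the start columns into in / out sets, the whole counting step would look something like the sketch below (start_cols being the list of per-file start columns; the names are illustrative rather than an existing py-upset API):

from itertools import combinations

def combination_counts(merged, start_cols):
    """Row counts per exclusive set combination, computed column-wise on the
    single merged dataframe rather than by re-merging per combination."""
    counts = {}
    for r in range(1, len(start_cols) + 1):
        for in_sets in combinations(start_cols, r):
            out_sets = [c for c in start_cols if c not in in_sets]
            mask = merged[list(in_sets)].notna().all(axis=1)
            if out_sets:
                mask &= merged[out_sets].isna().all(axis=1)
            counts[in_sets] = int(mask.sum())
    return counts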

Of course, the capability of merging / indexing would still need to exist inside py-upset for those instances where a user doesn't wish to carry out a full merge ahead of the call to plot, but the onus would then be put back on the user to handle memory exceptions as they occur, rather than risking an index build that takes several days, as is presently the case.

Looking at your concern regarding subset-highlighting, again I think this can be achieved without relying on the full datasets in most instances where a single merge table is used, by treating the non-unique column names as the dataset names, or, again, by allowing the set names to be specified via an API call or as a dictionary element. E.g.

pyu.plot({'dictionary': dataset, 'names': ['file1', 'file2', 'file3', 'file4']}, unique_keys=[column_id_or_name], ... )

Regards, Martin

waqarali141 commented 6 years ago

Has there been any fix for the performance issue? I need to plot data from data sources containing almost 200M records or more.

Is there any workaround for this? @mproffitt, @ImSoErgodic

zaisahil commented 6 years ago

It looks like no one is interested in actively updating this repository.