cytoscape / py4cytoscape

Python library for calling Cytoscape Automation via CyREST
https://Py4Cytoscape.readthedocs.io

Bottleneck in `load_table_data` #93

Closed ConsueTerra closed 2 years ago

ConsueTerra commented 2 years ago

Hello, I am working on a very large network and I noticed a serious bottleneck at runtime when loading data onto the edge table.

Here is some example code I ran. I fetch alignment_score from the edge table, take the exponential, and add the column back in as exp_score. On a table of 200-400k rows the runtime is not too bad (maybe 5-15 min), but beyond 1M rows it really slows down; in this example, on a table of 1.35M rows, my code had been running for 158 min in a Jupyter notebook before I gave up and interrupted the kernel.

import py4cytoscape as p4c

# Fetch the edge table, compute 10^(-alignment_score), and load it back as exp_score.
edge_df = p4c.get_table_columns(table='edge')

exp_scores = 10 ** (-1 * edge_df['alignment_score']).to_frame()
exp_scores['SUID'] = edge_df['SUID']
exp_scores = exp_scores.rename(columns={'alignment_score': 'exp_score'})
p4c.load_table_data(exp_scores, table='edge', table_key_column='SUID', data_key_column='SUID')

The kernel halted on this filter, and I suspect that on a table of ~1M rows it spent over 2 h just trying to filter.

https://github.com/cytoscape/py4cytoscape/blob/cc3186c722f2789460b90547105d7ffa87d0460f/py4cytoscape/tables.py#L461-L462

However, considering that the filter essentially compares two Series, wouldn't it be faster to use pd.Series.isin(values) instead? I haven't run extensive runtime tests, but running exp_scores['SUID'].isin(exp_scores.reset_index()['SUID']).value_counts() returned in less than a second for the same table.

bdemchak commented 2 years ago

Hi, Consuelo --

First, thanks for using py4cytoscape ... it's great to have users like you push the edges. We'll try to work this out for you ASAP.

I think your diagnosis is pretty close. It's been a while since this code was written, so I have to take a long look at it. Here's what I see:

We assume that table_key_column is the name of a column in the table, and isn't None or a list of columns. [We really need to verify this property.]

Given that, table_keys is a NumPy ndarray, not a pandas Series. After that, the "filter" code runs in time proportional to len(table_keys) x len(data[data_key_column]), which is terrible if both lengths are around 10^6. That cost grows quadratically as the table grows, which is a big no-no ... with apologies.

I agree that using Series.isin() instead could give a very good speed improvement.

I wonder whether converting both table_keys and data[data_key_column] to set() and then performing an intersection would produce a great result, too.

Thinking about this, it seems likely that .isin() is doing just that, but in optimized native code, which would be better than building two Python sets and performing a set() intersection.
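
As a rough illustration of the difference (a sketch only, not the actual code in tables.py; table_keys and data_keys here are made-up stand-ins for the arrays involved):

import pandas as pd

# Illustrative sketch only -- not the code in tables.py.
table_keys = pd.Series(range(1_000_000))            # keys already in the Cytoscape table
data_keys = pd.Series(range(500_000, 1_500_000))    # keys in the DataFrame being loaded

# Quadratic pattern: every data key is scanned against every table key.
# slow_mask = [k in table_keys.values for k in data_keys]   # ~len(data) * len(table) comparisons

# Vectorized pattern: hash-based membership test inside pandas.
fast_mask = data_keys.isin(table_keys)              # roughly len(data) + len(table) work
print(fast_mask.value_counts())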

I note that the "found" pattern exists in multiple places in py4cytoscape, so this is a good opportunity to make improvements in several functions.

I'll be glad to do this in the next few days in the 1.6.0 branch, which isn't released yet.

What do you think??

bdemchak commented 2 years ago

Hi --

I have updated the load_table_data() function to use Series.isin() as suggested. This seems to reduce running times for large networks substantially. See dbc0d6f3596b91e8890b221ace971513e80714d1

Would you mind trying this out in Branch 1.6.0 to see how it goes? This is pre-release code, and suggestions are appreciated.

Thanks!

ConsueTerra commented 2 years ago

Hello, it took a while to regenerate my dataset to test on, but after checking out the branch and testing, the results look promising. Runtime on a table with 1.2M rows dropped from 2h+ to 47.3 seconds. An interesting thing to note: get_table_columns, when called to fetch only table_key_column, takes 10.7 seconds; when getting the whole table of 8 columns, the runtime is 1 min 26.5 s.

bdemchak commented 2 years ago

Thanks for this feedback ...

Good news on the 1.2M runtime ... We'll call that a success (with apologies).

Regarding get_table_columns(), the runtimes make sense. If one column takes 11 seconds, then 8 columns would take 88 seconds (more or less) and that's what you're reporting. There is a little bit of startup overhead, but then each column is fetched from Cytoscape and added to the returned DataFrame. I suspect the time for each column is dominated by adding it to the DataFrame, which is done element-by-element using a loop and the .at[] operator. (This has got to be the slowest way of doing this.)
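
For illustration only, the element-by-element pattern described above looks roughly like this (a sketch, not the actual get_table_columns() source; suid_list and col_values are made-up stand-ins):

import pandas as pd

# Illustrative sketch of the per-cell assembly pattern, not the library code.
suid_list = [101, 102, 103]
col_values = {101: 0.52, 102: 0.91, 103: 0.17}

df = pd.DataFrame(index=suid_list, columns=['alignment_score'])
for suid in suid_list:
    # one .at[] assignment per cell -- cheap per call, but ~10^6 calls add up
    df.at[suid, 'alignment_score'] = col_values[suid]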

This mirrors the original R implementation. However, I suspect that the isin() lesson applies here, too ... this might go a lot faster if we fetch each column into a DataFrame and then merge the column values by using the .merge() function.
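
A hedged sketch of that merge idea, assuming each fetched column arrives as a dict keyed by SUID (the names here are illustrative, not the library's internals):

import pandas as pd

# Sketch of the merge-based idea, not the library implementation.
# Assumes each fetched column arrives as {SUID: value}.
def add_column(df, col_name, col_values):
    col_df = pd.DataFrame({'SUID': list(col_values.keys()),
                           col_name: list(col_values.values())})
    # one vectorized merge per column instead of one .at[] call per cell
    return df.merge(col_df, on='SUID', how='left')

# e.g. edge_df = add_column(edge_df, 'alignment_score', fetched_values)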

Also plausible: it turns out that the entire column set can be fetched into a DataFrame by using the CyREST /v1/networks/{netid}/tables/{tabletype} function ... it returns an array of rows (one dict for each node or edge) that can easily be converted to a DataFrame via a pandas from_dict() operation. From there, the unwanted columns could easily be dropped. Of course, this might chew up a ton of memory ... maybe.
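
A minimal sketch of that alternative, assuming Cytoscape is listening locally on the default port, that 'defaultedge' is the table type, and a placeholder network SUID (the exact URL pieces and JSON shape should be checked against the CyREST docs):

import pandas as pd
import requests

# Sketch only: pull the whole edge table in one CyREST call and build a DataFrame.
netid = 52          # placeholder network SUID
url = f'http://localhost:1234/v1/networks/{netid}/tables/defaultedge'
payload = requests.get(url).json()

# Per the description above, the response contains one dict per edge; in some
# CyREST versions the row list is wrapped, e.g. under a 'rows' key.
rows = payload['rows'] if isinstance(payload, dict) else payload
edge_df = pd.DataFrame.from_records(rows)
edge_df = edge_df[['SUID', 'alignment_score']]    # drop unwanted columns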

I don't have a 1M network, but would be willing to experiment with a few approaches if this timing is important to you. I'd need to rely on you to give me timings once I have something that works.

Are you interested?

bdemchak commented 2 years ago

Hi, Consuelo --

FYI, I changed Branch 1.6.0 just now ... it now does get_table_columns() much faster.

If you have time, would you mind trying your large network to see if you can measure better execution times?

Thanks.