cytoscape / py4cytoscape

Python library for calling Cytoscape Automation via CyREST
https://Py4Cytoscape.readthedocs.io
Other
69 stars 15 forks source link

Remove n^2 loops #94

Open bdemchak opened 2 years ago

bdemchak commented 2 years ago

There are a number of functions that contain checks for a node or edge list being contained within a network. For very large networks and very large lists, this becomes an n^2 operation. I recently verified that for a million node network, the check in load_table_data() took longer than 2 days (before I gave up).

The general pattern is to create a comprehension and then test whether True or False is in the comprehension.

For example:

    test_present = [x in all_names   for x in node_suids]
    if not False in test_present:
        return node_suids

These can be found in:

    py4cytoscape_utils:node_suid_to_node_name()
    py4cytoscape_utils:edge_suid_to_edge_name()
    py4cytoscape_utils:_item_to_suid()
GeneCodeSavvy commented 6 days ago

I was considering an approach where we first convert the data into a dictionary for SUID to name mappings, and store all names in a set. The dictionary would provide O(1) average-time lookups for SUIDs, and the set would also offer O(1) average-time lookups for names. After setting up these structures, we could process the input list by checking whether each item is a SUID or a name, thereby minimizing the need for repeated linear searches. This approach would allow us to process the input list in O(M) time, where M is the number of SUIDs. Should I create a PR to the branch 1.10?

Example: node_suid_to_node_name() Screenshot 2024-09-07 at 13 00 23

Reference: https://wiki.python.org/moin/TimeComplexity

bdemchak commented 3 days ago

@GeneCodeSavvy

Hi, Harsh ...

This looks plausible ... good work.

To test this, you'd need to do three things:

1) Create a very large test case that makes the original code look slow, and then show that your new code has substantial improvement.

2) Execute the test_py4cytoscape_utils:test_node_suid_to_node_name() unit test (or all of the test_py4cytoscape_utils tests) to verify sanity and reasonable edge cases. To execute tests, start by looking over deep testing.

The quick and easy test would be ./runalltests.bat test_py4cytoscape_utils.py. Remember to have started Cytoscape before running the test.

3) Execute the entire test suite (./runalltests.bat). This will take a LONG time, and a few errors might pop out, especially around the apps tests. To get a feel for "normal" errors, you might run the suite before making your changes, and then after.

Feel free to ask questions as they arise. You're on a good track.

GeneCodeSavvy commented 2 days ago

@bdemchak I'll get started on it right away.

GeneCodeSavvy commented 1 day ago

@bdemchak Hi Barry, I was having some minor issues with test cases. I have figured out the problem. Sorry for not updating you about this sooner. I will finish up and ensure everything works as expected, one last time. Thank you for your patience!

bdemchak commented 1 day ago

@GeneCodeSavvy Thanks, Harsh ... I'm the only one that ever runs those tests, so it's easy to believe that there would be issues when run by someone else. Just curious ... what did you run into? Maybe we can smooth some rough spots over?

GeneCodeSavvy commented 1 day ago

@bdemchak Apologies for the confusion. The issue wasn't with the tests themselves. The optimized py4cytoscape_utils:node_suid_to_node_name() was failing the following test:

suids_with_name = suids.copy()
suids_with_name.append('YBR043C')
self.assertRaises(CyError, node_name_to_node_suid, suids_with_name)

In hindsight, this was straightforward, but it took me longer than expected to resolve.

GeneCodeSavvy commented 1 day ago

image

I created a very large DataFrame with 1 million nodes, each having a unique SUID and randomly generated name. I also generated test lists of varying sizes (1,000, 10,000, 100,000) by sampling random SUIDs from the DataFrame. The optimized version completed in under 0.24 seconds for the largest test size, while the original version took over 1,145 seconds (nearly 20 minutes) for the same size.

Screenshot 2024-09-13 at 05 08 47

Screenshot 2024-09-13 at 05 02 29

Screenshot 2024-09-13 at 05 03 47

As for the issue, I was encountering - I add a two checks are_all_suids = all(isinstance(item, int) for item in node_suids) and are_all_names = all(isinstance(item, str) for item in node_suids). Got this idea from the normalize_list() function.

Screenshot 2024-09-13 at 05 04 36

bdemchak commented 4 hours ago

@GeneCodeSavvy

Thanks, Harsh ... this is promising ... I like the performance and you have good ideas.

Can you package this as a pull request into the 1.9.0 branch (not master)?

It should have two parts:

You'll want to insert your large dataframe tests into the applicable unit tests (e.g., test_node_suid_to_node_name(), test_node_name_to_node_suid(), test_edge_suid_to_edge_name(), test_edge_name_to_edge_suid())

The test changes aren't so interesting and don't contribute to the actual py4cytoscape library ... but they serve two critical functions. First, to keep us honest that we've actually tested all of the cases. Second, to give us confidence that future library modifications don't break existing py4cytoscape functionality. You've already seen the benefit of this when you discovered that your original code was missing a check for a mixed list of SUIDs and names.

Since this is your first time through this process, I'd expect that we'll likely iterate on the pull request before it's suitable for integration.

If we can get to that point, we'll include it in the 1.10.0 release, which can happen right after the pull request is in good shape.

OK?

(Well done!)