string protein query gives errors when there are too many proteins

mehmetdirenc commented 1 month ago

Hi, I have around 3.5k proteins that i want to query from py4cytoscape. When I limit my input to 500 proteins it works fine however when I dont, I get URI Too Long error. Is there a way to get around this issue? Here's the code I use below. Thanks for any help!

exp_str = ','.join(str(v) for v in filtered_df["gene_id"])
string_cmd_list = ['string protein query','query="',exp_str,'"', 'species="Mus musculus"', 'cutoff=0.4']
string_cmd = " ".join(string_cmd_list)
p4c.commands.commands_run(string_cmd)
p4c.create_subnetwork(edges='all', subnetwork_name='%s pynetwork'%exp_name)

In commands_get(): <h1>Bad Message 414</h1><pre>reason: URI Too Long</pre>

AlexanderPico commented 1 month ago

@scootermorris What's the recommendation for use cases like this?

bdemchak commented 1 month ago

@mehmetdirenc Thanks for your question. It appears that this error may be coming from the String application. We have asked the String maintainer/guru for information. We'll get back to you as soon as we can.

mehmetdirenc commented 1 month ago

Thank you both for the swift replies!

bdemchak commented 2 weeks ago

@mehmetdirenc Hi ... I have tried this with Cytoscape 3.10.2, stringApp v2.1.1 and py4cytoscape 1.8.0. I don't have exactly the network you're working with, but I have no problem passing 500 terms to "string protein query" and then getting subnetworks. Can you give me a self-contained example so I can investigate further? Thanks.

GeneCodeSavvy commented 2 weeks ago

Hi, I have around 3.5k proteins that i want to query from py4cytoscape. When I limit my input to 500 proteins it works fine however when I dont, I get URI Too Long error.

@mehmetdirenc is working with around 3.5k proteins that he wants to query using py4cytoscape. When he limits his input to 500 proteins, the query works fine. However, when he tries to query all 3.5k proteins at once, he encounters a "URI Too Long" error. I am also working with a similar number of proteins and face the same issue. You can easily replicate this error querying more than 500 proteins.

bdemchak commented 2 weeks ago

@GeneCodeSavvy Hi ... sorry to be unclear. I can certainly investigate the py4cytoscape and CyREST links, but I have no expertise with String, This use case seems legitimate and important, and I'm happy to pursue it ... however, I can't generate the network myself. Can you generate the network and a self-contained test script? If you can get me that far, I can take it from there. Thanks.

bdemchak commented 2 weeks ago

@mehmetdirenc @GeneCodeSavvy

Hello ... I have worked on this and can report the following:

The reason this is failing is that commands_run()uses an HTTP GET, which imposes a limit on the number of characters in a command/parameter list. Depending on the browser, the limit could be 2048, 8192 or something else of the same order. That's why the URI Too Long message is being returned. So, Stringnever even saw this command ... the browser stopped it.

A good workaround is to use commands_post(), which accepts the same parameter, but has no limit on the command size.

The drawback is that commands_post() returns the SUIDof the new network (e.g.,{'SUID': 856235}) whereas commands_run()returns something more pleasing to the eye (e.g., ["Loaded network 'STRING network - abc' with 11 nodes and 42 edges"]).

In truth, the commands_post() SUIDmay be more useful to your Python script, as it allows you to directly access the resulting network. If you want to access the network produced by commands_run(), you'd have to parse out the 'STRING network - abc' and then use it instead of an SUID.

Either commands_post() or commands_run()leave the new network as the "current" network, so the return value may be moot in any practical application.

Can you try commands_post()instead of commands_run() and let me know how this works?

cc: @scootermorris @AlexanderPico @yihangx

GeneCodeSavvy commented 1 week ago

Hello @bdemchak ,

I have successfully tested the workaround by using p4c.commands.commands_post(string_protein_query_cmd), which resolves the issue with URI length limits.

Given that commands_post() does not have the same limitations as commands_run() and provides a more reliable way to handle large queries, could we consider incorporating this logic directly into the library? Automating this fallback mechanism could enhance user experience and streamline operations for cases where large commands are involved.

bdemchak commented 1 week ago

@mehmetdirenc @GeneCodeSavvy

Great, Harsh ... thanks for the report.

Your suggestion is very reasonable and I'd be taking it if this were early in the py4cytoscape development.

The values returned by HTTP GET (i.e., commands_run()) and HTTP POST (i.e., commands_post()) are very different (e.g., ["Loaded network 'STRING network - abc' with 11 nodes and 42 edges"] vs {'SUID': 856235}). So, simply changing commands_run() to use HTTP POST would solve the immediate problem, but would significantly alter the return value.

It's pretty easy to argue that no application depends on the commands_run() return value, so changing it would cause no harm and have only benefits.

I can't make that argument, though, because py4cytoscape has been out for a couple of years now, and I don't know how commands_run() is actually being used, or whose application would break if this change were to be made.

For a more rigorous discussion of this, our project follows semantic versioning, which instructs us that any breaking change to the API would require us to bump the major version (i.e., 2.0.0 instead of 1.10.0) as a warning to everyone that their py4cytoscape application may break. For such a small change, a major version bump would be hard to justify.

Instead, I propose both a documentation improvement and improvement to the exception message you reported. The idea would be to steer a commands_run() victim to commands_post() early and with minimal time hit.

As it turns out, we're near the end of the v1.10.0 checkout, and such a change would be easy to make.

Agree?

cc: @scootermorris @AlexanderPico @yihangx

GeneCodeSavvy commented 1 week ago

Hi @bdemchak,

Thank you for your thoughtful response! I completely agree with your approach. Enhancing the exception message and improving the documentation to guide users towards commands_post() when encountering a "URI Too Long" error seems like a practical solution, especially given the potential impact on existing applications.

I'd be happy to work on this enhancement if that works for you. My plan is to add an if...else statement in the commands_get() function's try...except block, like this:

except requests.exceptions.RequestException as e:
    # Check if the error is due to the URI being too long
    if e.response.status_code == 414:
        # Provide a more informative error message
        raise CyError(f"URI Too Long: The command you attempted to execute is too large to be handled via GET request. "
                      f"Consider switching to commands_post() for larger queries.")
    else:
        _handle_error(e, force_cy_error=True)

Additionally, I'm happy to assist with updating the documentation. If you could provide some guidance on how to get started with that, it would be greatly appreciated.

Thank you.

bdemchak commented 1 week ago

@GeneCodeSavvy Thanks! I have already gotten this coded and with test suites on my end. I'll check it in tomorrow.

I took a slightly different and more focused approach. The idea is to avoid messing up any caller that might depend on a result. As I was working on this, I realized that this would apply even to changes relating to the URI Too Long error.

So, I (of course) changed the documentation. I also changed commands_run() so that it issues a "narrate" for this case. A "narrate" is output to the console that doesn't affect what the Python function returns. So, if you're in a Notebook, you'll see the suggestion ... and any code that relies on the old behavior will still work.

Your suggestion about having this check in commands_get() is sounding better and better to me, though ... I might relocate the narrate as you suggest. Thanks for the idea!

GeneCodeSavvy commented 1 week ago

@bdemchak

Thank you for acknowledging my suggestion! I’m glad to hear the update is coming together smoothly.

I'd love to assist with any future updates or ideas you might have. This library is fantastic, and being able to contribute will be a rewarding experience for me. Thanks again for the opportunity!

Best regards, Harsh

bdemchak commented 1 week ago

@GeneCodeSavvy Hi, Harsh ... this change has been committed in branch 1.10.0 ... care to take a look to see what I did?? Take note that the test suite is very substantial, and provides are large part of the value of this package. I'd say that as a portion of the entire project, the py4cytoscape code is about 20%, the documentation is another 20%, and the tests are about 60%. This is fairly typical of a mature, well-executed project, and py4cytoscape strives to be that.

GeneCodeSavvy commented 1 week ago

@bdemchak

Hi Bary, Thank you for adding me to the contributors list; I’m truly honored. I’ve reviewed the changes in branch 1.10.0, particularly the addition of the base_url parameter within various functions, which enhances overall functionality. The documentation updates and tests for handling outsize commands_run and commands_get requests further solidify the library's reliability.

I’m curious the expected timeline for merging and releasing this branch as an updated library version. It would be exciting to see these improvements go live.

Thanks again for the opportunity to contribute!

Best regards, Harsh

bdemchak commented 1 week ago

Hi, Harsh --

I'm glad you got a look at this ... it seems like you have no problem with the changes.

If so, I'll go live with it today or tomorrow. v1.10.0 has had lots of time to cure.

I'm happy to see you work on this project, though I think it's fairly mature. I coordinate py4cytoscape changes with the RCy3 group so we can keep our APIs in sync and take best advantage of what we learn.

I'm happy to introduce you to the RCy3 group. Can you tell me more about yourself? Particularly, what your biological and computational interests and experiences are.

Thanks!

BTW, it's OK to shift this conversation to e-mail ... mine is bdemchak@ucsd.edu

GeneCodeSavvy commented 1 week ago

@bdemchak, Thank you for the message! I'll follow up via email shortly (hsharma.biotech@gmail.com).

cytoscape / py4cytoscape

string protein query gives errors when there are too many proteins #137