Reed-CompBio / spras

Signaling Pathway Reconstruction Analysis Streamliner (SPRAS)
MIT License

Omics Integrator 2 Testing Error #133

Open ntalluri opened 11 months ago

ntalluri commented 11 months ago

When I was writing the tests for parse outputs, I kept getting an error on this line of the code for oi2: df = df[df['in_solution'] == True] # Check whether this column can be empty before revising this line

I was using the output from the oi2 test suite, and parsing failed because that output was missing the 'in_solution' column. However, the 'in_solution' column does appear in the oi2 raw pathways when SPRAS is run over the datasets.

There might be an issue with the test suite or merely the result. I haven't checked it out, but it's something we should look at because it's not a current test.
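A minimal sketch of how the failure could be guarded against, assuming the raw pathway has been loaded into a pandas DataFrame. The helper name `filter_in_solution` is hypothetical and not part of SPRAS, the toy rows are made up for illustration, and returning an empty frame when the column is absent is just one possible policy rather than a confirmed fix.

```python
import pandas as pd


def filter_in_solution(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows flagged as part of the solution.

    Hypothetical helper, not SPRAS code. If the 'in_solution' column is
    absent, return an empty frame with the same columns instead of letting
    df['in_solution'] raise a KeyError.
    """
    if 'in_solution' not in df.columns:
        return df.iloc[0:0]
    return df[df['in_solution'] == True]  # noqa: E712


# Toy data for illustration only
complete = pd.DataFrame({
    'protein1': ['B', 'B'],
    'protein2': ['A', 'C'],
    'cost': [0.52, 0.73],
    'in_solution': [True, False],
})
missing = complete.drop(columns=['in_solution'])

kept = filter_in_solution(complete)
empty = filter_in_solution(missing)
```

With this guard, a raw pathway without the column yields an empty result rather than an exception, which may or may not be the behavior we want in the tests.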

agitter commented 10 months ago

This may not be a problem with the Omics Integrator 2 container or run function but rather have to do with its input data and parameters. I modified the test function to use the parameters (b=4, g=0) and input data from the workflow. I was able to get a complete output file with the in_solution column when running the test_oi2_required test:

protein1    protein2    cost    in_solution
B   A   0.52    True
B   C   0.73    True

I'll need to run more tests or read through its source code to understand why this column is sometimes missing.

agitter commented 10 months ago

I pushed the branch oi2-misssing-col with my initial test dataset changes and explorations and some documentation updates.

I had started looking at where Omics Integrator 2 sets in_solution in its source code, but didn't find anything that helps explain this behavior: https://github.com/agitter/OmicsIntegrator2/blob/command-line/src/graph.py#L335

I was testing pytest -k test_oi2_required.

ntalluri commented 1 week ago

When I was grid searching the parameters for the EGFR dataset, oi2 kept breaking due to this error. For parameter tuning, it will be helpful to find a solution for this issue soon.

ntalluri commented 1 week ago

Examples of headers from the raw pathway file

protein1 protein2 cost

protein1 protein2

protein1 protein2 in_solution cost

protein1 protein2 cost in_solution

ntalluri commented 1 week ago

quick fix, if the first 3 column headers are there, then write an empty file and say it is corrupted

then try to figure what the code is doing
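One way the quick fix could look, as a sketch only. It interprets the idea as "if the full header is not present, write an empty pathway file and warn". The function name `write_pathway`, the tab delimiter, the required header set, and the two-column output format are all assumptions for illustration, not the SPRAS implementation.

```python
import csv
import tempfile
import warnings
from pathlib import Path

# Assumed full header of a complete raw pathway file
REQUIRED = {'protein1', 'protein2', 'cost', 'in_solution'}


def write_pathway(raw_file: Path, out_file: Path) -> bool:
    """Write in-solution edges to out_file; return False if raw_file is incomplete."""
    with open(raw_file, newline='') as f:
        reader = csv.DictReader(f, delimiter='\t')
        header = set(reader.fieldnames or [])
        rows = list(reader) if REQUIRED <= header else None

    with open(out_file, 'w', newline='') as f:
        if rows is None:
            # Leave the output empty and flag the raw file as incomplete
            missing = sorted(REQUIRED - header)
            warnings.warn(f'{raw_file} is missing columns {missing}; writing empty pathway')
            return False
        writer = csv.writer(f, delimiter='\t', lineterminator='\n')
        for row in rows:
            if row['in_solution'] == 'True':
                writer.writerow([row['protein1'], row['protein2']])
    return True


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, 'good.tsv')
    good.write_text('protein1\tprotein2\tcost\tin_solution\nB\tA\t0.52\tTrue\nB\tC\t0.73\tTrue\n')
    bad = Path(tmp, 'bad.tsv')
    bad.write_text('protein1\tprotein2\n')
    out_good, out_bad = Path(tmp, 'good_out.txt'), Path(tmp, 'bad_out.txt')
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        ok_good = write_pathway(good, out_good)
        ok_bad = write_pathway(bad, out_bad)
    good_text = out_good.read_text()
    bad_text = out_bad.read_text()
```

Returning a boolean lets the caller decide whether an incomplete raw pathway should only be logged or should fail the run.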

agitter commented 1 week ago

I resumed investigating the in_solution column. All of the following is speculative and needs to be tested and confirmed.

Omics Integrator 2 adds that information here using nx.set_edge_attributes. We use version 2.1 of networkx so the source code is here. My initial read suggests the graph edge attributes are dicts. Maybe the order of the attributes is not guaranteed? That could explain why cost and in_solution are sometimes swapped in order. We can work to confirm this idea.

In addition, if Omics Integrator 2 adds in_solution only for forest edges, that suggests the edge attribute is not added when the forest is empty. We can also test that behavior. My initial testing in the branch above may support that theory.

None of this explains why cost may be missing sometimes. Does this only happen when the raw pathway is empty?

ntalluri commented 3 days ago

https://github.com/Reed-CompBio/spras/pull/182 Pull request for error

ntalluri commented 2 days ago

The cost is always missing when the raw pathway is empty.

ntalluri commented 17 hours ago

The in_solution and cost swapping might also be related to the Python version. In Python 3.7 and newer, dictionaries are guaranteed to preserve insertion order. In Python 3.6, insertion order is preserved only as a CPython implementation detail and is not guaranteed by the language specification. So there is a chance that the column order could change under Python 3.6.
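Whatever the cause of the swapped order turns out to be, a parser that looks columns up by name should be unaffected by it. The sketch below illustrates that with the standard library only. The function `read_edges` and the tab-delimited sample strings are illustrative assumptions, not SPRAS code.

```python
import csv
import io


def read_edges(text: str) -> list:
    """Return (protein1, protein2, cost, in_solution) tuples regardless of column order."""
    reader = csv.DictReader(io.StringIO(text), delimiter='\t')
    return [
        (row['protein1'], row['protein2'], float(row['cost']), row['in_solution'] == 'True')
        for row in reader
    ]


# The same edge written with the two header orders seen in the raw pathway files
cost_first = 'protein1\tprotein2\tcost\tin_solution\nB\tA\t0.52\tTrue\n'
flag_first = 'protein1\tprotein2\tin_solution\tcost\nB\tA\tTrue\t0.52\n'

same = read_edges(cost_first) == read_edges(flag_first)
```

Both header orders parse to the same records here, so the swap by itself would only matter to code that reads columns by position.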

ntalluri commented 17 hours ago

Based on my understanding, if the forest is empty, the code will attempt to iterate through the edges, but since there are none, no exception will be raised, and the function will simply end without making any changes. Therefore, in_solution will never be added as an edge attribute.
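A small experiment that could be used to check this reasoning in isolation, assuming networkx is installed. It does not call Omics Integrator 2. The helper `mark_forest` is hypothetical and only mimics the hypothesised "set in_solution per forest edge" update through nx.set_edge_attributes with a dict keyed by edge.

```python
import networkx as nx

# Toy interactome with only a cost attribute on each edge
graph = nx.Graph()
graph.add_edge('B', 'A', cost=0.52)
graph.add_edge('B', 'C', cost=0.73)


def mark_forest(g, forest_edges):
    """Flag only the given edges, then return the attribute names present on any edge."""
    nx.set_edge_attributes(g, values={edge: True for edge in forest_edges}, name='in_solution')
    return sorted({key for _, _, data in g.edges(data=True) for key in data})


# Empty forest: nothing to iterate over, so no edge gains in_solution
attrs_empty_forest = mark_forest(graph.copy(), [])
# Non-empty forest: the attribute appears on the flagged edge
attrs_with_forest = mark_forest(graph.copy(), [('B', 'A')])
```

If the real code follows the same pattern, an empty forest would leave every edge without in_solution, and an edge list exported from that graph would have no such column.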