Open ntalluri opened 1 month ago
The problem is that the pathways generated by mincostflow contain duplicate edges in the pathway.txt file. This occurs in many of the mincostflow output pathways. It appears to involve both the mincostflow code and the ml code.
Here is an example of one of the pathway.txt files with duplicate edges:

    Node1        Node2        Rank  Direction
    EGFR_HUMAN   EGF_HUMAN    1     U
    S10A4_HUMAN  EGF_HUMAN    1     U
    HDAC6_HUMAN  EGF_HUMAN    1     U
    HS90A_HUMAN  HDAC6_HUMAN  1     U
    KS6A3_HUMAN  SRC_HUMAN    1     U
    SRC_HUMAN    EMD_HUMAN    1     U
    FYN_HUMAN    KS6A3_HUMAN  1     U
    CBL_HUMAN    EGFR_HUMAN   1     U
    MYH9_HUMAN   S10A4_HUMAN  1     U
    EGFR_HUMAN   EGF_HUMAN    1     U
    LMNA_HUMAN   EGF_HUMAN    1     U
    S10A4_HUMAN  EGF_HUMAN    1     U
    HDAC6_HUMAN  EGF_HUMAN    1     U
    GRB2_HUMAN   EGF_HUMAN    1     U
    HS90A_HUMAN  HDAC6_HUMAN  1     U
    CBL_HUMAN    GRB2_HUMAN   1     U
    CBL_HUMAN    EGFR_HUMAN   1     U
    MYH9_HUMAN   S10A4_HUMAN  1     U
    EMD_HUMAN    LMNA_HUMAN   1     U
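
For anyone who wants to check their own output files, here is a minimal sketch of a duplicate-edge check. It is not part of SPRAS. The function name `find_duplicate_edges` is hypothetical, and it assumes whitespace-separated columns in the order shown above and that edges marked `U` should be compared without regard to node order.

```python
from collections import Counter


def find_duplicate_edges(lines):
    """Return a dict of edges that occur more than once, with their counts.

    Assumption: rows have four whitespace-separated fields
    (Node1, Node2, Rank, Direction), and 'U' edges are unordered.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] == "Node1":
            continue  # skip the header and any malformed rows
        node1, node2, _rank, direction = fields
        if direction == "U":
            # Undirected: A-B and B-A count as the same edge.
            key = (tuple(sorted((node1, node2))), direction)
        else:
            key = ((node1, node2), direction)
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


# A few rows taken from the example pathway above.
sample = [
    "Node1 Node2 Rank Direction",
    "EGFR_HUMAN EGF_HUMAN 1 U",
    "S10A4_HUMAN EGF_HUMAN 1 U",
    "CBL_HUMAN EGFR_HUMAN 1 U",
    "EGFR_HUMAN EGF_HUMAN 1 U",
]
duplicates = find_duplicate_edges(sample)
print(duplicates)
```

To scan a real file, pass an open file handle instead of the `sample` list.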
This PR contains a test case that reproduces the error and shows how it arises: https://github.com/Reed-CompBio/spras/pull/191
My fix is to remove duplicate edges from each dataframe as it is created, before concatenating them together:

    for tup in edge_tuples:
        dataframe = pd.DataFrame(
            {
                str(tup[0]): 1,
            }, index=tup[1]
        )
        # drop duplicates index code
        edge_dataframes.append(dataframe)
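
One way the placeholder comment in the snippet above could be filled in is pandas' `Index.duplicated`. This is a sketch under assumptions, not necessarily the code that went into the PR. The contents of `edge_tuples` below are made-up stand-ins shaped like (label, list of edge identifiers).

```python
import pandas as pd

# Hypothetical input mirroring the shape used in the snippet above:
# tup[0] is a column label, tup[1] is the list of edge identifiers.
edge_tuples = [
    ("mincostflow-run1", ["A|B", "B|C", "A|B"]),  # "A|B" appears twice
    ("mincostflow-run2", ["A|B", "C|D"]),
]

edge_dataframes = []
for tup in edge_tuples:
    dataframe = pd.DataFrame({str(tup[0]): 1}, index=tup[1])
    # Keep only the first row for each repeated index label.
    dataframe = dataframe[~dataframe.index.duplicated(keep="first")]
    edge_dataframes.append(dataframe)

# With unique indices, column-wise concatenation can align rows by edge.
summary = pd.concat(edge_dataframes, axis=1, join="outer").fillna(0).astype(int)
print(summary)
```

Without the `duplicated` filter, the first dataframe would have a non-unique index. Pandas would then be unable to reindex it during the column-wise `pd.concat`.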
In the end, this was not a SPRAS problem, other than the fact that the code wasn't robust to duplicate indices in the dataframes. The true problem lies in the mincostflow implementation.
I encountered an InvalidIndexError while running the ml code during individual runs of parameter sweeps on mincostflow on the EGFR dataset. The issue happens in the summarize_networks function and is related to reindexing: non-unique index values cause problems with pandas concatenation.
Error Trace: