Reed-CompBio / spras

Signaling Pathway Reconstruction Analysis Streamliner (SPRAS)
MIT License

InvalidIndexError when running ml code #188

Open ntalluri opened 1 month ago

ntalluri commented 1 month ago

I encountered an InvalidIndexError while running the ML code during individual runs of a mincostflow parameter sweep on the EGFR dataset. The error is raised in the summarize_networks function when pandas concatenates the per-algorithm dataframes: pd.concat has to reindex them, and reindexing fails when an index contains non-unique values.

    # initially construct separate dataframes per algorithm
    edge_dataframes = []
    # the dataframe is set up per algorithm and a 1 is set for the edge pair that exists in the algorithm
    for tup in edge_tuples:
        dataframe = pd.DataFrame(
            {
                str(tup[0]): 1,
            }, index=tup[1]
        )
        edge_dataframes.append(dataframe)

    # concatenating all the algorithm-specific dataframes together
    # (0 is set for all the edge pairs that don't exist per algorithm)
    concated_df = pd.concat(edge_dataframes, axis=1, join='outer')
    concated_df = concated_df.fillna(0)
    concated_df = concated_df.astype('int64')

Error Trace:

RuleException:
InvalidIndexError in file /Users/nehatalluri/Desktop/research/spras/Snakefile, line 315:
Reindexing only valid with uniquely valued Index objects
  File "/Users/nehatalluri/Desktop/research/spras/Snakefile", line 315, in __rule_ml_analysis
  File "/Users/nehatalluri/Desktop/research/spras/spras/analysis/ml.py", line 85, in summarize_networks
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/util/_decorators.py", line 331, in wrapper
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 381, in concat
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 612, in get_result
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3904, in get_indexer
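
The failure can be reproduced outside of SPRAS. The following is a minimal sketch, not code from the repository: the edge labels and column names are made up for illustration. It builds two dataframes the same way as the snippet above, gives one of them a repeated index value, and concatenates them column-wise.

```python
import pandas as pd

# Hypothetical edge labels; the first frame repeats "A|B" in its index.
first = pd.DataFrame({"run-1": 1}, index=["A|B", "C|D", "A|B"])
second = pd.DataFrame({"run-2": 1}, index=["A|B", "E|F"])

try:
    # The two indexes differ, so pandas must align (reindex) both frames
    # on their union, which requires each index to be unique.
    pd.concat([first, second], axis=1, join="outer")
    error_message = None
except pd.errors.InvalidIndexError as err:
    error_message = str(err)

print(error_message)
```

If every frame had exactly the same index, no realignment would be needed, which may be why the problem only shows up for some combinations of outputs.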

ntalluri commented 4 weeks ago

The error occurs because the pathways generated by mincostflow contain duplicate edges in the pathway.txt file. Many of the mincostflow output pathways are affected. This appears to be a problem in the mincostflow code as well as in the ML code.

Here is an example of one of the pathway.txt files with duplicate edges:

Node1        Node2        Rank  Direction
EGFR_HUMAN   EGF_HUMAN    1     U
S10A4_HUMAN  EGF_HUMAN    1     U
HDAC6_HUMAN  EGF_HUMAN    1     U
HS90A_HUMAN  HDAC6_HUMAN  1     U
KS6A3_HUMAN  SRC_HUMAN    1     U
SRC_HUMAN    EMD_HUMAN    1     U
FYN_HUMAN    KS6A3_HUMAN  1     U
CBL_HUMAN    EGFR_HUMAN   1     U
MYH9_HUMAN   S10A4_HUMAN  1     U
EGFR_HUMAN   EGF_HUMAN    1     U
LMNA_HUMAN   EGF_HUMAN    1     U
S10A4_HUMAN  EGF_HUMAN    1     U
HDAC6_HUMAN  EGF_HUMAN    1     U
GRB2_HUMAN   EGF_HUMAN    1     U
HS90A_HUMAN  HDAC6_HUMAN  1     U
CBL_HUMAN    GRB2_HUMAN   1     U
CBL_HUMAN    EGFR_HUMAN   1     U
MYH9_HUMAN   S10A4_HUMAN  1     U
EMD_HUMAN    LMNA_HUMAN   1     U
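
Repeated edges like these can be spotted programmatically. The sketch below is only an illustration: it parses a shortened, whitespace-separated copy of the file above from an in-memory string (the real pathway.txt may use a different delimiter) and flags rows whose Node1/Node2 pair has already appeared.

```python
import io

import pandas as pd

# Shortened copy of the example pathway; whitespace-separated for illustration.
pathway_text = """Node1 Node2 Rank Direction
EGFR_HUMAN EGF_HUMAN 1 U
S10A4_HUMAN EGF_HUMAN 1 U
HDAC6_HUMAN EGF_HUMAN 1 U
EGFR_HUMAN EGF_HUMAN 1 U
S10A4_HUMAN EGF_HUMAN 1 U
"""

pathway = pd.read_csv(io.StringIO(pathway_text), sep=r"\s+")

# Mark every row whose (Node1, Node2) pair was already seen earlier in the file.
is_repeat = pathway.duplicated(subset=["Node1", "Node2"], keep="first")
duplicate_rows = pathway[is_repeat]
unique_edges = pathway.drop_duplicates(subset=["Node1", "Node2"], keep="first")

print(f"{len(duplicate_rows)} duplicate rows, {len(unique_edges)} unique edges")
```

This check compares the node pair in the order written. For undirected edges (Direction U), a reversed pair would be the same edge and would need to be normalized first; that case is not handled here.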

ntalluri commented 4 weeks ago

PR https://github.com/Reed-CompBio/spras/pull/191 contains a test case that reproduces the error and shows how it arises.

ntalluri commented 4 weeks ago

My fix is to remove duplicate edges from each dataframe as it is created, before concatenating them:


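    for tup in edge_tuples:
        dataframe = pd.DataFrame(
            {
                str(tup[0]): 1,
            }, index=tup[1]
        )
        # drop rows whose index value (edge) repeats, keeping the first occurrence
        # (one possible way to do this; the merged fix may differ)
        dataframe = dataframe[~dataframe.index.duplicated(keep='first')]
        edge_dataframes.append(dataframe)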
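
To check that deduplicating the index is enough for the concatenation to succeed, here is a self-contained sketch. The function name build_edge_matrix, the run labels, and the "A|B" edge string format are assumptions for illustration; this is not the actual summarize_networks code, and Index.duplicated is just one way to drop the repeats.

```python
import pandas as pd


def build_edge_matrix(edge_tuples):
    """Build a 0/1 edge-by-run matrix from (run_name, edge_list) tuples."""
    edge_dataframes = []
    for name, edges in edge_tuples:
        dataframe = pd.DataFrame({str(name): 1}, index=edges)
        # Keep only the first occurrence of each edge so the index is unique.
        dataframe = dataframe[~dataframe.index.duplicated(keep="first")]
        edge_dataframes.append(dataframe)

    # An outer join leaves NaN where a run lacks an edge; convert those to 0.
    concated_df = pd.concat(edge_dataframes, axis=1, join="outer")
    return concated_df.fillna(0).astype("int64")


# Hypothetical input: the first run lists the same edge twice.
edge_tuples = [
    ("run-1", ["EGFR_HUMAN|EGF_HUMAN", "S10A4_HUMAN|EGF_HUMAN", "EGFR_HUMAN|EGF_HUMAN"]),
    ("run-2", ["EGFR_HUMAN|EGF_HUMAN", "CBL_HUMAN|GRB2_HUMAN"]),
]

matrix = build_edge_matrix(edge_tuples)
print(matrix)
```

Because each column only records whether an edge is present, dropping the repeated rows does not lose any information in the resulting matrix.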

ntalluri commented 3 weeks ago

In the end, this was not a SPRAS problem, apart from the fact that the code was not robust to duplicate indices in the dataframes. The true problem lies in the mincostflow implementation.