Wenhao-Jin / HydRA

A deep-learning model for predicting RNA-binding capacity from protein interaction association context and protein sequence

PPI_feature_generation get_PPI_features and get_PPI_feature_vec #2

Open StrohD opened 11 months ago

StrohD commented 11 months ago

Unfortunately, these functions are incredibly slow, which makes the tools hard to use with PPI and PPA data. Here is a proposed change that increases speed by orders of magnitude compared to the original implementation, by using Python sets and removing unnecessary computations and nested loops.

```python
import numpy as np


def get_PPI_features(prot, G, RBP_set, PPI_1stNBs=None, num_cut=5):
    # First-level neighborhood: either the supplied first neighbors or the
    # direct neighbors of the protein in the PPI graph.
    NBhood1_total = set(G.subgraph(PPI_1stNBs).nodes()) if PPI_1stNBs else set(G.neighbors(prot))
    NBhood2_total = set()
    NBhood3_total = set()

    for nb1 in NBhood1_total:
        NBhood2_total.update(G.neighbors(nb1))
    NBhood2_total = NBhood2_total - {prot}
    for nb2 in NBhood2_total:
        NBhood3_total.update(G.neighbors(nb2))
    NBhood3_total = NBhood3_total - NBhood2_total - NBhood1_total - {prot}

    # Count RBPs in each neighborhood; the max(0.01, ...) guards against
    # division by zero for empty neighborhoods.
    RBP1 = len(NBhood1_total.intersection(RBP_set))
    RBP2 = len(NBhood2_total.intersection(RBP_set))
    RBP3 = len(NBhood3_total.intersection(RBP_set))
    NB1 = max(0.01, len(NBhood1_total))
    NB2 = max(0.01, len(NBhood2_total))
    NB3 = max(0.01, len(NBhood3_total))

    return {'Protein_name': prot,
            'RBP_neighbor_counts': RBP1,
            '1st_neighbor_counts': NB1,
            'RBP_2nd_neighbor_counts': RBP2,
            '2nd_neighbor_counts': NB2,
            'RBP_3rd_neighbor_counts': RBP3,
            '3rd_neighbor_counts': NB3,
            'primary_RBP_ratio': RBP1 / NB1,
            'secondary_RBP_ratio': RBP2 / NB2,
            'tertiary_RBP_ratio': RBP3 / NB3,
            'RBP_flag': prot in RBP_set,
            'Reliability': (1 if NB1 > num_cut else -1)}


def get_PPI_feature_vec(prot, G, RBP_set, num_cut=5, PPI_1stNBs=None):
    print(prot)
    PPI_features = get_PPI_features(prot, G, RBP_set, PPI_1stNBs, num_cut)
    return np.array([PPI_features[k] for k in
                     ['primary_RBP_ratio', 'secondary_RBP_ratio',
                      'tertiary_RBP_ratio', 'Reliability']])
```
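
For reference, a minimal usage sketch (the toy graph and `RBP_set` below are made up for illustration, not from HydRA's data, and assume `networkx` is installed):

```python
import networkx as nx

# Hypothetical PPI graph: P1 interacts with P2 and P3, which both interact with P4.
G = nx.Graph()
G.add_edges_from([("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P4")])

RBP_set = {"P2", "P4"}  # proteins annotated as RNA-binding (assumed)

vec = get_PPI_feature_vec("P1", G, RBP_set, num_cut=5)
print(vec)  # [primary_RBP_ratio, secondary_RBP_ratio, tertiary_RBP_ratio, Reliability]
```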

Wenhao-Jin commented 11 months ago

Hi @StrohD , thanks for the suggestion, I appreciate it. Set operations will certainly speed up the program. However, one thing to note: set operations will produce slightly different output than the original code, which uses lists and therefore allows a protein to be counted multiple times. For example, if protein A is the protein of interest, with proteins B and C as its first-level neighbors and protein D a neighbor of both B and C, the original code counts D twice (increasing its weight) when scanning the second-level neighbors of A, whereas the set-based version counts it only once.
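
To make the difference concrete, here is a small sketch (not code from the repository; it assumes `networkx`) reproducing that A/B/C/D example:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])

prot = "A"
first_nbs = list(G.neighbors(prot))  # ['B', 'C']

# List-based aggregation (original behaviour): D appears once per
# first-level neighbor that links to it, so it is counted twice.
second_list = [nb2 for nb1 in first_nbs for nb2 in G.neighbors(nb1) if nb2 != prot]
print(second_list)  # ['D', 'D'] -> D contributes weight 2

# Set-based aggregation (proposed behaviour): D is counted only once.
second_set = {nb2 for nb1 in first_nbs for nb2 in G.neighbors(nb1)} - {prot}
print(second_set)   # {'D'} -> D contributes weight 1
```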

Therefore, if you plan to use the trained models we provide for prediction, I would recommend using the original code to generate the PPI features, to maintain consistency.

Also, if you intend to train your own PPI model (SONAR3.0) from scratch, you could definitely use the set version for the speedup. In our previous experiments, the set version did not hurt performance much (its AUC was only slightly lower than the original classifier's, though I don't recall the exact values). Feel free to try it if you are interested.

Once again, thanks for the suggestion! Further discussion is welcome!