Open alrichardbollans opened 11 months ago
Hi alrichardbollans, thank you for your message. It's exciting. Anyway, about the script suggestion, I'll work on it. For your points:
Thanks again, Ricardo
Hi Ricardo,
Thanks for your response
On point 3: yes, some more functions would be great :)
Also, in terms of comparing two sets of compounds: suppose you have a dataset (df1) of compounds and a reference set of bioactive compounds (df_active). One way I thought you could measure the similarity of compounds in df1 to compounds in df_active is, for each compound in df1, to find the similarity to the most similar compound in df_active. This measures the potential of each compound in df1 to act in a similar way to some compound in df_active, and gives a metric that can be used to compare various datasets (e.g. another set of compounds, df2). I did write some code for myself to do this, but it's a little messy and particular to my data -- I could tidy it up and send you a version if you'd like.
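A minimal sketch of this nearest-neighbour idea, assuming fingerprints are already computed as binary numpy arrays (the function and variable names here are illustrative, not part of the package):

```python
import numpy as np

def max_similarity_to_reference(fps_query: np.ndarray, fps_ref: np.ndarray) -> np.ndarray:
    """For each query fingerprint, return the Tanimoto similarity to its
    most similar compound in the reference set.

    fps_query: (n_query, n_bits) binary array, e.g. fingerprints of df1
    fps_ref:   (n_ref, n_bits) binary array, e.g. fingerprints of df_active
    """
    # Pairwise intersection counts via one matrix multiplication.
    inter = fps_query @ fps_ref.T                    # (n_query, n_ref)
    counts_q = fps_query.sum(axis=1, keepdims=True)  # (n_query, 1)
    counts_r = fps_ref.sum(axis=1, keepdims=True).T  # (1, n_ref)
    union = counts_q + counts_r - inter
    sim = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
    # Best match in the reference set for every query compound.
    return sim.max(axis=1)

# Toy fingerprints: the first query matches the first reference exactly,
# the second query's best match shares 1 of 3 total on-bits (1/3).
df1_fps = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
df_active_fps = np.array([[1, 1, 0, 0], [1, 0, 0, 1]])
scores = max_similarity_to_reference(df1_fps, df_active_fps)
print(scores)
```

Averaging (or otherwise summarising) the resulting scores then gives a single number per dataset that can be compared between, say, df1 and df2.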
I noticed the notebook tutorials mention that calculating the similarity takes time, and I have a suggestion for how this could be sped up, starting with a df. My example uses Tanimoto similarity rather than Dice, but the idea is the same. The work is also split into batches to avoid using too much memory, since I ran it with graphs containing ~50,000 compounds; on smaller datasets the batching may not be needed:
I also had a few other thoughts that would be good to get some insight into:
In this line you use `stack`; but in this line you also use `corr()`. This seems unnecessary, as you've already calculated the pairwise similarity between compounds, which is stored in a single column, and as far as I understand `corr()` calculates the correlation between pairs of columns, so it's not clear to me what this is doing.
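For context, a minimal illustration of what pandas `corr()` computes, and why calling it on a single already-computed similarity column seems redundant (the data here is made up):

```python
import pandas as pd

# corr() computes the Pearson correlation between every pair of columns,
# returning a square column-by-column matrix.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [3.0, 2.0, 1.0]})
corr_matrix = df.corr()
print(corr_matrix)  # 2x2 matrix; a and b are perfectly anti-correlated (-1.0)

# On a DataFrame with a single column of pairwise similarities, the result
# is a trivial 1x1 matrix (a column's correlation with itself is 1.0),
# which adds no information to the similarities themselves.
sim = pd.DataFrame({"similarity": [0.2, 0.5, 0.9]})
print(sim.corr())
```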