RicardoMBorges / DBsimilarity

A method proposed to help natural product researchers analyze chemical data obtained from databases.

Optimisations and thoughts #1

Open alrichardbollans opened 11 months ago

alrichardbollans commented 11 months ago

I noticed that the notebook tutorials mention that calculating the similarity takes time, and I have a suggestion for how this could be sped up, starting from a DataFrame. My example uses Tanimoto similarity rather than Dice, but the idea is the same. It is also split into batches to avoid using too much memory, since I ran it on graphs containing ~50,000 compounds; on smaller datasets the batching may not be needed:

import os

import pandas as pd
from rdkit import DataStructs
from rdkit.Chem import PandasTools
from rdkit.Chem.AllChem import GetMorganFingerprintAsBitVect
from tqdm import tqdm

# COMPOUND_ID_COL is the name of the identifier column in df (e.g. a simplified InChI)
def get_graph_data(df_with_fingerprints: pd.DataFrame, out_file: str, batchno: int, batchsize: int = 1000):
    # Calculate Tanimoto similarity for a given batch.
    # Batching avoids OOMing; each batch's output also gets smaller,
    # since each compound is only compared against those after it.
    sources, targets, similarities = [], [], []
    all_fingerprints = df_with_fingerprints['morgan_fingerprint'].values.tolist()
    for i in tqdm(range(batchsize)):  # tqdm gives a progress bar / time estimate
        batch_i = i + (batchsize * batchno)
        if batch_i <= (len(all_fingerprints) - 1):
            row = df_with_fingerprints.index[batch_i]
            fingerprint_1 = df_with_fingerprints.at[row, 'morgan_fingerprint']
            source_inchi_simp = df_with_fingerprints.at[row, COMPOUND_ID_COL]
            sims = DataStructs.BulkTanimotoSimilarity(fingerprint_1, all_fingerprints[batch_i + 1:])
            # Collect the ids and values
            for m, sim in enumerate(sims):
                target_row = df_with_fingerprints.index[batch_i + 1 + m]
                sources.append(source_inchi_simp)
                targets.append(df_with_fingerprints.at[target_row, COMPOUND_ID_COL])
                similarities.append(sim)
    if len(sources) > 0:
        simTable = pd.DataFrame(data={'SOURCE': sources, 'TARGET': targets, 'SIMILARITY': similarities})
        simTable = simTable.reset_index(drop=True)
        simTable.to_csv(out_file + str(batchno) + '.csv')
    else:
        print(f'No data for batch number: {batchno}')

PandasTools.AddMoleculeColumnToFrame(df, 'SMILES', 'Molecule', includeFingerprints=True)
# Produce a hashed Morgan fingerprint (radius 2) for each molecule
df['morgan_fingerprint'] = df['Molecule'].apply(lambda x: GetMorganFingerprintAsBitVect(x, 2))  # apply for readability
number_of_batches = (len(df) + 999) // 1000  # ceil(len(df) / batchsize)
for i in range(number_of_batches):
    get_graph_data(df, os.path.join(graph_data_folder, 'all_graph_data'), i)
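Once all batches have been written, the per-batch CSVs can be recombined into a single edge list. A minimal sketch, assuming the `all_graph_data{batchno}.csv` naming used above (`combine_batches` is a hypothetical helper, not part of DBsimilarity):

```python
import glob
import os

import pandas as pd

def combine_batches(graph_data_folder: str, prefix: str = 'all_graph_data') -> pd.DataFrame:
    """Concatenate the per-batch similarity CSVs into one edge list."""
    paths = sorted(glob.glob(os.path.join(graph_data_folder, prefix + '*.csv')))
    # index_col=0 drops the row index that to_csv wrote out for each batch
    frames = [pd.read_csv(p, index_col=0) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

Row order across batches does not matter here, since each row is an independent compound pair.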

I also had a few other thoughts that would be good to get some insight into:

  1. What is the motivation for using Dice similarity rather than Tanimoto?
  2. There is a line where, after making the simTable, you build a matrix with stack; but in this line you also call corr(). This seems unnecessary: you have already calculated the pairwise similarity between compounds, which is stored in a single column, and as far as I understand corr() calculates the correlation between pairs of columns, so it's not clear to me what it is doing here.
  3. As far as I understand, the methods you propose identify similarities between compounds in distinct databases, giving a metric that compares individual pairs of compounds. I am interested in whether you have any methods for quantifying how similar/related a single compound is to a whole dataset of compounds, and more generally how related one dataset of compounds is to another.
  4. Given that you're working with graphs, I wonder whether there are graph analysis/learning techniques that could leverage the knowledge contained in the graphs to make predictions, e.g. which compounds are likely to be found in which species, or which compounds are likely to have a particular bioactivity.
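On point 2, the long-form simTable can be pivoted straight into a square similarity matrix without corr(). A sketch assuming the SOURCE/TARGET/SIMILARITY columns produced above (`edge_list_to_matrix` is a hypothetical helper):

```python
import pandas as pd

def edge_list_to_matrix(sim_table: pd.DataFrame) -> pd.DataFrame:
    """Pivot a SOURCE/TARGET/SIMILARITY edge list into a symmetric matrix."""
    half = sim_table.pivot(index='SOURCE', columns='TARGET', values='SIMILARITY')
    # The edge list stores each unordered pair once, so mirror across the diagonal
    full = half.combine_first(half.T)
    ids = full.index.union(full.columns)
    full = full.reindex(index=ids, columns=ids)
    for cid in ids:
        full.loc[cid, cid] = 1.0  # self-similarity
    return full
```

This gives the pairwise values directly, with no column-wise correlation involved.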
RicardoMBorges commented 11 months ago

Hi alrichardbollans, thank you for your message. It's exciting. About the script suggestion, I'll work on it. For your points:

  1. It was more of a choice. Dice seemed, at the time, to be the best option, but the two are very similar.
  2. Simply put, no. But it can be added.
  3. One thing I'm planning to do is to modify things to make it more like Lego, where users can add pieces of code in the notebook to construct their own workflow.
  4. Not sure I got what you mean, but the suggestion is similar to something I'm trying to do now. I'll keep in contact.
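(As an aside on point 1: Dice and Tanimoto are closely related in a precise sense. For the same two fingerprints, Dice = 2T/(1+T) where T is Tanimoto, so they rank compound pairs identically and differ only in scale. A small sketch, with plain Python sets of "on bits" standing in for fingerprints:

```python
def tanimoto(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """2·|A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b))

fp1 = {1, 4, 7, 9}   # toy fingerprints: sets of on-bit positions
fp2 = {1, 4, 8}
t = tanimoto(fp1, fp2)  # 2 / 5 = 0.4
d = dice(fp1, fp2)      # 4 / 7 ≈ 0.571
assert abs(d - 2 * t / (1 + t)) < 1e-12  # Dice = 2T / (1 + T)
```

Since the mapping is monotonic, a threshold on one coefficient can always be translated into an equivalent threshold on the other.)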

Thanks again,

Ricardo

alrichardbollans commented 11 months ago

Hi Ricardo,

Thanks for your response

On point 3, yes, some more functions would be great :)

Also, in terms of comparing two sets of compounds: suppose you have a dataset (df1) of compounds and a reference set of bioactive compounds (df_active). One way to measure the similarity of compounds in df1 to compounds in df_active is, for each compound in df1, to find the similarity to the most similar compound in df_active. This measures the potential of each compound in df1 to act in a similar way to some compound in df_active, and gives a metric that can be used to compare various datasets (e.g. another set of compounds, df2). I did write some code for myself to do this, but it's a little messy and particular to my data -- I could tidy it up and send you a version if you'd like.
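The nearest-neighbour metric described above could be sketched roughly like this (plain Python sets stand in for fingerprints; with RDKit bit vectors the inner comparison would be DataStructs.BulkTanimotoSimilarity, and the function name is hypothetical):

```python
def tanimoto(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def max_similarity_to_reference(compounds: list, reference: list) -> list:
    """For each compound, the similarity to its nearest neighbour in the
    reference set (e.g. df_active)."""
    return [max(tanimoto(c, r) for r in reference) for c in compounds]

df1_fps = [{1, 2, 3}, {7, 8}]     # toy fingerprints for a dataset
active_fps = [{1, 2, 4}, {8, 9}]  # toy reference (bioactive) set
scores = max_similarity_to_reference(df1_fps, active_fps)
# scores[0] = max(2/4, 0/5) = 0.5 ; scores[1] = max(0/5, 1/3) ≈ 0.333
```

Summarising the per-compound scores (e.g. their mean or distribution) then gives a single dataset-level number for comparing df1 and df2 against the same reference set.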