Calculate CLUEreg scores for new drugs

tithuytrang commented 1 year ago

Dear GRAND team, Are there any codes/functions to calculate CLUEreg scores (Overlap, Cosine, Tau, ...) for new drugs outside of your library? We have obtained TF signatures for the disease using PANDA and MONSTER, and it would be great to check how they are reversed compared with those of known drugs that are not currently available in CLUEreg but for which we have expression data from in vitro treatments.

Thank you for the great work on this helpful database! Trang

marouenbg commented 1 year ago

Hi again @tithuytrang 👋 I maintain some of the netzoo resources, so you will see me replying here and in the GitHub discussion or dispatching to other developers. That is a great suggestion; we don't have it in the web version and I will add it as a feature request. Here is how you can do it on your own:

For your drug profiles, compare them to a control and derive 'differential' signatures.
Derive a differential signature for the disease using a PANDA/MONSTER control and the same differential method.
Now just compute the cosine between both differential profiles. Overlap is also a very simple formula; it is described in the paper. You can get p-values by generating random signatures and computing the cosine of your drug signature to them, then ranking the cosine you obtained earlier within the random distribution to get a p-value, and FDR-correcting it to get a q-value. Tau can be computed in a similar way if you use the GRAND signatures as the distribution instead of random signatures; I can provide the raw data if you're planning to get there. I hope that helps.
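
The cosine, p-value, and q-value steps above can be sketched in Python as follows. This is a minimal illustration, assuming both differential signatures are numeric vectors over the same ordered gene (or TF) list. The function names, the use of shuffling to generate random signatures, the one-sided test for reversal, and the choice of Benjamini-Hochberg as the FDR correction are all assumptions made for the example, not the CLUEreg implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two equally ordered signature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def empirical_p(drug_sig, disease_sig, n_random=1000, seed=0):
    """Rank the observed cosine within cosines to random signatures.

    Random signatures are produced here by shuffling the disease
    signature (an assumption; other null models are possible). The test
    is one-sided for reversal, i.e. for strongly negative cosines.
    """
    rng = np.random.default_rng(seed)
    observed = cosine(drug_sig, disease_sig)
    null = np.array([cosine(drug_sig, rng.permutation(disease_sig))
                     for _ in range(n_random)])
    # the +1 terms keep the p-value strictly above zero
    p = (np.sum(null <= observed) + 1) / (n_random + 1)
    return observed, float(p)

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0, 1)
    return q
```

With several candidate drugs, `empirical_p` would be called once per drug and the resulting p-values passed together to `bh_qvalues`.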

tithuytrang commented 1 year ago

That's awesome, thanks @marouenbg! Since we want to apply the same workflow as CLUEreg, could you please help with these inquiries?

How did CLUEreg build the aggregated GRN for each drug (some more description of the parameters used would be appreciated too)? Also, GRAND seems to have only sample-specific GRNs for drugs, so it would be helpful for users to get aggregated GRNs for similar signature searching. Not sure if GRAND allows network contributions derived from drug-induced data? If so, it would be helpful for users to have a guideline of the standardised workflows used in CLUEreg.
How were the differential profiles of each drug chosen (any cut-off values or top n)? I assume how these sets are chosen does affect CLUEreg scores.
When signatures are submitted to GRAND, which differential profiles are used for matching if a drug has multiple cell line treatments?

We're trying to apply the calculations described in the supplementary material of the GRAND paper, so it would be much appreciated if you could provide the raw data of GRAND signatures for Tau calculation. If you prefer private discussion, my email is truongtra@deakin.edu.au Cheers, Trang

marouenbg commented 1 year ago

Hello Trang,

How did CLUEreg build the aggregated GRN for each drug (some more description of the parameters used would be appreciated too)? -> We used all the gene expression samples for a given drug, across doses, cell lines, and time points, to build a GRN for that drug.

Also, GRAND seems to have only sample-specific GRNs for drugs, so it would be helpful for users to get aggregated GRNs for similar signature searching. Not sure if GRAND allows network contributions derived from drug-induced data? If so, it would be helpful for users to have a guideline of the standardised workflows used in CLUEreg. -> Yes, we chose to share the sample-specific networks in the database, but we kept the aggregate networks for CLUEreg. We are discussing releasing a Nextflow version of the processing pipeline, but this is going to take some time since it is written in MATLAB.

How were the differential profiles of each drug chosen (any cut-off values or top n)? I assume how these sets are chosen does affect CLUEreg scores. -> Yes, it certainly does. The differential profiles for each drug are chosen based on a z-score > 2.
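
One way to apply such a cut-off is sketched below in Python. Only the z-score > 2 threshold comes from the reply above. How the z-scores are computed (here, by standardising the differential targeting scores across genes) and the symmetric threshold used for the down-regulated set are assumptions for illustration, and `select_signature` is a hypothetical helper name.

```python
import pandas as pd

def select_signature(diff_targeting, z_cut=2.0):
    """Split a differential targeting profile into up and down gene sets.

    diff_targeting: pandas Series indexed by gene (drug minus control).
    Returns two lists of gene names, strongest first.
    """
    std = diff_targeting.std(ddof=0)
    if std == 0:
        # a flat profile has no differential genes
        return [], []
    z = (diff_targeting - diff_targeting.mean()) / std
    up = z[z > z_cut].sort_values(ascending=False).index.tolist()
    down = z[z < -z_cut].sort_values().index.tolist()
    return up, down
```

A fixed top-n selection would give sets of constant size instead; with a z-score cut-off, the set size varies from drug to drug.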

When signatures are submitted to GRAND, which differential profiles are used for matching if a drug has multiple cell line treatments? -> Cell lines, doses, and time steps are aggregated to reconstruct a single network for that specific drug. The current version of CLUEreg does not account for cell type, and like I said we are considering an expanded version that includes cell types and doses. The major challenge was the fast query of a million signatures on the web, but that's doable in principle.

Please let me know if I can help in any way.

tithuytrang commented 1 year ago

Hi @marouenbg, Many thanks for your answers! I wonder which metric we should prioritize among the CLUEreg output statistics (Overlap, Tau, Cosine, q-values) if we would like to rank repurposing drugs based on a single value only? Also, would you be happy to share the raw data of GRAND signatures for Tau calculation? We are trying to calculate it with the PANDA results from our in vitro treatment data.

marouenbg commented 1 year ago

Hi @tithuytrang, I found Overlap and Cosine to be quite accurate; the p-value and Tau are just significance values for Overlap and Cosine. This is the gene targeting raw data: https://granddb.s3.amazonaws.com/drugs/drugNetwork/PANDA/Drugs_Gene_Targeting_AllSamples.csv and this is the TF targeting raw data: https://granddb.s3.amazonaws.com/drugs/drugNetwork/PANDA/Drugs_TF_Targeting_AllSamples.csv
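
Following the description earlier in this thread (ranking a cosine within a distribution built from GRAND signatures instead of random ones), a Tau-like percentile could be sketched in Python as below. The exact Tau formula used by CLUEreg is not given here, so `percentile_score` is only an illustrative stand-in. The table layout (genes as rows, one column per reference signature) is also an assumption about the files linked above.

```python
import numpy as np
import pandas as pd

def reference_cosines(query, reference):
    """Cosine of a query signature against every column of a reference table.

    query: Series indexed by gene; reference: DataFrame, genes x signatures.
    Only genes present in both are used.
    """
    genes = query.index.intersection(reference.index)
    q = query.loc[genes].to_numpy(dtype=float)
    ref = reference.loc[genes].to_numpy(dtype=float)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(q)
    # all-zero columns get a cosine of 0 instead of a division error
    cos = np.divide(q @ ref, norms, out=np.zeros(ref.shape[1]), where=norms > 0)
    return pd.Series(cos, index=reference.columns)

def percentile_score(observed, ref_cos):
    """Percentage of reference cosines that are less negative than `observed`."""
    return 100.0 * float(np.mean(ref_cos.to_numpy() > observed))
```

If the first column of the CSV holds gene names, the reference table could be loaded with `pd.read_csv(url, index_col=0)`. Whether the raw targeting scores need further processing (for example, conversion to differential profiles) before being used this way is not stated in the thread.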

Please let me know if you need anything else! Marouen

tithuytrang commented 1 year ago

Thanks again @marouenbg! Really appreciate your quick support. Will get back to you if I have more questions on the GRAND database.

A potential feature for NetZooR overall (not sure if it's suitable to post here) that I think would benefit many users is a RAM-friendly option. Out-of-memory R crashes greatly bar users from exploring the potential of these great packages. Some of my colleagues opted to filter input genes (e.g., keeping the most differentially expressed only), but this might sacrifice the robustness of the networks. I modified PANDA to make it work on workstations with limited RAM, using file-backed matrices to utilise hard drive space. It was a sloppy and slow workaround, but we could finish analyses without HPC. It would be great if a memory-friendly option could be integrated into official releases with your optimisation.
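
For readers unfamiliar with file-backed matrices, the general idea can be sketched in Python with `numpy.memmap`. This is not the R modification described above and not part of any netZoo release. It only illustrates writing a large gene-by-gene matrix (here a Pearson correlation matrix, chosen as an example) to disk block by block so that the full result never sits in RAM. `blockwise_corr_to_disk` is a hypothetical helper name.

```python
import numpy as np

def blockwise_corr_to_disk(expr, out_path, block=1000):
    """Write the gene-gene Pearson correlation matrix to a file-backed array.

    expr: genes x samples array (assumed small enough to fit in RAM).
    Only `block` rows of the genes x genes result are computed at a time;
    the full matrix lives in the file at `out_path`.
    """
    expr = np.asarray(expr, dtype=np.float64)
    n_genes = expr.shape[0]
    # centre and scale each gene so correlation becomes a dot product
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant genes end up with zero correlation
    unit = centered / norms
    out = np.memmap(out_path, dtype=np.float64, mode="w+",
                    shape=(n_genes, n_genes))
    for start in range(0, n_genes, block):
        stop = min(start + block, n_genes)
        out[start:stop, :] = unit[start:stop] @ unit.T
    out.flush()
    return out
```

The trade-off is the one mentioned above: disk access is much slower than RAM, so this kind of approach exchanges speed for the ability to finish on a workstation.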

marouenbg commented 1 year ago

Hi @tithuytrang,

Thank you for this great suggestion. I have thought about this a lot as well, because we faced the same issue. I generally find that the Python implementation is the most economical with RAM, and that running analyses on Ubuntu tends to consume the most RAM; in my experience the R + Ubuntu combination uses the most. As a potential solution, we thought about using a C implementation, which should use very little RAM and can be bound to R/Python/MATLAB through binary routines, but the old C implementation we have (https://github.com/netZoo/netZooC) is not optimal at all. So coming up with a more modern C implementation could be the solution here.

QuackenbushLab / grand

Calculate CLUEreg scores for new drugs #11