cbg-ethz / nempi

Nested Effects Models based Perturbation Inference - https://doi.org/10.1093/bioinformatics/btab113 - https://bioconductor.org/packages/nempi
http://bioconductor.org/packages/nempi
2 stars 1 forks source link

Replication Data Availability #2

Closed axeljeremy7 closed 6 months ago

axeljeremy7 commented 8 months ago

Can you provide a real scRNA dataset and Cancer TCGA dataset to replicate, because the other folder seems to be working for custom users inside your organization.

MartinFXP commented 8 months ago

As described in the article, the scRNA data is available from the gene expression omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90546. The TCGA data set is available, as described in the TCGA.r file, through the TCGAbiolinkspackage at bioconductor. Most referrals to local files are to the (back then) development versions of the bioconductor packages mnemand dce.

axeljeremy7 commented 8 months ago

I have access to GSE90546 ,but the transformation before is used on nempi i think is missing, and there is 3 datasets not 6 as the paper mentioned.

MartinFXP commented 8 months ago

The paper actually mentions five perturbseq datasets: pilot, main and epistasis 1-3. The files on GEO contain three data sets because the epistasis data is included in a single set and has to be split afterwards. In the paper we split the main study into two analyses with 10 and 15 p-genes, respectively. Hence, six analyses based on actually three perturbseq datasets. But that is semantics.

The normalisation/transformation is described in the paper (plus supplement), and code, which, apart from the need of editing the folder structure, is readily available from the perturbseq.r file in the other folder. Latest R and R package versions could make more code updates necessary.

axeljeremy7 commented 8 months ago

Ok but is there anyway you can provide like html as you did on the simulation but with real data input because i’m not able to run it and I think the files are missing when executing or be more specific please . My goal is to test with other neural networks and gradient boosting but the input and running part is not clear so far , and I’m more familiar with python than R.

also, for the simulation part the data being passed to the nempi function is not strictly a binary effects matrix or a log odds matrix, the transformation applied to the original binary data by adding noise is not around 1 and -1 , so could be seen as a heuristic to simulate log odds?

MartinFXP commented 8 months ago

I am sorry, but there is no respective html. To be more specific:

  1. Download and untar AND unzip the GEO data into "path/"
  2. Then you need to set the parameters: e.g., data <- "perturbseq2"; set <- 2; path <- "path/"; subset <- 5. I think set is not actually necessary to compute the log odds. subset can be 1,2,3,4 or 5.
  3. You can then run the perturbseq.r script. You do need to make sure you have downloaded all necessary packages. You do not need to source the mnem files and can just load the current Bioconductor package with library(mnem). The dce package is also available at Bioconductor.
  4. Beware that no matter what parameters you give, the result for data="parturbseq2 is saved in file = paste0(path, "Counts.rds"). You need to either change that in the script or rename after every run.
  5. Run again with data="lods".

That should work.

Yes the simulated data is not strictly a log odds matrix but more of a heuristic to get close to realistic log odds.

axeljeremy7 commented 7 months ago

I have this completed saveRDS(M, file = paste0(path, "Counts.rds")) for "perturbseq2" but I do not have "Counts_Main.rds", "Counts_Epistasis.rds", "Counts_Pilot.rds", and why the renaming. Also, how for the "perturbseq_nempi.r", why use set 5 or the others , now I am assuming :

cat("PART 2", "\n")
saveRDS(M[[1]], file = paste0(path, "Counts_Main.rds"))
saveRDS(M[[2]], file = paste0(path, "Counts_Epistasis.rds"))
saveRDS(M[[3]], file = paste0(path, "Counts_Pilot.rds"))

cat("PART 3", "\n")
genes <- rownames(M[[1]])
for (i in 2:3) {
    genes <- intersect(genes, rownames(M[[i]]))
}
for (i in 1:3) {
    M[[i]] <- M[[i]][genes, ]
}
M <- do.call("cbind", M)
gc()
saveRDS(M, file = paste0(path, "Counts.rds"))
cat("END", "\n")
MartinFXP commented 7 months ago

Yes you are right. Sorry it has been a while since I did this. However, I think it is actually

saveRDS(M[[1]], file = paste0(path, "Counts_Pilot.rds"))
saveRDS(M[[2]], file = paste0(path, "Counts_Epistasis.rds"))
saveRDS(M[[3]], file = paste0(path, "Counts_Main.rds"))

I added that to the script.

axeljeremy7 commented 7 months ago

ok, but when is read it on the other file lods <- readRDS(paste0(path, "L_linnorm_", set, "_sc.rds")), can you explain or hint what sets and edger means? and does LOOCV only is for other dataset TCGA not for perturbseq data? if TCGA is fine to run too?

MartinFXP commented 7 months ago

For the article we used data <- "lods" to normalize the count data with linnorm and use the ks package to compute the cumulative distribution. This distribution was used to estimate log odds. In the end for each of the five data sets you get a linnorm file like lods <- readRDS(paste0(path, "L_linnorm_", set, "_sc.rds")).

The linnorm computation takes some time so you can set subset <- 1:5 or any subset of 1 to 5. For example to run all five set in parallel and not sequential.

data <- "edger" uses the package edgeR to compute an alternatives to the log odds.

Yes, we did not perform loocv on the perturbseq data.

axeljeremy7 commented 7 months ago

Also, for the epistasis when i =2, in do gene there are case where is Unknown and lenght is 1 so kds fail, so i have skip those cases but it makes me question if the previous scripts are fine, so far for i =1 and i =3 it was able to run fine without modification.

MartinFXP commented 7 months ago

Looks like you are on the right track.

The boxplot function in the mnem package was updated and renamed to moreboxplot.

Sorry, but I do not have the original files anymore and I remember that it was a very demanding computation.

all the files are Counts_Epistasis related even though there are other types

I don't understand that.

I am assuming to ignore since is creating, but keeping the logic to know when is greedy and exhaustive giveb this one come from Counts_Pilot_M1.rds

This is also a sentence I do not understand.

axeljeremy7 commented 7 months ago

Or for this "file <- paste0("nempiepistasis", set, "", domice, "", Pgenes, "_", unkpct, ".csv")"

The naming for all cases is epistasis which is fine based on my understanding using 1,2,3 but 4 and 5 will have the same name for set 4 and 5, so that is my doubt, is this just something to ignore due the name ? because I want to be sure the calculation are right when using set 4 and 5 with greedy when the files are main and pilot.

Also, I am using m1 mac so the chip is arm not intel x96, and the cpp files failed to compile, has you encountered something like this because even after manually call the source files still fails.

MartinFXP commented 7 months ago

Yes the epistasis is not really descriptive here. The main point is the set parameter, which decides whether the main (4), pilot (5) or epistasis (1-3) studies are analyzed.

I would suggest to use the official mnem package, if you not already are, available from bioconductor.org here with official support for mac apparently including arm. Personally, I only run windows and linux.

axeljeremy7 commented 7 months ago

does the file for tcga contains leave-one-out cross-validation for the classification part, and to create the plots of causal bayesian network for the previous data, some case on the casual netwok is ok but the heatmap is not nice to see, do you remeber something like this?

MartinFXP commented 7 months ago

The leave one out (loo) analysis is not in the TCGA files but the nempi_loo.r file. We did not plot the networks for the perturbseq data and neither did we produce heatmaps. I don't remember any "not nice to see" heatmaps.

axeljeremy7 commented 7 months ago

for the nempi, the part of "Accuracy of all methods for the CRISPR main validation study. The runs were split up into dense and sparse ground truth networks phi" where exactly is that?, because I was able to do the first part "Accuracy of all methods for the CRISPR validation study with 0.1 and 0.9 unlabelled cells", so I appreciate any insight in the code because I don't see a clear code connection.

Also, I have this output and it match I think I forgot to mentioned you, I have doubts on the meaning of these outputs, because I am using part2 not part1 to match your plot and I hope I'm right.

Screenshot 2023-11-30 at 1 54 19 AM above from page 7 Figure S8:

Gamma2 <- t(mytc(nres$res$adj)) %*% nres$Gamma
nacc <- acc(Gamma2, Gamma)
get_Gamma_lods <- getGamma(lods)
nacc2 <- acc(nres$Gamma, get_Gamma_lods)

this is just one sample, I haven't run it 100 times yet.

> res_nempi_5$list_using_nempi_func$nacc2$auc
[1] 0.7299794
> res_nempi_5$list_using_nempi_func$nacc$auc
[1] 0.9625939
> svm_results_5$svmacc$auc
[1] 0.7521751
> svm_results_5$svmacc2$auc
[1] 0.7514976
> nnet_result_5$nnacc2$auc
[1] 0.6316206
> nnet_result_5$nnacc$auc
[1] 0.81474
> miss_fores_result_5$mfacc$auc
[1] 0.7535074
> miss_fores_result_5$mfacc2$auc
[1] 0.75
MartinFXP commented 7 months ago

When you have your saved results, you can count the edges in each network provided by the result of nempi. We simply counted the median edges over all results and split into above median (dense) and below median (sparse). No further computation is necessary.

These accuracies look fine to me and actually pretty close to the original results. If you mean nacc2with part2 then you are using the wrong model(s). nacc is the correct one. Here we compare the ground truth sample labels (with network propagation) to the inferred labels with network propagation.

axeljeremy7 commented 6 months ago

Sorry I am doing this replication part and there is no nempi part, all happen in classpi and nem which come from mnems otherl ibrary, my previous was wrapper to be organized and is fine, my question is the part to recreate expression(dense ~ phi) plot second part because it seems there is part missing to do that if not all will be the same as before, and given the unident is FALSE, thenres <- nem(D_bkup, Rho = Gamma, ...), so no need of nempi, i understand it is not needed but my doubt is as before to express the second type of plot for expression(dense ~ phi)


lods.sub <- lods
unknowns <- sample(1:ncol(lods),floor(unkpct*ncol(lods)))
colnames(lods.sub)[unknowns] <- ""
## nempi
nres <- nempi(lods.sub,search=search)
n <- nrow(phi)
phiacc <- 1-sum(abs(mytc(phi)-mytc(nres$res$adj)))/(n*(n-1))
Gamma2 <- t(mytc(nres$res$adj))%*%nres$Gamma
nacc <- acc(Gamma2,Gamma)
nacc2 <- acc(nres$Gamma,getGamma(lods))
## svm
svmres <- classpi(lods.sub)
svmacc <- acc(svmres$Gamma,Gamma)
svmacc2 <- acc(svmres$Gamma,getGamma(lods))
## random forest
rfres <- classpi(lods.sub,method="randomForest")
rfacc <- acc(rfres$Gamma,Gamma)
rfacc2 <- acc(rfres$Gamma,getGamma(lods))
## neural net
nnres <- classpi(lods.sub,method="nnet",size=2)
nnacc <- acc(nnres$Gamma,Gamma)
nnacc2 <- acc(nnres$Gamma,getGamma(lods))
## knn
train <- t(lods.sub[, which(colnames(lods.sub) != "")])
test <- t(lods.sub[, which(colnames(lods.sub) == "")])
cl <- colnames(lods.sub)[which(colnames(lods.sub) != "")]
tmp <- knn(train, test, cl, prob=TRUE)
lods.sub2 <- lods.sub
colnames(lods.sub2)[which(colnames(lods.sub2) %in% "")] <- as.character(tmp)
knnres <- list()
knnres$Gamma <- getGamma(lods.sub2)
knnres$Gamma <- apply(knnres$Gamma, 2, function(x) return(x/sum(x)))
knnacc <- acc(knnres$Gamma,Gamma)
knnacc2 <- acc(knnres$Gamma,getGamma(lods))
## missForest
mfdata <- cbind(as.data.frame(t(lods.sub)), factor(colnames(lods.sub)))
mfdata[which(mfdata == "", arr.ind = TRUE)] <- NA
sink("NUL")
mfimp <- missForest(mfdata)
sink()
lods.sub2 <- lods.sub
colnames(lods.sub2) <- mfimp$ximp[, ncol(mfimp$ximp)]
mfres <- list()
mfres$Gamma <- getGamma(lods.sub2)
mfres$Gamma <- apply(mfres$Gamma, 2, function(x) return(x/sum(x)))
mfacc <- acc(mfres$Gamma,Gamma)
mfacc2 <- acc(mfres$Gamma,getGamma(lods))
MartinFXP commented 6 months ago

there is no nempi part

You just posted the nempi part.

## nempi
nres <- nempi(lods.sub,search=search)

The classpi function computes a network based on the inferred perturbations by the various classifiers, e.g., svm. This is just an addon for the interested and is not used anywhere in the analysis.

As I have stated before, when you use nempi, you also get a network. We compute the density by counting the edges. For all runs you get a median density. Every run above that is called dense, and below the median is called sparse. Then you make the same plots, but instead over all runs, once for the dense runs and once for the sparse runs. And again, no further computation is necessary.