drostlab / myTAI

Evolutionary Transcriptomics with R
https://drostlab.github.io/myTAI/
GNU General Public License v2.0
38 stars 16 forks source link

Significance status of signature: NA #9

Open maxnest opened 2 years ago

maxnest commented 2 years ago

Dear Dr. Hajk-Georg Drost, Let me thank you for the awesome and very important myTAI library. In my project, I used myTAI to determine the transcriptome age indices of the life cycle stages of different flatworm species. When considering the complete transcriptomes, no problems arose, and very interesting results were obtained. However we are faced with the NaNs produced warning during PlotSignature when analyzing the “reduced” dataset, including only genes of last common ancestor of the studied species:

Plot signature: ' TAI ' and test statistic: ' FlatLineTest ' running 1000 permutations. $start.arg $start.arg$shape [1] Inf $start.arg$rate [1] Inf $fix.arg NULL Significance status of signature: NA Warning: 1: In dgamma(c(0.000122768450181325, 0.000122768450181325, 0.000122768450181325, : NaN produced 2: In stats::pgamma(real.var, shape = shape, rate = rate, lower.tail = FALSE) : NaN produced

Used code:

library(edgeR) library(myTAI) library(dplyr) library(phylotools)

fgig_phylostratr <- read.csv2("../../Phylostratr/Fgigantica/Fgigantica_phylostratr_results.tsv", sep="\t", header = T) fgig_digenea_ancestor <- get.fasta.name(infile="../../Ancestral_pyHAM/Fgigantica_100aa.Digenea_ancestor.fasta") fgig_expr <- read.csv2("../../Expr_quant/Fgig_decontaminated_salmon_quant/Fgigantica_100aa_unaveraged_TPMs.tsv", sep="\t", header = T)

fgig_exp_filtered <- subset(fgig_expr, GeneIDs %in% fgig_digenea_ancestor) fgig_exp_filtered_mut <- mutate_all(fgig_exp_filtered[, 1:length(colnames(fgig_exp_filtered))], function(x) as.numeric(as.character(x))) fgig_exp_filtered_mut$GeneIDs <- fgig_exp_filtered$GeneIDs fgig_exp_filtered_mut_mean <- data.frame("GeneIDs"=fgig_exp_filtered_mut$GeneIDs, "Egg"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_egg_rep1", "Fgig_egg_rep2", "Fgig_egg_rep3")]), "Miracidium"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_miracidium_rep1", "Fgig_miracidium_rep2", "Fgig_miracidium_rep3")]), "Redia"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_redia_rep1", "Fgig_redia_rep2", "Fgig_redia_rep3")]), "Cercaria"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_cerc_rep1", "Fgig_cerc_rep2", "Fgig_cerc_rep3")]), "Metacercaria"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_metacerc_rep1", "Fgig_metacerc_rep2", "Fgig_metacerc_rep3")]), "Juvenile_42_days"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_juv_42d_rep1", "Fgig_juv_42d_rep2", "Fgig_juv_42d_rep3")]), "Juvenile_70_days"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_juv_70d_rep1", "Fgig_juv_70d_rep2", "Fgig_juv_70d_rep3")]), "Adult"=rowMeans(fgig_exp_filtered_mut[, c("Fgig_adult_rep1", "Fgig_adult_rep2", "Fgig_adult_rep3")])) fgig_phylostratr_filtered <- subset(fgig_phylostratr, qseqid %in% fgig_digenea_ancestor)

colnames(fgig_phylostratr_filtered) <- c("GeneIDs", "MRCA", "Phylostratum", "MRCA_name") fgig_phylomap <- select(fgig_phylostratr_filtered, "Phylostratum", "GeneIDs") fgig_phyloexp <- MatchMap(fgig_phylomap, fgig_exp_filtered_mut_mean)

fgig_phyloexp_tf <- tf(fgig_phyloexp, function(x) log2(x+1)) pdf("Fgigantica_logTPMs_TAI.digenea_ancestor.pdf", width = 14) PlotSignature(ExpressionSet = fgig_phyloexp_tf, measure = "TAI", TestStatistic = "FlatLineTest", xlab="F.gigantica complex life cycle", ylab="TAI: Digenean ancestor genome model", permutations = 1000) dev.off()

As a result, on the created plot we see “p_flt = NaN”. I guess is that there is not enough observation count in fgig_phyloexp_tf (3772 observations) to run the tests. Is this possible or is there another explanation or assumption? I would be grateful for any help! Thanks a lot! Yours sincerely, Maksim

HajkD commented 2 years ago

Hi Maksim,

I am very glad to hear that you find myTAI useful for your research!

Regarding your question, usually TAI profiles should be computed on the entire transcriptome (including all genes) and not for a subset, since this may result in misleading outcomes or interpretations. On the technical side, however, computing TAI values and performing the test stats on ~4000 genes should not be the limiting step. To me the error message looks more like there is a NA value present in the input data passed internally to the FlatLineTest() function.

Could you try running:

fgig_phyloexp_tf <- na.omit(tf(fgig_phyloexp, function(x) log2(x+1)))

Within your analysis steps to see if this makes a difference?

Also, when you run the other tests: rht etc. Does the same NAN error occur?

I hope this helps!

Cheers, Hajk

maxnest commented 2 years ago

Dear Hajk, Following your comment, we decided to focus on the analysis of entire transcriptomes. After a series of tests, we assumed that the error was not related to data, but to the environment in RStudio (Windows). During the reanalysis of the data, but already in Linux, we did not find any errors and got excellent results on all our data. Thank you for your help and for developing the extremely useful and very valuable myTAI. Best wishes, Maksim

HajkD commented 2 years ago

Dear Maksim,

Thank you for your kind feedback and I am glad the issue is resolved. It seems like the fitdistrplus package that myTAI uses to perform parameter inference for fitting distributions in FlatLineTest() does not work on Windows machines. I will try to see if there is a workaround or please feel free to contact the authors of the fitdistrplus R package to see if they plan to support computations on Windows machines or not.

I hope this helps.

Cheers, Hajk