patternTest on separate fitGAM objects

Alexis-Varin commented 3 months ago

Hello, we have an object with 2 conditions. We selected 2 of our 7 lineages and extracted the pseudotime and cell weights for each. We filtered our genes to keep only variable and deviant genes, ending up with about 5k genes in total. We then ran fitGAM() for each lineage separately on a computing cluster (about 25k cells). This proved challenging even with the resources we had, but the runs completed. Separating the lineages was necessary (more on that in another issue I will open soon).

My question is: how can we merge the two SCE objects, one per lineage, obtained at the end of the computation in order to run patternTest()? I see that it accepts a list, which is the non-SCE output. Where are the fitGAM() results stored in the SCE object, in case I want to extract them and pass them to patternTest() as a list? conditionTest() ran fine on each object, but we would also like to find the DEGs that govern the choice between the two lineages (both lineages are at a branch point).

Thanks !

HectorRDB commented 3 months ago

Hi again @Alexis-Varin

I would not recommend doing this separately: the smoothers will be better if they can estimate each gene's variance using all cells. If runtime is an issue, I would recommend downsampling the dataset or looking into meta cells to get a smaller dataset. This way, you can run all tests (patternTest, conditionTest and the other gene-level tests) on the same object.
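A minimal sketch of the downsampling idea, assuming the cells can be indexed by position. Sampling within each condition-by-cluster stratum is a choice made for this sketch, not something prescribed in the thread. The helper `downsample_cells()`, the `cluster` labels and the 25% fraction are all hypothetical.

```r
# Toy stand-ins for the real per-cell annotations (hypothetical values).
set.seed(1)
n_cells <- 1000
conditions <- factor(sample(c("A", "B"), n_cells, replace = TRUE))
cluster <- factor(sample(paste0("c", 1:5), n_cells, replace = TRUE))

# Keep the same fraction of cells in every stratum so that no
# condition/cluster combination disappears from the smaller dataset.
downsample_cells <- function(strata, fraction) {
  idx_by_stratum <- split(seq_along(strata), strata, drop = TRUE)
  keep <- lapply(idx_by_stratum, function(idx) {
    n_keep <- max(1, floor(length(idx) * fraction))
    idx[sample.int(length(idx), n_keep)]
  })
  sort(unlist(keep, use.names = FALSE))
}

keep <- downsample_cells(interaction(conditions, cluster), fraction = 0.25)
length(keep)

# With a real object (hypothetical name `sce`), the subset would then be:
# sce_small <- sce[, keep]
```

The resulting index vector can be used to subset the counts, pseudotime, cell weights and condition labels consistently before a single fitGAM() run.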

Also, what do you mean by "both lineages are at a branch point"? Did you restrict to the parts after the branching?

Alexis-Varin commented 3 months ago

Hi, we did indeed have runtime issues (several days for evaluateK with nGenes = 200...), which forced us to focus on a single lineage at a time and to filter our genes down to about 5k (out of 40k, because we have a lot of ncRNA). After many tries and with the help of our cluster IT team, I discovered a parallelization problem that overloaded the Slurm node (I might open an issue on the BiocParallel GitHub or here). Basically, with X workers in MulticoreParam() passed as BPPARAM and X CPUs per task in the Slurm batch script, it created X processes (PID) with X threads/cores (SPID) each, so X² cores running, instead of 1 process with X threads/cores. This caused a severe conflict and slowed the computation down. I solved it by setting just 1 worker in MulticoreParam() in the BPPARAM of fitGAM() and requesting 36 CPUs in my Slurm batch script: a single PID is created with 36 cores, and the runtime went from several days to a few hours. I suspect this behavior is not specific to cluster computing or Linux but occurs on all machines, and is why many people have so much trouble running the function. Additionally, other backends such as SnowParam() or BatchtoolsParam() did not work.
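The oversubscription described above comes down to simple arithmetic. The sketch below assumes the job was submitted with `--cpus-per-task=36`, in which case Slurm exports `SLURM_CPUS_PER_TASK`. The helper names `cores_in_use()` and `fit_with_one_worker()` are hypothetical, and the wrapper is only defined here, not run.

```r
# CPUs granted by Slurm; falls back to 1 when run outside a Slurm job.
slurm_cpus <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

# Cores in use if each worker process spawns `threads_per_worker` threads.
cores_in_use <- function(workers, threads_per_worker) {
  workers * threads_per_worker
}

cores_in_use(36, 36)  # X workers x X threads each: 1296 on a 36-CPU allocation
cores_in_use(1, 36)   # the workaround, 1 worker x 36 threads: 36

# The workaround written as a fitGAM() call (assumes tradeSeq and
# BiocParallel are installed; BPPARAM is only used when parallel = TRUE).
fit_with_one_worker <- function(counts, pseudotime, cellWeights, conditions) {
  tradeSeq::fitGAM(
    counts = counts,
    pseudotime = pseudotime,
    cellWeights = cellWeights,
    conditions = conditions,
    parallel = TRUE,
    BPPARAM = BiocParallel::MulticoreParam(workers = 1)
  )
}
```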

Now I might try running fitGAM() on the whole object (without extracting the pseudotimes and cell weights separately). However, we are only interested in 2 of the 7 lineages: do we still need to fit genes on all 7 lineages? Ideally we would also like to include all protein-coding genes expressed in more than 5 cells, selected with rownames(sds[rowSums(assays(sds)$counts > 0) > 5,]), and excluding mito and ribo genes, which increases the total to about 15k genes. The total runtime might still be several days (the maximum runtime is 8 days on the cluster node), so fitting only 2 lineages would reduce it (although those 2 lineages account for about 30k of the 40k cells in the full object).
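The gene selection described above can be sketched on a toy matrix. The gene names, the `biotype` annotation vector and the `^MT-` / `^RP[SL]` name patterns (human-style symbols) are assumptions for illustration. A real analysis would take biotypes from its own annotation, and `counts` here stands in for `assays(sds)$counts`.

```r
genes <- c("MT-CO1", "RPS3", "RPL5", "GATA1", "SPI1", "LINC00001", "RAREGENE")
counts <- matrix(
  2L, nrow = length(genes), ncol = 20,
  dimnames = list(genes, paste0("cell", 1:20))
)
counts["RAREGENE", ] <- c(1L, 1L, 1L, rep(0L, 17))  # expressed in only 3 cells

# Hypothetical biotype annotation, one entry per gene.
biotype <- c("protein_coding", "protein_coding", "protein_coding",
             "protein_coding", "protein_coding", "lncRNA", "protein_coding")
names(biotype) <- genes

expressed    <- rowSums(counts > 0) > 5                   # more than 5 cells
is_mito_ribo <- grepl("^MT-|^RP[SL]", rownames(counts))   # mito and ribo genes
is_coding    <- biotype[rownames(counts)] == "protein_coding"

genes_to_fit <- rownames(counts)[expressed & !is_mito_ribo & is_coding]
genes_to_fit  # "GATA1" "SPI1"
```

The resulting vector of names is the kind of input that can be passed to the `genes` argument of fitGAM().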

As for the branch point, we did not subset the object. Both lineages (all 7, actually) start at the same point and follow the same path until about 2/3 of the pseudotime, where they bifurcate and end in two different clusters. What we are really interested in are the genes that govern this fate choice, which is also influenced by the two conditions.

HectorRDB commented 3 months ago

Ok lots of (interesting) points:

The issue with MulticoreParam() is indeed one to take up with the relevant package. We have noticed many issues with this package, but they are quite inconsistent, so it's hard to give advice.
Regarding what to focus on: first, subsetting to protein-coding genes is fine, and should actually give you more power. Second, if your 2 lineages account for 75% of all cells, focusing on them should be ok. This will also make fitGAM() much faster (far fewer parameters to estimate). Additionally, in many settings (although it's impossible to know for yours without better knowledge of the data), we have seen that smaller lineages tend to be spurious and not really biologically meaningful.
Keeping all the lineages is indeed the right choice.
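For the option of focusing on two lineages while still fitting them jointly in one model, a sketch of the bookkeeping could look like this. It assumes cells-by-lineages matrices of pseudotime and weights, as obtained from slingshot with no NA values in the pseudotime. The lineage names, toy values and the `fit_two_lineages()` wrapper are hypothetical, and the wrapper is defined but not run.

```r
n_cells  <- 12
lineages <- paste0("Lineage", 1:7)
weights  <- matrix(0, n_cells, 7,
                   dimnames = list(paste0("cell", 1:n_cells), lineages))
weights[1:4, "Lineage2"] <- 1
weights[5:9, "Lineage5"] <- 1
weights[3:6, c("Lineage2", "Lineage5")] <- 0.5  # shared cells before the split
weights[10:12, "Lineage7"] <- 1                 # cells on another lineage only
pseudotime <- matrix(seq_len(n_cells), n_cells, 7, dimnames = dimnames(weights))

# Keep the two lineages of interest, then drop cells with no weight on
# either of them, since they carry no information for these smoothers.
keep_lineages <- c("Lineage2", "Lineage5")
w2 <- weights[, keep_lineages, drop = FALSE]
keep_cells <- rowSums(w2) > 0
w2  <- w2[keep_cells, , drop = FALSE]
pt2 <- pseudotime[keep_cells, keep_lineages, drop = FALSE]
dim(w2)  # 9 cells x 2 lineages

# One joint fit on both lineages (assumes tradeSeq is installed and that
# `conditions` has already been subset to the same cells).
fit_two_lineages <- function(counts, pt2, w2, conditions, genes) {
  tradeSeq::fitGAM(
    counts = counts[, rownames(w2)],
    pseudotime = pt2,
    cellWeights = w2,
    conditions = conditions,
    genes = genes
  )
}
```

Both lineages stay in the same fit, so tests that compare lineages can still be run on the one resulting object.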

Alexis-Varin commented 3 months ago

Hi, I launched fitGAM() on our whole object with the 15k protein-coding genes expressed in more than 5 cells, so 40k cells, 7 lineages and 2 conditions. With 1 worker in MulticoreParam() and 36 cores requested from Slurm, the behavior looks as expected when checked with the top and ps Linux commands: no nested parallelism. Based on the current progress, I expect it to take 4-5 days. I will post an update when it finishes, including how to work with patternTest, conditionTest and predictSmooth when there are multiple lineages and conditions in order to draw the correct heatmap.
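A rough sketch of how the downstream steps might be wired together once a single fitted object exists. The wrapper names are hypothetical and the wrappers are only defined, not run. How patternTest() treats a fit that includes conditions is not settled in this thread and should be checked against the tradeSeq documentation. The row-scaling helper is plain base R and is the only part executed here.

```r
# Wrappers around the tradeSeq tests (assume tradeSeq is installed).
run_tests <- function(sce) {
  list(
    # Condition effect, also reported per lineage.
    condition = tradeSeq::conditionTest(sce, global = TRUE, lineages = TRUE),
    # Expression-pattern differences between lineages; behaviour with
    # condition-specific smoothers is an open point, see the note above.
    pattern = tradeSeq::patternTest(sce, global = TRUE, pairwise = TRUE)
  )
}

# Smoothed expression as a genes x points matrix for a heatmap.
smooth_matrix <- function(sce, genes, n_points = 50) {
  tradeSeq::predictSmooth(sce, gene = genes, nPoints = n_points, tidy = FALSE)
}

# Row-wise z-scores so that genes with different expression levels are
# comparable within one heatmap.
scale_rows <- function(m) t(scale(t(m)))

toy <- rbind(geneA = c(1, 2, 3, 4), geneB = c(10, 10, 20, 20))
scaled <- scale_rows(toy)
round(rowMeans(scaled), 10)  # each row is centred on 0
```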

HectorRDB / condiments, issue #34