Closed jashapiro closed 3 years ago
As discussed https://github.com/AlexsLemonade/alsf-scpca/issues/4#issuecomment-678374819, for benchmarking purposes we can use prebuilt indexes. I have downloaded these (partial and full decoy) from http://refgenomes.databio.org and placed them in s3://nextflow-ccdl-data/reference/homo_sapiens/refgenomes-hg38/
Initial results are presented in #18. Most interesting at first are the results in trace.txt, which presents the memory and CPU usage of each process. A description of the fields can be found here https://www.nextflow.io/docs/latest/tracing.html#trace-report
Notably, the full_sa jobs use much more memory, but do not seem to take much longer to run. However, the memory requirements are within the m4.2xlarge
instance size when run with that instance's 8 cpus, making it seem likely that using the full selective alignment index is worth doing.
I have not yet looked at mapping rate comparisons for kmer size cDNA vs transcriptome indexes or the number of ncRNA that appear in the samples. The full mapping results can be found at s3://nextflow-ccdl-results/scpca-benchmark/alevin-quant
I have not gotten a formal analysis done, but I wanted to share some insights that I have gotten so far (some of which were in slack, but deserve posting here). I am analyzing the results in https://github.com/AlexsLemonade/alsf-scpca/blob/jashapiro/benchmark-analysis/workflows/alevin-quant/benchmark-analysis.Rmd but I have not yet filed a PR from that branch.
The overall mapping rates for these samples seem low, in the 13-20% range for the two samples that I looked at. Looking at the mapped data though, things look pretty normal, with good numbers of mapped reads per cell and no over-abundance of mitochondrial reads. We did not do any kind of trimming on these data; I may give them a pass through fastp
just to see if there is generally low quality sequence that gets removed that way and might explain the low mapping rate.
Mapping rates are somewhat higher (~2%) with the txome that includes ncRNA; for a net of about 10% of mapped transcripts being noncoding. Based on (ongoing) preliminary analysis; the mapping for coding DNA is unaffected by the inclusion of these transcripts, as would be expected. They might of course have some effect on normalization down the line. Also of note, the cDNA set includes pseudogenes (some of which are consistently expressed at the RNA level in these data), which is something I hadn't previously appreciated.
The ncRNA data does include lncRNAs, at least one of which came up as the highest expressed ncRNA gene in the first sample I looked at: MALAT1, which does seem to have some cancer association in the literature. The next two most common ncRNA were mitochondrial rRNA, which is not surprising!
My preliminary leaning is to go with the full transcriptome (cDNA + ncRNA), using a full SA index, as neither decision seems to have much real cost (the instances we use have enough memory to handle it) and should improve accuracy overall.
Opened #63 to discuss updated benchmarking.
With indexes for salmon and kallisto made, we can start benchmarking the two on real data. We are interested in memory usage and speed, as well as concordence between the different mapping methods.
For salmon, we should also look at the basic mapping vs decoy-aware mapping. I (and apparently developers of salmon) would prefer decoy-aware, but if it uses much more memory and/or time this may not be feasible.