AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License

Adjust default index references for workflows #84

Closed allyhawkins closed 3 years ago

allyhawkins commented 3 years ago

I am updating the default index for each workflow to spliced_txome_k31, the index generated from the spliced-only cDNA FASTA as in #68. To use spliced_intron_txome_k31 instead, you now need to specify it at the command line: nextflow run alevin-quant/run-alevin.nf --index_name spliced_intron_txome_k31. I have also updated the default t2gene.tsv files to use the Homo_sapiens.GRCh38.103.spliced.tx2gene.tsv file.

Additionally, while running the workflows with the snRNA-seq data, I noticed that they required more memory than before. The pre-mRNA index generated by Kallisto, in particular, is 45 GB and required ~100 GB of memory for the samples I used in test runs; with less memory, Kallisto is unable to load the index and write the output files. I am using 120 GB of memory here to be safe.

Alevin also appears to require slightly more memory than the 28 GB allotted by the cpus_8 label in the configuration file, but nowhere near the 100 GB that Kallisto requires.
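
For illustration, the kind of label-based memory setting described above might look like the following nextflow.config fragment. This is only a sketch: the cpus_8 label and its 28 GB allotment come from the description, while `cpus = 8` is inferred from the label name and the `high_mem` label is a hypothetical name, not something in this repository.

```groovy
// nextflow.config fragment (sketch, not the repository's actual file)
process {
  // Existing label mentioned in the description: 28 GB of memory.
  withLabel: cpus_8 {
    cpus = 8          // assumption, inferred from the label name
    memory = 28.GB
  }
  // Hypothetical label for the Kallisto pre-mRNA index (~100 GB needed,
  // 120 GB requested to leave some headroom).
  withLabel: high_mem {
    memory = 120.GB
  }
}
```

A process would opt in with `label 'high_mem'` in its definition. Because the label is a fixed string, it applies to every task of that process regardless of sample type.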

I spent some time trying to assign labels based on an input variable so that the higher memory requirement would apply only to Kallisto with snRNA-seq samples, but it appears that there is no way to use input variables in label assignments: https://github.com/nextflow-io/nextflow/issues/894. I did find one potential workaround, which is to include a second configuration file based on an input parameter, but after some back and forth with that, I landed on leaving it as is for now.
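
A different technique from the two mentioned above, and not something this PR does, is a dynamic `memory` directive. Nextflow evaluates a directive written as a closure once per task, so it can read that task's input values even though `label` cannot. The sketch below assumes DSL2 syntax; the process name, the `seq_unit` input, and its `'nucleus'` value are all hypothetical.

```groovy
// main.nf fragment (sketch; all names here are hypothetical)
process kallisto_quant_sketch {
  // Closure form: evaluated when each task is submitted, so it can
  // use the seq_unit input value declared below.
  memory { seq_unit == 'nucleus' ? 120.GB : 28.GB }

  input:
    tuple val(sample_id), val(seq_unit), path(reads)

  script:
  """
  echo "quantifying ${sample_id} (${seq_unit})"
  """
}
```

The sketch deliberately carries no label: as far as I understand Nextflow's precedence rules, a `withLabel` selector in the configuration file that also sets `memory` would take priority over a directive in the process definition, so the two should not both set it.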

I am leaving this in draft for now, as I still need to change the references in alevin-fry and ensure that the workflow runs as expected.