I started to work on minor improvements and small fixes on a dev branch. The biggest motivation is to reduce the footprint for large input files ( i.e. not RAM or memory, but disk space used, number of files generated, etc. ). The documentation has also been updated, where relevant.
Immediate improvements/suggestions
[X] Do not uncompress fastq.gz as input for minimap2, and pipe output to samtools (BAM).
[X] Control the number of cores/threads per rule ( e.g. to pass as argument to minimap2, or singleCellPipe). Up to now, threads was not actually used in the workflow, e.g. we could have minimap2 -t 1 submitted with sbatch -c 36, i.e. rules used 1 core by default. The problem is that the cluster_config is not available for shell commands within snakemake rules. I tried several fixes but couldn't find a way to make threads from cluster.json available within rules. I decided to temporarily set threads in selected rules in Snakefile. For now, we leave it as is. See also next point.
[ ] Cluster configuration has been deprecated by the introduction of Profiles (but as far as I'm aware this does not seem to solve the above problem).
Longer-term improvements/suggestions
[X] The repo should include a .gitignore to ignore all build/compilation-related and tmp files to allow a smooth git workflow.
[ ] Upgrade to snakemake >= 5.2.0. There are various reasons to do so, e.g. I think the current snakemake version does not support basic pathlib compatibility, but this would make the code much cleaner. See also next point, and above regarding profiles.
[ ] For all temporary files, create a tmp directory under the directory specified as tmp_dir in the config file (to avoid writing tmp files under the main repository). This directory should be deleted on completion, whether successful or not. From snakemake 5.2.0, we could use directories as outputs and mark files as temporary.
[X] Upgrade minimap2. We currently have 2.17-r941, and there were many bug fixes since... besides we could use two I/O threads during mapping since we now pipe the output (that would however increase peak RAM), etc.. We could also add the option to load the entire reference into memory. Upgraded to 2.21-r1071. Using option -2.
[X] General documentation should be improved for readability, etc. Some parameters would benefit from a better description, define default values and/or suggested values for standard use-case scenarios, e.g. polyTlength, numSimReads, etc.
[X] The test dataset issue has been partly addressed, and we are now able to do a quick test run upon install. However the reproducibility issue has not yet been addressed, but this does not seem straightforward. I would nevertheless suggest a few things: (1) manually seed the random number generator for simulator.py, (2) "seed" shuf in build_genome. The UMIs for blacklist barcodes are the only randomly generated sequences. I leave it as is for now.
Questions
[ ] There are a lot of sort | uniq, are all these necessary in the workflow? In particular, was there any reason SAM/BAM files were sorted with sort -snk3 -k4 | uniq? I changed that to just a plain piped samtools sort.
[ ] real.label and real.umi have a different column ordering... actually do we need to use Nanopore.geneumi with option 2?
[ ] It is still not entirely clear to me how UMIs are handled in the case of the probabilistic model (option 1). I understand that UMI alignments are not used, but yet, umilength is still used to simulate reads, and also as input for Flexbar ( singleCellPipe parameter -ul ). A file real_umi.fasta is generated, but if using option 1 (probabilistic model), this file is actually not used.
And what if the Nanopore data has actually no UMIs? Can scNapBar deal with this kind of data? Should we simulate them or not? And what does the real_umi.fasta contains in this case?
This is a follow-up from https://github.com/dieterich-lab/single-cell-nanopore/issues/11
I started to work on minor improvements and small fixes on a dev branch. The biggest motivation is to reduce the footprint for large input files ( i.e. not RAM or memory, but disk space used, number of files generated, etc. ). The documentation has also been updated, where relevant.
Immediate improvements/suggestions
[X] Do not uncompress fastq.gz as input for minimap2, and pipe output to samtools (BAM).
[X] Updated NanoSim to next release (waiting for next release). Work in progress, see Could not retrieve index file for primary.bam #130 and Breaking changes in format output from 3.0.0-beta Also this release will likely not be available via conda. Done, now using NanoSim 3.0.2 from bioconda.
[X] Control the number of cores/threads per rule ( e.g. to pass as argument to minimap2, or singleCellPipe). Up to now,
threads
was not actually used in the workflow, e.g. we could haveminimap2 -t 1
submitted withsbatch -c 36
, i.e. rules used 1 core by default. The problem is that thecluster_config
is not available for shell commands within snakemake rules. I tried several fixes but couldn't find a way to makethreads
fromcluster.json
available within rules. I decided to temporarily setthreads
in selected rules inSnakefile
. For now, we leave it as is. See also next point.[ ] Cluster configuration has been deprecated by the introduction of Profiles (but as far as I'm aware this does not seem to solve the above problem).
Longer-term improvements/suggestions
[X] The repo should include a .gitignore to ignore all build/compilation-related and tmp files to allow a smooth git workflow.
[ ] Upgrade to
snakemake >= 5.2.0
. There are various reasons to do so, e.g. I think the current snakemake version does not support basic pathlib compatibility, but this would make the code much cleaner. See also next point, and above regarding profiles.[ ] For all temporary files, create a tmp directory under the directory specified as
tmp_dir
in the config file (to avoid writing tmp files under the main repository). This directory should be deleted on completion, whether successful or not. From snakemake 5.2.0, we could use directories as outputs and mark files as temporary.[X] Upgrade minimap2. We currently have
2.17-r941
, and there were many bug fixes since... besides we could use two I/O threads during mapping since we now pipe the output (that would however increase peak RAM), etc.. We could also add the option to load the entire reference into memory. Upgraded to 2.21-r1071. Using option-2
.[X] General documentation should be improved for readability, etc. Some parameters would benefit from a better description, define default values and/or suggested values for standard use-case scenarios, e.g. polyTlength, numSimReads, etc.
[X] The test dataset issue has been partly addressed, and we are now able to do a quick test run upon install. However the reproducibility issue has not yet been addressed, but this does not seem straightforward. I would nevertheless suggest a few things: (1) manually seed the random number generator for
simulator.py
, (2) "seed"shuf
inbuild_genome
. The UMIs for blacklist barcodes are the only randomly generated sequences. I leave it as is for now.Questions
[ ] There are a lot of
sort | uniq
, are all these necessary in the workflow? In particular, was there any reason SAM/BAM files were sorted withsort -snk3 -k4 | uniq
? I changed that to just a plain pipedsamtools sort
.[ ] real.label and real.umi have a different column ordering... actually do we need to use Nanopore.geneumi with option 2?
[ ] It is still not entirely clear to me how UMIs are handled in the case of the probabilistic model (option 1). I understand that UMI alignments are not used, but yet,
umilength
is still used to simulate reads, and also as input for Flexbar ( singleCellPipe parameter-ul
). A file real_umi.fasta is generated, but if using option 1 (probabilistic model), this file is actually not used. And what if the Nanopore data has actually no UMIs? Can scNapBar deal with this kind of data? Should we simulate them or not? And what does the real_umi.fasta contains in this case?