bluegenes opened 5 years ago
Not sure if this is the best way. For combining files, I've written the interleaved pairs first, then used a streaming command with a wildcard, e.g.:
interleave-reads.py {}{} {}{} | gzip > {}{}.interleaved.fq.gz
(zcat {}*.interleaved.fq.gz && zcat {}{}.orphans.fq.gz) | \\
(trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o - -M 4e9 --diginorm --diginorm-coverage=20) | \\
(extract-paired-reads.py --gzip -p {}{}.paired.gz -s {}{}.single.gz) > {}{}.diginorm.log
rm -rf {}{}*.interleaved.fq.gz
I think that's what we do now? Or do you mean all samples should be combined prior to kmer trimming? https://github.com/dib-lab/eelpond/blob/master/rules/khmer/khmer.rule
k-mer trimming and diginorm should be done only once, on all of the *.interleaved reads, so that coverage is estimated across all samples rather than per pair. The command right now takes the output of each interleave-paired-reads step and runs trim-low-abund and diginorm multiple times, e.g. 10 times rather than just once.
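For illustration, a rough sketch of what that single pass could look like (directory and output names here are placeholders, not the actual eelpond rule):

# run trim-low-abund/diginorm exactly once, over every sample's interleaved + orphan reads
zcat outdir/*.interleaved.fq.gz outdir/*.orphans.fq.gz | \
(trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o - -M 4e9 --diginorm --diginorm-coverage=20) | \
(extract-paired-reads.py --gzip -p outdir/all.paired.gz -s outdir/all.single.gz) > outdir/diginorm.log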
+1, but I have definitely run into memory issues before with large datasets.
@ctb do you have any suggestions for ways to avoid writing (and then deleting) a bunch of intermediate files during the interleave-reads step? You mentioned potentially using named pipes? Also, have you run into memory issues, or is this something I shouldn't worry about so much?
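One way this could look with named pipes (just a sketch with made-up sample names, not what eelpond does now):

# one FIFO per pair, so interleave-reads.py never writes a real intermediate file
mkfifo sampleA.fifo sampleB.fifo
interleave-reads.py sampleA_R1.fq.gz sampleA_R2.fq.gz -o sampleA.fifo &
interleave-reads.py sampleB_R1.fq.gz sampleB_R2.fq.gz -o sampleB.fifo &

# read the FIFOs in order and stream everything through trim/diginorm/extract once
cat sampleA.fifo sampleB.fifo | \
(trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o - -M 4e9 --diginorm --diginorm-coverage=20) | \
(extract-paired-reads.py --gzip -p all.paired.gz -s all.single.gz) > /dev/null

rm sampleA.fifo sampleB.fifo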
It strikes me that we have two different needs for kmer-trimmed reads:
- sample-specific kmer-trimmed reads for sourmash
- single-file kmer-trimmed (& optionally diginormed) reads for assembly.

If sourmash is being run anyway, would we gain anything (e.g. reduced memory utilization) by first conducting file-specific (non-diginorm) kmer trimming (saving files to use with sourmash), then combining all files and conducting kmer trimming + diginorm on all data together (saving files for the assembler)?
On Sun, Feb 03, 2019 at 06:28:47PM -0800, Tessa Pierce wrote:
> It strikes me that we have two different needs for kmer-trimmed reads:
> - sample-specific kmer-trimmed reads for sourmash
> - single-file kmer-trimmed (& optionally diginormed) reads for assembly.
> If sourmash is being run anyway, would we gain anything (e.g. reduced memory utilization) by first conducting file-specific (non-diginorm) kmer trimming first (save files to use with sourmash), then combining all files and conduct kmer-trimming + diginorm on all data together (save files for assembler)?
hmm, memory utilization should not be a problem.
but yes, things can run in parallel as you say, which in some (many?) circumstances may be faster.
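A rough sketch of that two-stage idea, with made-up sample names and the same trim-low-abund parameters as the commands above (not an existing eelpond rule):

# stage 1: per-sample k-mer trimming, no diginorm; these per-sample files feed sourmash
for s in sampleA sampleB; do
    interleave-reads.py ${s}_R1.fq.gz ${s}_R2.fq.gz | \
    trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o ${s}.abundtrim.fq -M 4e9
done

# stage 2: combine the already-trimmed samples and run trimming + diginorm once, for the assembler
cat sampleA.abundtrim.fq sampleB.abundtrim.fq | \
(trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o - -M 4e9 --diginorm --diginorm-coverage=20) | \
(extract-paired-reads.py --gzip -p all.paired.gz -s all.single.gz) > /dev/null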
Suggested command for streaming interleave -> trim/diginorm -> extract. (found in my notes from 2018 with help from @standage :)
stream_diginorm_command = """(paste \\
<(cat {{{}}} | paste - - - - ) \\
<(cat {{{}}} | paste - - - - ) \\
| tr '\\t' '\\n' ) | \\
(trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o - -M 4e9 --diginorm --diginorm-coverage=20) | \\
(extract-paired-reads.py --gzip -p {}{}.paired.gz -s {}{}.single.gz) > /dev/null
""".format(filestring_1P,filestring_2P,new_dir, species, new_dir, species)
k-mer trimming and diginorm should happen on the full `sample`, not on each `sample-unit` pair of files. We now have a function to do this (currently in the salmon rule) that can be moved forward.

Questions:

1. There is some benefit to allowing users to fastqc each file separately to see quality. By default, should we:
   A. combine units at the very beginning of the workflow (`get_fq`) and only work at the `sample` level? or
   B. combine units after trimmomatic, allowing pre-trimming fastqc at the `sample-unit` level, but nothing else?
2. Is it worth it to enable pipelines at the `sample-unit` level, rather than just `sample`? We could provide an `eelpond_params` parameter for each program that we populate via a single flag in `run_eelpond`, or a parameter in the main configfile. Otherwise, just move everything to the `sample` level.

cc: @ctb @ljcohen