dib-lab / elvers

(formerly eelpond) an automated RNA-Seq workflow system
https://dib-lab.github.io/elvers/
Other
28 stars 3 forks source link

combine units prior to khmer #74

Open bluegenes opened 5 years ago

bluegenes commented 5 years ago

k-mer trimming and diginorm should happen on the full sample, not each sample-unit pair of files. We now have a function to do this (currently in the salmon rule) that can now be moved forward.

Questions:

  1. There is some benefit to allowing users to fastqc each file separately to see quality. By default, should we: A. combine units at the very beginning of the workflow (get_fq) and only work at the sample level? B. combine units after trimmomatic, allowing pre-trimming fastqc at the sample-unit level, but nothing else.

  2. Is it worth it to enable pipelines on the sample-unit level, rather than just sample? We could provide an eelpond_params parameter for each program that we populate via a single flag in run_eelpond, or parameter in the main configfile. Otherwise, just move everything to the sample level.

cc: @ctb @ljcohen

johnsolk commented 5 years ago

Not sure if this is the best way. For combining files, I've written interleaved pairs first then use streaming command with wildcard, e.g.

interleave-reads.py {}{} {}{} | gzip > {}{}.interleaved.fq.gz
(zcat {}*.interleaved.fq.gz) && zcat {}{}.orphans.fq.gz | \\
(trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o - -M 4e9 --diginorm --diginorm-coverage=20) | \\
(extract-paired-reads.py --gzip -p {}{}.paired.gz -s {}{}.single.gz) > {}{}.diginorm.log
rm -rf {}{}*.interleaved.fq.gz
bluegenes commented 5 years ago

I think that's what we do now? Or do you mean all samples should be combined prior to kmer trimming? https://github.com/dib-lab/eelpond/blob/master/rules/khmer/khmer.rule

johnsolk commented 5 years ago

kmer trimming and diginorm should be done only once all *.interleaved reads. so coverage is estimated across all samples rather than on each pair. the command now is taking output from each interleave paired reads step and running trim-low-abund and diginorm multiple times, e.g. 10 times rather than only once.

bluegenes commented 5 years ago

+1, but have definitely run into memory issues before with large datasets.

bluegenes commented 5 years ago

@ctb do you have any suggestions for ways to avoid writing (and then deleting) a bunch of intermediate files during the interleave-reads step? You mentioned potentially named pipes? Also have you run into memory issues, or is this something I shouldn't worry about so much?

bluegenes commented 5 years ago

It strikes me that we have two different needs for kmer-trimmed reads:

  1. sample-specific kmer-trimmed reads for sourmash
  2. single-file kmer-trimmed (& optionally diginormed) reads for assembly.

If sourmash is being run anyway, would we gain anything (e.g. reduced memory utilization) by first conducting file-specific (non-diginorm) kmer trimming first (save files to use with sourmash), then combining all files and conduct kmer-trimming + diginorm on all data together (save files for assembler)?

ctb commented 5 years ago

On Sun, Feb 03, 2019 at 06:28:47PM -0800, Tessa Pierce wrote:

It strikes me that we have two different needs for kmer-trimmed reads:

  1. sample-specific kmer-trimmed reads for sourmash
  2. single-file kmer-trimmed (& optionally diginormed) reads for assembly.

If sourmash is being run anyway, would we gain anything (e.g. reduced memory utilization) by first conducting file-specific (non-diginorm) kmer trimming first (save files to use with sourmash), then combining all files and conduct kmer-trimming + diginorm on all data together (save files for assembler)?

hmm memory utilization should not be a problem.

but yes, things can run in parallel as you say, which in some (many?) circumstances may be faster.

johnsolk commented 5 years ago

Suggested command for streaming interleave -> trim/diginorm -> extract. (found in my notes from 2018 with help from @standage :)

stream_diginorm_command = """(paste \\
<(cat {{{}}} | paste - - - - ) \\
<(cat {{{}}} | paste - - - - ) \\
| tr '\\t' '\\n'  ) | \\
(trim-low-abund.py -V -k 20 -Z 18 -C 2 - -o - -M 4e9 --diginorm --diginorm-coverage=20) | \\
(extract-paired-reads.py --gzip -p {}{}.paired.gz -s {}{}.single.gz) > /dev/null
""".format(filestring_1P,filestring_2P,new_dir, species, new_dir, species)