GabeAl / NINJA-OPS

Ninja Is Not Just Another OTU Picking Solution
ISC License

Format of the input file #14

Closed jcgrenier closed 7 years ago

jcgrenier commented 7 years ago

Hi,

I'm currently testing NINJA-OPS and comparing it to the QIIME pipeline. What I've done so far, and what makes a lot of sense to me, is to preprocess my input files with various steps of the QIIME pipeline, in order to go from my original trimmed fastq files (adapter-trimmed with cutadapt) coming from a HiSeq 2000 to fasta files containing sequences without chimeras.

The main difference is that with QIIME, I start by merging the forward and reverse reads (I have paired-end data). I then trim those reads to a particular length (240) and finally remove the chimeric sequences found by UCHIME. After that, I put all the fasta files into a single file and label the reads using add_qiime_labels.py.

The input files I'm using with NINJA-OPS are mainly the cutadapt-trimmed fastq files, which I've converted to fasta and relabelled the same way as in the other approach.

For the QIIME-prefiltered files, I start with 86055052 reads, and end with 0 chimeric alignments and 61354200 expanded reads. Another detail: I have 1152 unique samples with a max reads of 7483261.

For the other approach, I start with 97138460 reads (so, more), 0 chimeric alignments and 6343778 expanded reads, for 1152 unique samples with 34237427 max reads.

I find it weird that this max reads value is so different. What does it refer to? I've looked in the merged fasta file to make sure I wasn't making a mistake, but my file seems OK:

grep "^>" combined_seqs.fna | sed 's/_/\t/' | cut -f 1 | sort | uniq -c > seqPerSample.txt

I'm getting a max of 261000 reads for one of my samples, not 34M or 7M.
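For reference, the same per-sample count can be done in Python (a minimal sketch; it assumes QIIME-style headers of the form `>SampleID_N`, with the sample ID being everything before the last underscore of the first token):

```python
from collections import Counter

def reads_per_sample(fasta_path):
    """Count reads per sample from QIIME-style fasta headers (>SampleID_N ...)."""
    counts = Counter()
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                # first whitespace-delimited token after '>', e.g. "Sample1_42"
                label = line[1:].split()[0]
                # sample ID is everything before the last underscore
                sample = label.rsplit("_", 1)[0]
                counts[sample] += 1
    return counts
```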

Here are the command lines I used. I tried two sets of parameters, the latter giving me fewer OTUs per sample:

"ninja_filter_linux" "5209_combined.R1/combined_seqs.fna" PE "5209_combined.R2/combined_seqs.fna" "combined_fromFasta_trimmed/ninja" D 1

"ninja_filter_linux" "5209_combined.R1/combined_seqs.fna" PE "5209_combined.R2/combined_seqs.fna" "combined_fromFasta_trimmed.max.trim/ninja" "150,125" D 1

With the QIIME-prefiltered fasta file:

"ninja_filter_linux" "combined_seqs.fna" "qiime/ninja/combined/ninja" D 1

Thanks for helping me with this!

JC

GabeAl commented 7 years ago

Hi @jcgrenier ,

Thanks for using the software! The input format is just linearized fasta with a trailing newline at the end of the file. Each line starting from the first should alternate between header and sequence, with no multi-line sequences or blanks between sequence and subsequent header. Given your experience with QIIME's final formatting, I think you're set!
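If your fasta isn't already in this shape, a small script can linearize it (a sketch, assuming the alternating header/sequence layout described above; function and file names are illustrative):

```python
def linearize_fasta(src_path, dst_path):
    """Collapse multi-line sequences so header and sequence lines strictly
    alternate, dropping blank lines and ending with a trailing newline."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        seq = []
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                if seq:  # flush the previous record's sequence
                    dst.write("".join(seq) + "\n")
                    seq = []
                dst.write(line + "\n")
            else:
                seq.append(line)
        if seq:  # flush the final record
            dst.write("".join(seq) + "\n")
```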

You bring up a good point about the max reads -- unfortunately NINJA-OPS never outgrew this particular debugging message, which doesn't actually tell the user anything useful downstream. The number is an upper bound, computed from the previously generated intermediates before the file is actually read and its length evaluated, on the maximum possible number of unique reads NINJA might have to contend with; it's used to allocate data structures large enough to handle them. It usually sits comfortably between the number of unique reads (the length of the intermediate file "ninja_filtered.fa") and the number of such reads that were successfully mapped.

As for the QC steps, we have our NINJA-SHI7 pipeline (or the extended python version, NINJA-shi7en) which does similar QC steps all in one. It starts by gently trimming the appropriate adapters, merging paired ends (marker genes with overlaps only!) according to the expected length of your marker (say, 292bp for V4 16S), trimming off just the far ends likely to be remaining adapter/barcode contaminants, converting to fasta, and merging in QIIME-compatible format. It will also soon include an oligo splitting module (which I wrote previously as a standalone component) that takes barcoded reads (including ambiguities and staggered ends!) and splits a single fastq into its component samples based on their respective dual-index barcodes (with error correction!) in about the time it takes to read the data from disk. So combining SHI7 with OPS can literally take you from machine to biom file.

https://github.com/knights-lab/shi7en

We would highly recommend sticking to your approach of merging reads whenever possible, since figuring out post-hoc where pairs could have aligned is a much slower and stickier problem than the relatively straightforward stitching process.

We will remove the misleading internal 'max' stat from the non-debugging output in an upcoming release! In the interim, feel free to look at any of the intermediate files produced -- their format is relatively straightforward: the _filt.fa contains the unique sequences with headers as indices, and the .db file contains all the samples plus a mapping of each unique read to its indexed sample (the first entry after the sample IDs corresponds to the first sequence in the _filt.fa, and so forth, with each line indicating how many copies of that read were found in each sample).
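Reading the _filt.fa back is a few lines (a sketch based on the description above, assuming integer headers like `>0`, `>1` with one sequence line each; not validated against real output):

```python
def load_filtered_fasta(path):
    """Map each unique-read index to its sequence from a ninja _filt.fa."""
    seqs = {}
    header = None
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # header is the read's index into the .db mapping
                header = int(line[1:].split()[0])
            elif header is not None:
                seqs[header] = line
                header = None
    return seqs
```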

Cheers, Gabe

jcgrenier commented 7 years ago

Hi @GabeAl ,

Thanks for your quick reply, and for letting me know about that new tool! I will test it to see if I can get closer to my other approach. Thanks too for the advice about merging the reads; that's probably one of the main differences between my two pipelines.

Have a great day!

JC