FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data
BSD 3-Clause "New" or "Revised" License
190 stars 75 forks source link

Does OptiType care about read groups? #96

Closed Stikus closed 5 years ago

Stikus commented 5 years ago

As said here - OptiType cares about read pairs. But what about read groups? Here is our situation - we have WGS-data with several lines - e.g. not pair of fastq.gz but 8 files with each 2 paired. Can you propose solution how to use OptiType on them? Do we need to align then first (with Razers3 or Yara or even BWA Mem - we want to try different aligners coz Razers3 crashes on big genome files with unpredictable result) and merge BAM's after? Or we can simply summarize fastq's into 2 big paired fastq files and feed them to OptiType? What is best practice?

andras86 commented 5 years ago

If you have the individual, smaller bam files already, I would samtools cat those of the first ends, then the second ends, and call OptiType on these two concatenated bam files. What you can also do is call Yara with <(zcat lane1_1.fq.gz lane2_1.fq.gz lane3_1.fq.gz lane4_1.fq.gz) as input, instead of creating an actual big file on your disk. I'd probably do this for future samples.

Stikus commented 5 years ago

@andras86 Thank you for fast answer. If I understand you correctly - we should align first and second ends separately with Yara and start OptiType with 2 BAMs?

You propose Yara instead of Razers3 as aligner because we have problems with Razer? Or Yara is better (according to developer page - yes, it is better, but my tests for aligners still in progress)?

Check out our newer, faster read aligner Yara

And here:

we recommend to use the yara read mapper instead.

And if Yara is better - is there any possibility for replacing Razers3 with Yara in OptiType itself? I find your workaround described here - it is still best way to use Yara for OptiType or something new was proposed?

andras86 commented 5 years ago

Yes, exactly, align end1 and end2 separately, and call OptiType on the two resulting bam files.

Yara used to have a few niggly issues with full sensitivity, which have been rectified recently, so they should produce equivalent results under all circumstances now, and there's no reason to stick with RazerS3 anymore. You're right Yara would be more suitable to be the default rather than the workaround read mapper now, I'll keep you posted.

Stikus commented 5 years ago

@andras86 I'm trying to implement Yara in my script but I have problem with your executive command yara_mapper -e 3 -f bam -u -os /path/to/ref path/to/reads_1.fastq.gz | samtools view -h -F 4 -b1 My Yara help doesn't have such keys (-u && -os):

Yara help

``` yara_mapper - Yara Mapper ========================= SYNOPSIS yara_mapper [OPTIONS] yara_mapper [OPTIONS] DESCRIPTION Yara - Yet Another Read Aligner. See http://www.seqan.de/projects/yara for more information. (c) Copyright 2011-2014 by Enrico Siragusa. (c) Copyright 2013 by NVIDIA Corporation. REQUIRED ARGUMENTS REFERENCE_INDEX_PREFIX INPUT_PREFIX An indexed reference genome. READS_FILE List of INPUT_FILE's Either one single-end or two paired-end / mate-pair read files. Valid filetypes are: .sam[.*], .raw[.*], .gbk[.*], .frn[.*], .fq[.*], .fna[.*], .ffn[.*], .fastq[.*], .fasta[.*], .faa[.*], .fa[.*], .embl[.*], and .bam, where * is any of the following extensions: gz, bz2, and bgzf for transparent (de)compression. OPTIONS -h, --help Display the help message. --version-check BOOL Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO. Default: 1. --version Display version information. -v, --verbose Displays global statistics. -vv, --very-verbose Displays extensive statistics for each batch of reads. Output Options: -o, --output-file OUTPUT_FILE Specify an output file. Default: write the file to standard output. Valid filetypes are: .sam[.*] and .bam, where * is any of the following extensions: gz, bz2, and bgzf for transparent (de)compression. -f, --output-format STRING Specify an output format. Note: when specifying the option --output-file, the output format is taken from the filename extension. One of bam, sam, sam.bgzf, sam.gz, and sam.bz2. Default: sam. -rg, --read-group STRING Specify a read group for all records in the SAM/BAM file. Default: none. -sa, --secondary-alignments STRING Specify whether to output secondary alignments in the XA tag of the primary alignment, as separate secondary records, or to omit them. One of tag, record, and omit. Default: tag. -ra, --rabema-alignments Output alignments compatible with the Read Alignment BEnchMArk (RABEMA). Mapping Options: -e, --error-rate INTEGER Consider alignments within this percentual number of errors. Increase this threshold to increase the number of mapped reads. Decrease this threshold to decrease the runtime. In range [0..10]. Default: 5. -s, --strata-rate INTEGER Consider suboptimal alignments within this percentual number of errors from the optimal alignment. Increase this threshold to increase the number of alternative alignments at the expense of runtime. In range [0..10]. Default: 0. -y, --sensitivity STRING Sensitivity with respect to edit distance. Full sensitivity guarantees to find all considered alignments but increases runtime, low sensitivity decreases runtime by breaking such guarantee. One of low, high, and full. Default: high. Paired-End Mapping Options: -ll, --library-length INTEGER Expected library length. Default: autodetected. In range [1..inf]. -ld, --library-deviation INTEGER Deviation from the expected library length. Default: autodetected. In range [1..inf]. -i, --indel-rate INTEGER Rescue unaligned ends within this percentual number of indels. In range [0..50]. Default: 25. -ni, --no-indels Turn off the rescue of unaligned ends containing indels. Performance Options: -t, --threads INTEGER Specify the number of threads to use. In range [1..2048]. Default: 14. VERSION Last update: 2018-10-18_17:13:22_+0200 yara_mapper version: 0.9.11 [55b8b1f] SeqAn version: 2.4.0 ```

Can you tell me what they mean or replacement for them?

andras86 commented 5 years ago

What is your Yara version? yara_mapper --version

Stikus commented 5 years ago

Bottom of help:

VERSION
    Last update: 2018-10-18_17:13:22_+0200
    yara_mapper version: 0.9.11 [55b8b1f]
    SeqAn version: 2.4.0

Should I use more recent version? This is last from master. Here is my installation from Dockerfile:

cd "$SOFT" \
   && git clone https://github.com/seqan/seqan.git \
   && mkdir -p "$SOFT/yara-build" \
   && cd "$SOFT/yara-build" \
   && cmake "$SOFT/seqan" -DSEQAN_BUILD_SYSTEM=APP:yara -DCMAKE_CXX_COMPILER=/usr/bin/g++-4.9 \
   && make -j"$(($(nproc)+1))" all \
   && mkdir -p "$SOFT/yara-0.9.11/bin" \
   && mv -t "$SOFT/yara-0.9.11/bin" "$SOFT/yara-build/bin/yara_indexer" "$SOFT/yara-build/bin/yara_mapper" \
   && cd "$SOFT" \
   && rm -rf "$SOFT/seqan" \
   && rm -rf "$SOFT/yara-build"
andras86 commented 5 years ago

In that case, try yara_mapper -e 3 -y full -t 4 /path/to/ref /path/to/reads_1.fq.gz | samtools view -h -F 4 -b1 - instead. I added -t 4 because Yara uses as many threads as you have cores by default, and you may not find that desirable, so it's best to limit the number of threads. Runtime doesn't decrease proportionally with the number of threads anyway.

Stikus commented 5 years ago

New question - what does -b1 flag mean for samtools view? Cannot find it in help again :)

Samtools view help

``` Usage: samtools view [options] || [region ...] Options: -b output BAM -C output CRAM (requires -T) -1 use fast BAM compression (implies -b) -u uncompressed BAM output (implies -b) -h include header in SAM output -H print SAM header only (no alignments) -c print only the count of matching records -o FILE output file name [stdout] -U FILE output reads not selected by filters to FILE [null] -t FILE FILE listing reference names and lengths (see long help) [null] -L FILE only include reads overlapping this BED FILE [null] -r STR only include reads in read group STR [null] -R FILE only include reads with read group listed in FILE [null] -q INT only include reads with mapping quality >= INT [0] -l STR only include reads in library STR [null] -m INT only include reads with number of CIGAR operations consuming query sequence >= INT [0] -f INT only include reads with all of the FLAGs in INT present [0] -F INT only include reads with none of the FLAGS in INT present [0] -G INT only EXCLUDE reads with all of the FLAGs in INT present [0] -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the fraction of templates/read pairs to keep; INT part sets seed) -M use the multi-region iterator (increases the speed, removes duplicates and outputs the reads as they are ordered in the file) -x STR read tag to strip (repeatable) [null] -B collapse the backward CIGAR operation -? print long help, including note about region specification -S ignored (input format is auto-detected) --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE -T, --reference FILE Reference sequence FASTA FILE [null] -@, --threads INT Number of additional threads to use [0] ```

samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.

Big thanks for your answers and help :)

Upd: found -b1 description as -b and -1 combined.

Stikus commented 5 years ago

Thank you for help - looks like everything working fine. This is my executive commands: $YARA_MAPPER -e 3 -y full -t $YARACPUS $YARA_DNA_INDEX <(unpigz -c ${yaraNorm1Input[*]}) | $SAMTOOLS view -@ $SAMVIEWCPUS -F 4 -b1 -o $yaraMappedNorm1BAM

python3 $OPTITYPE -i $yaraMappedNorm1BAM $yaraMappedNorm2BAM --dna -p ${sampleName}_normal -o $optitypeOutDir

Looking forward for your Yara implementation as default mapper for OptiType.