NCBI-Hackathons / ViruSpy

A pipeline for viral identification from metagenomic samples
MIT License
26 stars 4 forks source link

A potential reason for short contigs #3

Open DCGenomics opened 7 years ago

DCGenomics commented 7 years ago

We may need to do an after-magicBLAST cleanup, where we extract the reads and pairs back out of the original SRA. Duplications and unpaired reads may be throwing off the aligners. Unpaireds may be interesting, but we should be able to pull them back out with BUD.

(vdb-dump) works well for this.

To wit, with the results from ViruSpy:

grep SRR1161435.123159 SRR1161435_putative_viral_reads.fastq @SRR1161435.123159.1/1 @SRR1161435.123159.2/2 @SRR1161435.123159.2/2 @SRR1161435.123159.2/2 @SRR1161435.123159.2/2

grep SRR1553459.9969 SRR1553459_putative_viral_reads.fastq @SRR1553459.9969.1/1 @SRR1553459.9969.2/2 @SRR1553459.9969.1/1 @SRR1553459.9969.2/2 @SRR1553459.9969.1/1 @SRR1553459.9969.2/2 @SRR1553459.9969.1/1 @SRR1553459.9969.2/2 @SRR1553459.9969.1/1 @SRR1553459.9969.2/2 @SRR1553459.9969.1/1 @SRR1553459.9969.2/2

grep SRR5675673.32209 SRR5675673_putative_viral_reads.fastq @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2 @SRR5675673.32209.1/1 @SRR5675673.32209.2/2

pcantalupo commented 7 years ago

It would be nice if MB only reported at most 1 valid alignment. To wit, I submitted an issue to MB repo.

As Greg wrote in the Issue trail above "Number of alignments is reported for each alignment in the NH tag in SAM output and in 18-th column of tabular output."

pcantalupo commented 7 years ago

Another reason could be this Issue: https://github.com/NCBI-Hackathons/ViruSpy/issues/4

Does anybody know what the version of MB was used (and what options) to obtain the reads for the Amazon assembly? Perhaps we assembled everything?!

DCGenomics commented 7 years ago

Heard it was the latest version, but we should start including version numbers in documentation!

On Thu, Oct 5, 2017 at 9:19 AM, Paul Cantalupo notifications@github.com wrote:

Another reason could be this Issue: #4 https://github.com/NCBI-Hackathons/ViruSpy/issues/4

Does anybody know what the version of MB was used (and what options) to obtain the reads for the Amazon assembly? Perhaps we assembled everything?!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NCBI-Hackathons/ViruSpy/issues/3#issuecomment-334462232, or mute the thread https://github.com/notifications/unsubscribe-auth/AFePtajP9KV7F3tkfodKNQnWKYALRfRFks5spNdlgaJpZM4PuDMy .

-- What have you done today to make the world a better place?

pcantalupo commented 7 years ago

If so, we assembled a ton of unaligned reads. I only recently added the '-no-unaligned' option

jkwaldman commented 7 years ago

Just checked - the version I used was 1.2, so it shouldn't be returning unaligned reads already.

@pcantalupo - The -no-unaligned option will cause an error if the user is using magicblast v1.2 since that isn't an available flag.

USAGE magicblast [-h] [-help] [-db database_name] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-gzo] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-perc_identity float_value] [-penalty penalty] [-lcase_masking] [-validate_seqs TF] [-infmt format] [-paired] [-query_mate infile] [-sra accession] [-parse_deflines TF] [-outfmt format] [-num_threads int_value] [-max_intron_length length] [-score num] [-splice TF] [-reftype type] [-limit_lookup TF] [-lookup_stride num] [-batch_size num] [-version]

pcantalupo commented 7 years ago

I updated the README https://github.com/NCBI-Hackathons/ViruSpy/commit/aa88e70dab7cd55555ba9ca9a8bbbb64b6ddfc5a with min version >=1.3 for MB

pcantalupo commented 7 years ago

@DCGenomics I'm working with the ebola SRA SRR1553459 right now. You report above there are duplicate reads in the output. I'm not sure how the duplicates happened. I tested both v1.4 and v1.5 of samtools fastq command and only one read is output for a read identifier even if that read is duplicated in the BAM file.

samtools has handy options (that are not yet incorporated into virusspy) for splitting the reads into a singleton and paired files: ~/local/opt/samtools-1.5/samtools fastq -s foo.single.fq -1 foo.paired1.fq -2 foo.paired2.fq -N BAMFILE. The -N option (only in v1.5) always append /1 and /2 to the read name.

Maybe a different version of samtools was used that produced the duplicates. Seems that each version of samtools does something different.

DCGenomics commented 7 years ago

Can you pass me your resulting reads from Ebola so I can pass them to the abyss and skesa teams?

I'll see if I can get some lfs space so we can do it in an above the board way.

Also, the whole pipeline is being refactored in python, and I optimistically hope that it'll be done by the end of this week. I'd also like to give pythonistas from this crew the opportunity to work on that (and there are still many stretch features to work out on this side).

From a scientific standpoint the objective is to build the most sensitive virus detection pipeline currently in existence, from an educational standpoint, to make it simple enough for undergrads to use it for surveillance, and from a professional development/communication standpoint, to send it to nar or genome research.

You folks are awesome!

On Oct 14, 2017 10:56, "Paul Cantalupo" notifications@github.com wrote:

@DCGenomics https://github.com/dcgenomics I'm working with the ebola SRA SRR1553459 right now. You report above there are duplicate reads in the output. I'm not sure how the duplicates happened. I tested both v1.4 and v1.5 of samtools fastq command and only one read is output for a read identifier even if that read is duplicated in the BAM file.

samtools has handy options (that are not yet incorporated into virusspy) for splitting the reads into a singleton and paired files: ~/local/opt/samtools-1.5/samtools fastq -s foo.single.fq -1 foo.paired1.fq -2 foo.paired2.fq -N BAMFILE. The -N option (only in v1.5) always append /1 and /2 to the read name.

Maybe a different version of samtools was used that produced the duplicates. Seems that each version of samtools does something different.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/NCBI-Hackathons/ViruSpy/issues/3#issuecomment-336640168, or mute the thread https://github.com/notifications/unsubscribe-auth/AFePtfPATWQjeyLCTQ9c2NGg08cGFEBkks5ssMuwgaJpZM4PuDMy .

pcantalupo commented 7 years ago

@DCGenomics i sent you files by email

jkwaldman commented 7 years ago

@pcantalupo What was the command you used with samtools fastq for v1.5 that dropped the duplicate reads? Trying to replicate (having to use magicblast v1.2, which may be affecting things) and having little success. Would also appreciate if you could email the no-duplicates fastq so I can start testing spades, whenever you have time.

On Samtools versions - right now we have this line in the magicblast_w_opts.sh file:

samtools fastq $out.bam -v 40 > $out_file

This is samtools v1.3 syntax - will produce an empty file if using v1.5. It'll need to be adjusted when we decide on the right set of flags to use.

pcantalupo commented 7 years ago

I don't get duplicates with the versions of samtools that I tested: v1.3, 1.4, 1.5. I'm using magicblast v1.3. Viruspy README says to use samtools v1.5 and MB v1.3

Here is an example bam file. It contains 11 alignments. Read10 has 3 alignments therefore 2 of them are duplicates. Read 4, 5, 14, 16 are singletons. Reads 12 and 18 have both the forward and reverse reads mapped.

$ samtools view small.bam | cut -f 1,2,12
SRR1553459.4    137 NH:i:1
SRR1553459.5    89  NH:i:1
SRR1553459.10   137 NH:i:3
SRR1553459.10   393 NH:i:3
SRR1553459.10   393 NH:i:3
SRR1553459.12   83  NH:i:1
SRR1553459.12   163 NH:i:1
SRR1553459.14   89  NH:i:1
SRR1553459.16   89  NH:i:1
SRR1553459.18   65  NH:i:1
SRR1553459.18   129 NH:i:1

Running samtools v1.3 and 1.4 outputs the singletons and paired reads with no duplicates.

$ ~/local/opt/samtools-1.3.1/samtools fastq -v 40 small.bam | grep ^@
@SRR1553459.4/2
@SRR1553459.5/1
@SRR1553459.10/2
@SRR1553459.12/1
@SRR1553459.12/2
@SRR1553459.14/1
@SRR1553459.16/1
@SRR1553459.18/1
@SRR1553459.18/2
$ ~/local/opt/samtools-1.4/samtools fastq -v 40 small.bam | grep ^@
@SRR1553459.4/2
@SRR1553459.5/1
@SRR1553459.10/2
@SRR1553459.12/1
@SRR1553459.12/2
@SRR1553459.14/1
@SRR1553459.16/1
@SRR1553459.18/1
@SRR1553459.18/2

But, the default samtools fastq output for v1.5 does not include singletons:

$ ~/local/opt/samtools-1.5/samtools fastq -v 40 small.bam | grep ^@
@SRR1553459.12/1
@SRR1553459.12/2
@SRR1553459.18/1
@SRR1553459.18/2

However, you can obtain the singletons, and paired reads in separate files with the following samtools command for v1.5:

samtools fastq -s singleton.fq -1 R1.fq -2 R2.fq -N -v 40 temp_out.bam

-N will append append /1 and /2 to the read name as appropriate.

pcantalupo commented 7 years ago

@jkwaldman Based on my samtools testing above, no output with samtools v1.5 leads me to believe that you don't have any paired alignments in your bamfile when using just samtools fastq without the -s option.

jkwaldman commented 7 years ago

@pcantalupo If you execute this line of code (from magicblast_w_opts.sh) with Samtools 1.5, it produces the proper fastq file?

samtools fastq $out.bam -v 40 > $out_file

--

MB 1.2 should be doing the equivalent of the no_unaligned option anyway, so that shouldn't be the deciding factor.

pcantalupo commented 7 years ago

@jkwaldman MB 1.2 won't work with current version of magicblast_w_opts because of the -no_unaligned option:

Error: Unknown argument: "no_unaligned"
Error:  (CArgException::eInvalidArg) Unknown argument: "no_unaligned"

To answer your question, Yes it does. Here is what I did after logging into HTC:

$ module unload magic-blast/1.2.0

$ magicblast -version
magicblast: 1.3.0
 Package: magicblast 1.3.0, build Sep 13 2017 15:43:18

$ module load samtools/1.5-gcc5.2.0

$ samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.5 (using htslib 1.5)

$ cd local/lib/viruspy/

$ git pull
Already up-to-date.

$ git log | head
commit 70ac9c445481f65923ad21077cefc7216d690819
Author: jkwaldman <jkwaldman1@gmail.com>
Date:   Sun Oct 15 17:43:45 2017 -0400

$ magicblast_w_opts.sh -s SRR1553459
/ihome/jpipas/pgc92/local/lib/viruspy/scripts/magicblast_w_opts.sh: using Viral RefSeq genomic sequences from NCBI as default viral database
Downloading Viral RefSeq genomic sequences from NCBI
2017-10-16 07:24:15 URL: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz [65036954] -> "viral.1.1.genomic.fna.gz" [1]
2017-10-16 07:24:16 URL: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.1.genomic.fna.gz [14783391] -> "viral.2.1.genomic.fna.gz" [1]

Building a new DB, current time: 10/16/2017 07:24:22
New DB name:   /ihome/jpipas/pgc92/local/lib/viruspy/t/testsamtools/viral.db
New DB title:  viral.all.1.genomic.fna
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 9556 sequences in 3.78974 seconds.

/ihome/jpipas/pgc92/local/lib/viruspy/scripts/magicblast_w_opts.sh: running Magic-BLAST on SRR1553459
/ihome/jpipas/pgc92/local/lib/viruspy/scripts/magicblast_w_opts.sh: saving to './putative_viral_reads.fastq'

Converting putative viral reads back to FASTQ format...[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 30328 reads

$ l
total 440756
-rw-r--r-- 1 pgc92 jpipas   4775738 Oct 16 07:24 putative_viral_reads.fastq
-rw-r--r-- 1 pgc92 jpipas 265091874 Oct 16 07:24 viral.all.1.genomic.fna
-rw-r--r-- 1 pgc92 jpipas   1058219 Oct 16 07:24 viral.db.nhr
-rw-r--r-- 1 pgc92 jpipas    114764 Oct 16 07:24 viral.db.nin
-rw-r--r-- 1 pgc92 jpipas     38256 Oct 16 07:24 viral.db.nog
-rw-r--r-- 1 pgc92 jpipas    360908 Oct 16 07:24 viral.db.nsd
-rw-r--r-- 1 pgc92 jpipas      8087 Oct 16 07:24 viral.db.nsi
-rw-r--r-- 1 pgc92 jpipas  65302701 Oct 16 07:24 viral.db.nsq

$ head putative_viral_reads.fastq
@SRR1553459.12/1
CGTTCTGTCTTTTGTAGGATATGATCAAGTACTGTTTTGACAATCAATAGACCTGAAAAACGAGCTTGAGCCACTGAATTTGAAATCACAGCATCGTTGGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR1553459.12/2
GGTTCAAAGGCAAATTCAAGTACATGCAGAGCAAGGACTGATACAATATCCAACAGCTTGGCAATCAGTAGGACACATGATGGTGATTTTCCGTTTGATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR1553459.18/1
GCCAATCCTTATCCCGAAGTTACGGATCCGGCTTGCCGACTTCCCTTACCTCTGTCTCTTATACACATCTCCGAGCCCACGAGACGTAGAGGAATATCGTA

See if you get similar results.