USDA-VS / vSNP3

vSNP -- validate SNPs
GNU General Public License v3.0

mpileup takes 5h per sample with nanopore data #1

Open duceppemo opened 2 years ago

duceppemo commented 2 years ago

Hi Tod,

I've been trying the nanopore support of vSNP3 and I think it still needs to be optimized.

First, the conda install of vSNP3 lacks two dependencies: vcftools and bcftools. To get a fully working pipeline (I have only tested step 1 so far), I had to run:

conda create -y -n vsnp3 -c bioconda vsnp3=3.06
conda activate vsnp3
conda install -c bioconda vcftools bcftools
# I had a problem with some vcftools library that could be solved by creating a symbolic link
ln -s /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.1 /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.0.0
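A quick way to confirm that the external tools are visible in the active environment before starting a run is sketched below. It is only an illustration: the tool list is taken from the commands and versions that appear in this thread, it is not vSNP3's official requirement list, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executables mentioned in this thread (an assumption, not an official
# vSNP3 requirement list).
TOOLS = ["minimap2", "samtools", "bcftools", "vcftools",
         "vcffilter", "freebayes", "sourmash"]


def missing_tools(tools=TOOLS):
    """Return the names from `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Running this inside the activated vsnp3 environment shows whether vcftools and bcftools were pulled in as dependencies or still need to be installed by hand.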

A few warnings are printed in the terminal while step 1 runs. The command I used:

$ vsnp3_step1.py -n -r1 '/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz' -f /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta -b /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk

The terminal output:

vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)

Best Reference Finding with Sourmash 
2022-05-19 14:51:17

== This is sourmash version 4.4.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: /home/bioinfo/analyses/mbovis_... (k=31, DNA)
loaded 1 databases.

WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
11 matches; showing first 3:
similarity   match
----------   -----
  6.3%       NC_002945.4 Mycobacterium bovis AF2122/97 genome assembly...
  6.2%       NZ_CP041790.1 Mycobacterium tuberculosis strain SEA170200...
  6.2%       CP016401.1 Mycobacterium caprae strain Allgaeu genome

Sample: MBWGS036
Top Sourmash Finding: NC_002945.4 
Reference Set: Mycobacterium_AF2122 
Top reference that is automatically available: /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta

#############

Spoligotype 
2022-05-19 14:51:23

Align and make VCF file 
2022-05-19 14:52:36
[M::mm_idx_gen::0.136*1.01] collected minimizers
[M::mm_idx_gen::0.160*1.95] sorted minimizers
[M::main::0.160*1.95] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.184*1.83] mid_occ = 11
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.194*1.79] distinct minimizers: 770441 (96.15% are singletons); average occurrences: 1.053; average spacing: 5.362; total length: 4349904
[M::worker_pipeline::3.132*5.40] mapped 9607 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-ont -R @RG\tID:MBWGS036\tSM:MBWGS036\tPL:ILLUMINA\tPI:250 -t 8 -o MBWGS036.sam /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/MBWGS036.fastq.gz
[M::main] Real time: 3.149 sec; CPU: 16.941 sec; Peak RSS: 0.794 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[markdup] warning: unable to calculate estimated library size. Read pairs 0 should be greater than duplicate pairs 0, which should both be non zero.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf temp1.vcf
    --recode-INFO-all
    --out temp2
    --recode
    --remove-indels

Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
After filtering, kept 516 out of a possible 611 Sites
Run Time = 0.00 seconds

Zero Coverage 
2022-05-19 17:12:07
    Positions with no coverage: 12,953, 0.297777% of reference

MBWGS036 Poor FASTQ Usability
MBWGS036 Acceptable Reference Usability

As you can see, the top reference has a very low similarity (6.3%). It still picks the right one, but this part of the pipeline is not optimized for Nanopore. Also, why is it still looking for the best reference if we already told it which one to use?

The log file looks like this:


vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='MBWGS009.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)

Call Summary:
SYSTEM CALL: minimap2 -a -x map-ont -R "@RG\tID:MBWGS009\tSM:MBWGS009\tPL:ILLUMINA\tPI:250" -t 8 /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/MBWGS009.fastq.gz -o MBWGS009.sam -- 2022-05-19_17:45:11
SYSTEM CALL: samtools fixmate -O bam,level=1 -m MBWGS009.sam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools sort -l 1 -@8 -o MBWGS009_pos_srt.bam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools markdup -f markduplicate_stats.txt -r -O bam,level=1 MBWGS009_pos_srt.bam MBWGS009_nodup.bam -- 2022-05-19_17:45:24
NOTE: Read stats gathered by markduplicate_stats.txt -- 2022-05-19_17:45:24
NOTE: Nanopore - bcftools mpileup used to call SNPs and make VCF files *** -- 2022-05-20_00:17:23
SYSTEM CALL: bcftools mpileup --threads 16 -Ou -f /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta MBWGS009_nodup.bam | bcftools call --threads 16 -mv -v -Ov -o MBWGS009_unfiltered_hapall.vcf -- 2022-05-20_00:17:23
SYSTEM CALL: vcffilter -f "QUAL > 20" MBWGS009_unfiltered_hapall.vcf > temp1.vcf -- 2022-05-20_00:17:23
NOTE: Nanopore QUAL values increased by 100 to obtain closer values seen with Illumina reads, and allowing VCF files from both platforms to be ran together. -- 2022-05-20_00:17:23
NOTE: Skipped unmapped read assembly -- 2022-05-20_00:17:23
IMPORT: VCF_Annotation(gbk_list=self.gbk, vcf_file=filtered_hapall) -- 2022-05-20_00:17:25
IMPORT: Zero_Coverage(FASTA=reference, bam=nodup_bamfile, vcf=filtered_hapall,) -- 2022-05-20_00:17:41
NOTE: Files moved to temp_dir and removed: *_unmapped*.fastq.gz, *_all.bam, *_fixmate.bam, *_pos_srt.bam, markduplicate_stats.txt, *.bai, *_filtered_hapall.vcf, *_mapfix_hapall.vcf, *_unfiltered_hapall.vcf, *_filtered_hapall_nanopore.vcf, *.sam, *.amb, *.ann, *.bwt, *.pac, *.fasta.sa, *_sorted.bam, *.dict, chrom_ranges.txt, *.fai, dup_metrics.csv -- 2022-05-20_00:17:41

Versions:
vSNP3: 3.06
Bio, 1.79
numpy, 1.22.3
pandas, 1.4.2
Minimap2: 2.24-r1122
Freebayes: v1.3.6
samtools 1.15
Using htslib 1.14

The main issue right now is that the mpileup step (using bcftools) takes about 5h per sample. I just can't rerun all my samples with vSNP3 if it takes that long!
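The timestamps in the log file above give a rough way to see where the time goes. The sketch below assumes every timed log line ends with ` -- YYYY-MM-DD_HH:MM:SS`, as in the excerpt, and that each stamp records when the entry was written, so a large gap only brackets the slow step rather than timing a single command exactly. `step_gaps` is a hypothetical helper, and the two example lines are abbreviated copies of lines from the log.

```python
import re
from datetime import datetime

# Matches the trailing " -- 2022-05-19_17:45:24" style stamp seen in the log.
STAMP = re.compile(r" -- (\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\s*$")


def step_gaps(log_lines):
    """Return (seconds_since_previous_stamp, line_prefix) per stamped line."""
    gaps = []
    previous = None
    for line in log_lines:
        match = STAMP.search(line)
        if not match:
            continue  # lines without a stamp (e.g. version info) are skipped
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d_%H:%M:%S")
        if previous is not None:
            gaps.append(((stamp - previous).total_seconds(), line[:60]))
        previous = stamp
    return gaps


# Abbreviated copies of two consecutive log entries for MBWGS009.
example = [
    "SYSTEM CALL: samtools markdup (abbreviated) -- 2022-05-19_17:45:24",
    "NOTE: Nanopore - bcftools mpileup used to call SNPs (abbreviated) -- 2022-05-20_00:17:23",
]

if __name__ == "__main__":
    seconds, label = max(step_gaps(example))
    print(f"{seconds / 3600:.1f} h before: {label}")
```

For these two entries the gap is 23,519 seconds, roughly 6.5 hours, which falls between the markdup entry and the mpileup note.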

Here's the content of the Excel stats file:

sample: MBWGS009
date: 2022-05-19_17-40-28
FASTA/s: NC_002945v4.fasta
Sourmash Sequence Similarity: 3.9%:3b48a55512e8dedc2b8d6e33699893bd
Found_Reference_Set: Mycobacterium_AF2122
FASTQ_R1: MBWGS009.fastq.gz
R1 File Size: 248.4 MB
R1 Read Count: 74,874
R1 Length Sum: 262,465,587
R1 Min Length: 1
R1 Ave Length: 3,505.4
R1 Max Length: 36,224
R1 Passing Q20: 65.27%
R1 Passing Q30: 36.07%
R1 Read Quality Ave: 13.8717
Spoligotype Spacer Counts: 20:23:0:27:0:24:26:24:0:28:0:16:26:26:26:0:28:32:0:23:27:28:36:32:36:43:35:35:31:0:0:0:0:0:32:38:36:35:0:0:0:0:0
Spoligotype Binary Code: binary-1101011101011110110111111111100000111100000
Spoligotype Octal Code: octal-656573377603600
Spoligotype SB Number: SB1071
Groups: group file not provided
Aligner: Minimap2
Mapped Paired Reads: 0
Mapped Single Reads: 74,847
Unmapped Reads: 2,725
Unmapped Percent: 3.5%
Unmapped Assembled Contigs: skipped assembly
Duplicate Paired Reads: 0
Duplicate Single Reads: 442
Duplicate Percent of Mapped Reads: 0.6%
BAM/Reference File: MBWGS009_nodup.bam made with NC_002945v4
Reference Length: 4,349,904
Genome with Coverage: 99.81%
Average Depth: 59.1X
No Coverage Bases: 8,295
Percent Ref with Zero Coverage: 0.190694%
Quality SNPs: 596

So, are there any plans to improve support for Nanopore? I haven't tested vSNP3 on paired-end data yet, so I don't know whether the speed problem is specific to Nanopore. Let me know if you need more info.

Thanks! Marco

stuber commented 2 years ago

Thanks for checking out vsnp3 and reporting the issues you've seen.

I've had inconsistent results with vcftools and bcftools. I typically see bcftools installed via the freebayes requirement, so I have left it out of the explicit requirement list. The same goes for vcftools, which comes in with vcflib. When I specified bcftools explicitly, I fought with conda installing it as a Python 2 tool even when asking for Python 3. I've had the best results leaving both out of the explicit requirements and letting them be installed as requirements of freebayes and vcflib. The same applies to libcrypto (and other libraries).

Other than having comments like this here to help other users, I am convinced that, because everyone's environment is slightly different, conda may require troubleshooting either to "fix" a user's environment or to fix something being overlooked by conda.

That being said, I should look at replacing these tools since they're often problematic. I did this for pysam/samtools: they would often (but not always) cause conflicting libraries, so pysam was removed from vsnp3. I will soon be working on providing vsnp3 as a container. Hopefully this will ease installation, or at least provide another option.

Nanopore support is beta at best, especially since the technology is steadily changing. Can you share the FASTQ file you're using? If so, I would like to troubleshoot.

Sourmash runs quickly, and I like seeing the "best reference" even when one is specified. It should still be using the reference you specified. I'm going to update the wording so there isn't confusion.

I would like to improve Nanopore support. This has been a first test at seeing how it may work, but the datasets tried so far have been few. This input is good to get.
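One possible direction for the slow mpileup step, offered as an untested sketch and not as what vSNP3 does: split the reference into regions and run one `bcftools mpileup | bcftools call` per region in parallel. According to the bcftools manual, `--threads` is only used for compressing the output stream when the output type is `b` or `z`, so with `-Ou` and `-Ov` as in the logged command the 16 threads probably do not speed anything up. The sketch only builds the command strings. The contig name `NC_002945.4`, the chunk count, the output file names and both function names are assumptions for illustration; the reference length comes from the stats above.

```python
import shlex


def region_chunks(contig, length, n_chunks):
    """Split 1..length into at most n_chunks contiguous, 1-based,
    inclusive regions formatted as 'contig:start-end'."""
    size = -(-length // n_chunks)  # ceiling division
    regions = []
    for start in range(1, length + 1, size):
        end = min(start + size - 1, length)
        regions.append(f"{contig}:{start}-{end}")
    return regions


def mpileup_commands(fasta, bam, contig, length, n_chunks=8):
    """Build one mpileup/call shell pipeline per region (not executed)."""
    commands = []
    for i, region in enumerate(region_chunks(contig, length, n_chunks)):
        out = f"chunk_{i:02d}.vcf"  # hypothetical output name
        commands.append(
            f"bcftools mpileup -Ou -r {shlex.quote(region)} "
            f"-f {shlex.quote(fasta)} {shlex.quote(bam)} "
            f"| bcftools call -mv -Ov -o {shlex.quote(out)}"
        )
    return commands


if __name__ == "__main__":
    # Contig name is assumed; check the FASTA header before using it.
    for cmd in mpileup_commands("NC_002945v4.fasta", "MBWGS009_nodup.bam",
                                "NC_002945.4", 4_349_904):
        print(cmd)
```

The per-chunk VCFs could then be combined in region order, for example with `bcftools concat`. Whether this actually helps depends on where the time is spent, so it would be worth timing a single small region first.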