Closed mhopken closed 4 years ago
Hi, can you please post the header for the bam file you are using as input? It can be dumped via samtools view -H file.bam
Thank you for your response. I aligned with MOSAIK. @HD VN:1.0 SO:coordinate @SQ SN:AH0003550S.4.A LN:1765 M5:f752b5d274e09bb0fbae7bc07535c3f2 @RG ID:104T2YX4X13Q SM:unknown PI:163 PL:illumina @PG ID:MosaikAligner VN:2.2.26 CL:MosaikAligner -p 60 -in AH0003550S_S24_L001_read.mkb -out AH0003550S.4.A.mka -ia AH0003550S.4.A.dat -annpe src/networkFile/2.1.78.pe.ann -annse src/networkFile/2.1.78.se.ann
Thanks for posting the header. In this case, the KeyError: 'LB'
is accurate; at present, viral-ngs expects to see a library identifier (i.e. an LB value) in each read group (i.e.@rg
line) of the bam file. Even though it's technically an optional sam/bam field, vphaser and viral-ngs both use use it to keep track of separate libraries for the purpose of reporting library bias, the number of libraries, etc.
We'll work on better handling of bam files with missing LB
fields—or at least more explanatory error reporting—but for now to get you up and running I'd suggest adding a library identifier to each read group. The LB ID can be any unique value corresponding to the library, though by convention spaces are omitted. If the sample name, SM:sample_1
, something like LB:sample_1_lib1
would work.
The easiest way to alter the read groups in this case may be via picard's AddOrReplaceReadGroups
.
Thank you very much for this assistance! I will give this a try.
I have once more question. For the tabfile_rename function, what is the mapFile and how would I structure that?
I got V-Phaser2 to run within the Viral_NGS environment. Thank you for the assistance! If you could please help me understand the mapFile then I can process through to a vcf for my statistical analyses.
Thanks again!
The function intrahost.py tabfile_rename
can be used to perform a simple find/replace of values in a tab-delimited file. The mapFile
argument is a text file with two tab-delimited columns (the value to find,the value to replace it with).
To go from the vphaser output to a VCF, I'd suggest you use the following functions (example for two samples): Align the sample assemblies to a reference genome via MAFFT:
interhost.py multichr_mafft \
reference.fasta sample1_assembly.fasta sample2_assembly.fasta \ #each fasta should have all segments of the genome
. \
--outFilePrefix align_mafft-reference \
--preservecase \
--localpair \
--sampleNameListFile align_mafft-reference-sample_names.txt
Create a VCF from the alignment:
intrahost.py merge_to_vcf \
reference.fasta \ # reference fasta containing sequences for all segments
isnvs.vcf.gz \
--samples sample1 sample2 \ # values used in the assembly fasta headers
--isnvs sample1_vphaser_output.txt sample2_vphaser_output.txt \
--alignments align_mafft-reference.fasta \ # multiple samples per alignment file, one alignment per virus segment
--strip_chr_version \ # removes segment num suffix if present
--naive_filter # only specify where multiple libraries were used \
--parse_accession
The VCF will have the following fields per-sample: GT:AF:DP:NL:LB
, defined as:
GT: genotype call for the sample (consensus allele)
AF: minor allele frequencies (for each of the alleles listed in the ALT allele column)
DP: read depth
NL: number of libraries (one int per allele, including REF allele)
LB: library bias p-value from vphaser (one int per allele, including REF allele)
Annotate vcf via snpEff:
interhost.py snpEff \
isnvs.vcf.gz \
$snpRefAccessions \ # NCBI accessions of the reference genome segments
isnvs.annot.vcf.gz \ # output file
your_email@example.com # needed for querying genbank
Dump iSNVs to a table:
intrahost.py iSNV_table \
isnvs.annot.vcf.gz \
isnvs.annot.txt.gz
I got this to work. Thank you very much for your assistance!
Hello,
I am having a problem with intrahost.py. Here is the command and error.
intrahost.py vphaser_one_sample AH0003550S.4.A.mka_sort.bam AH0003550S.4.A.fasta AH0003550S.4.A 2019-10-14 13:23:12,452 - cmd:197:main_argparse - INFO - software version: 1.24.0, python version: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42) [GCC 7.3.0] 2019-10-14 13:23:12,453 - cmd:199:main_argparse - INFO - command: /home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py vphaser_one_sample inBam=AH0003550S.4.A.mka_sort.bam inConsFasta=AH0003550S.4.A.fasta outTab=AH0003550S.4.A vphaserNumThreads=None minReadsEach=5 maxBias=10 removeDoublyMappedReads=False loglevel=INFO Traceback (most recent call last): File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 1203, in
util.cmd.main_argparse(commands, doc)
File "/home/matt_hopken/software/viral-ngs-etc/conda-env/opt/viral-ngs-1.24.0/util/cmd.py", line 224, in main_argparse
ret = args.func_main(args)
File "/home/matt_hopken/software/viral-ngs-etc/conda-env/opt/viral-ngs-1.24.0/util/cmd.py", line 106, in _main
mainfunc(**args2)
File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 150, in vphaser_one_sample
for row in libraryFilteredIter:
File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 190, in compute_library_bias
rgs_by_lib = sorted((rg['LB'], rg['ID']) for rg in samtoolsTool.getReadGroups(inBam).values())
File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 190, in
rgs_by_lib = sorted((rg['LB'], rg['ID']) for rg in samtoolsTool.getReadGroups(inBam).values())
KeyError: 'LB'
I am using the following OS: NAME="Ubuntu" VERSION="18.04.3 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.3 LTS" VERSION_ID="18.04"
Thank you, Matt