intrahost.py error - Githubissues

mhopken commented 4 years ago

Hello,

I am having a problem with intrahost.py. Here is the command and error.

intrahost.py vphaser_one_sample AH0003550S.4.A.mka_sort.bam AH0003550S.4.A.fasta AH0003550S.4.A 2019-10-14 13:23:12,452 - cmd:197:main_argparse - INFO - software version: 1.24.0, python version: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42) [GCC 7.3.0] 2019-10-14 13:23:12,453 - cmd:199:main_argparse - INFO - command: /home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py vphaser_one_sample inBam=AH0003550S.4.A.mka_sort.bam inConsFasta=AH0003550S.4.A.fasta outTab=AH0003550S.4.A vphaserNumThreads=None minReadsEach=5 maxBias=10 removeDoublyMappedReads=False loglevel=INFO Traceback (most recent call last): File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 1203, in util.cmd.main_argparse(commands, doc) File "/home/matt_hopken/software/viral-ngs-etc/conda-env/opt/viral-ngs-1.24.0/util/cmd.py", line 224, in main_argparse ret = args.func_main(args) File "/home/matt_hopken/software/viral-ngs-etc/conda-env/opt/viral-ngs-1.24.0/util/cmd.py", line 106, in _main mainfunc(**args2) File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 150, in vphaser_one_sample for row in libraryFilteredIter: File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 190, in compute_library_bias rgs_by_lib = sorted((rg['LB'], rg['ID']) for rg in samtoolsTool.getReadGroups(inBam).values()) File "/home/matt_hopken/software/viral-ngs-etc/conda-env/bin/intrahost.py", line 190, in rgs_by_lib = sorted((rg['LB'], rg['ID']) for rg in samtoolsTool.getReadGroups(inBam).values()) KeyError: 'LB'

I am using the following OS: NAME="Ubuntu" VERSION="18.04.3 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.3 LTS" VERSION_ID="18.04"

Thank you, Matt

tomkinsc commented 4 years ago

Hi, can you please post the header for the bam file you are using as input? It can be dumped via samtools view -H file.bam

mhopken commented 4 years ago

Thank you for your response. I aligned with MOSAIK. @HD VN:1.0 SO:coordinate @SQ SN:AH0003550S.4.A LN:1765 M5:f752b5d274e09bb0fbae7bc07535c3f2 @RG ID:104T2YX4X13Q SM:unknown PI:163 PL:illumina @PG ID:MosaikAligner VN:2.2.26 CL:MosaikAligner -p 60 -in AH0003550S_S24_L001_read.mkb -out AH0003550S.4.A.mka -ia AH0003550S.4.A.dat -annpe src/networkFile/2.1.78.pe.ann -annse src/networkFile/2.1.78.se.ann

tomkinsc commented 4 years ago

Thanks for posting the header. In this case, the KeyError: 'LB' is accurate; at present, viral-ngs expects to see a library identifier (i.e. an LB value) in each read group (i.e.@rg line) of the bam file. Even though it's technically an optional sam/bam field, vphaser and viral-ngs both use use it to keep track of separate libraries for the purpose of reporting library bias, the number of libraries, etc.

We'll work on better handling of bam files with missing LB fields—or at least more explanatory error reporting—but for now to get you up and running I'd suggest adding a library identifier to each read group. The LB ID can be any unique value corresponding to the library, though by convention spaces are omitted. If the sample name, SM:sample_1, something like LB:sample_1_lib1 would work.

The easiest way to alter the read groups in this case may be via picard's AddOrReplaceReadGroups.

mhopken commented 4 years ago

Thank you very much for this assistance! I will give this a try.

I have once more question. For the tabfile_rename function, what is the mapFile and how would I structure that?

mhopken commented 4 years ago

I got V-Phaser2 to run within the Viral_NGS environment. Thank you for the assistance! If you could please help me understand the mapFile then I can process through to a vcf for my statistical analyses.

Thanks again!

tomkinsc commented 4 years ago

The function intrahost.py tabfile_rename can be used to perform a simple find/replace of values in a tab-delimited file. The mapFile argument is a text file with two tab-delimited columns (the value to find,the value to replace it with).

To go from the vphaser output to a VCF, I'd suggest you use the following functions (example for two samples): Align the sample assemblies to a reference genome via MAFFT:

interhost.py multichr_mafft \
      reference.fasta sample1_assembly.fasta sample2_assembly.fasta \ #each fasta should have all segments of the genome
      . \
      --outFilePrefix align_mafft-reference \
      --preservecase \
      --localpair \
      --sampleNameListFile align_mafft-reference-sample_names.txt

Create a VCF from the alignment:

intrahost.py merge_to_vcf \
        reference.fasta \ # reference fasta containing sequences for all segments
        isnvs.vcf.gz \
        --samples sample1 sample2 \ # values used in the assembly fasta headers
        --isnvs sample1_vphaser_output.txt sample2_vphaser_output.txt \
        --alignments align_mafft-reference.fasta \ # multiple samples per alignment file, one alignment per virus segment
        --strip_chr_version \ # removes segment num suffix if present
       --naive_filter # only specify where multiple libraries were used \
        --parse_accession

The VCF will have the following fields per-sample: GT:AF:DP:NL:LB, defined as:

GT: genotype call for the sample (consensus allele)
AF: minor allele frequencies (for each of the alleles listed in the ALT allele column)
DP: read depth
NL: number of libraries (one int per allele, including REF allele)
LB: library bias p-value from vphaser (one int per allele, including REF allele)

Annotate vcf via snpEff:

interhost.py snpEff \
        isnvs.vcf.gz \
        $snpRefAccessions \ # NCBI accessions of the reference genome segments
        isnvs.annot.vcf.gz \ # output file
        your_email@example.com # needed for querying genbank

Dump iSNVs to a table:

intrahost.py iSNV_table \
        isnvs.annot.vcf.gz \
        isnvs.annot.txt.gz

mhopken commented 4 years ago

I got this to work. Thank you very much for your assistance!

broadinstitute / viral-ngs

intrahost.py error #993