Illumina / strelka

Strelka2 germline and somatic small variant caller
GNU General Public License v3.0
357 stars 103 forks source link

Extracting some information from Strelka vcf #163

Open beginner984 opened 4 years ago

beginner984 commented 4 years ago

Hello

I have a script extracting some fields like TumorReadCount, tumorVariantAlleleCount, TumorReferenceAlleleCount, NormalReadCount, NormalAlleleCount, NormalReferenceCountand VAFfrom .vcf Strelka

This is my script

## Iterate over all files given at the command line
foreach my $inFile (@ARGV) {
  print STDERR "Processing $inFile...\n";
  my $outFile = "$inFile.parsed.txt";
  $outFile =~ m|([^/]+)$|;
  #/#Useless comment to fix SE syntax highlighting
  my $outFileName = $1;
  open(my $outFileHandle, '>', $outFile);
  open(my $inFileHandle, '<', $inFile);

  ## Print header
  print $outFileHandle "$outFileName\tChrom\tPosition\tRef\tAlt\tTumorReadCount\t" .
      "TumorVariantAlleleCount\tTumorReferenceAlleleCount\tNormalReadCount" .
          "\tNormalVariantAlleleCount\tNormalReferenceAlleleCount\tVAF\n";
  ## Read all lines of the input file
  while (<$inFileHandle>) {
    ## Skip headers
    next if /^#/;
    ## Get VCF fields
    my ($chrom, $pos, $name, $ref, $alt, $qual,
        $filter, $info, $format, $values) = split(/\t/);
    $info=~s/[^;]+=//g;
    my @infoFields = split(/;/, $info);
    print $outFileHandle (join "\t", $outFileName, $chrom, $pos, $ref, $alt,
                               $infoFields[12],$infoFields[25],
                               $infoFields[12]-$infoFields[25],
                               $infoFields[13],$infoFields[26],
                               $infoFields[13]-$infoFields[26],
                               $infoFields[27]) . "\n";
  }
}

This code works well on SNVs for example this file

https://sotonac-my.sharepoint.com/personal/fi1d18_soton_ac_uk/Documents/Microsoft%20Teams%20Chat%20Files/LP6008202-DNA_B03_vs_LP6008201-DNA_B03.snp.pass.vcf

But gives empty fields on INDELs like file

https://sotonac-my.sharepoint.com/personal/fi1d18_soton_ac_uk/Documents/Microsoft%20Teams%20Chat%20Files/LP6008202-DNA_B03_vs_LP6008201-DNA_B03.passed.somatic.indels.vcf

I really don't know how to change the code to work on INDELs

Can you please help me?

Thank you

SebastianHollizeck commented 4 years ago

I dont want to tell you how to do your thing, but if you can avoid parsing vcfs yourself, I think you should. Why not take one of the many libraries/modules available in basically any programming language to help you with it? For perl, which this script seesm to be vcftools provides a VCF.pm perl module, which could be very helpful. https://vcftools.github.io/perl_module.html

beginner984 commented 4 years ago

Thank you. In INDEL file there is not normal INFO

I noticed in some cases the frequency is above 100% or 1, is this because there is more information in support of the variant than there is read depth???. In these cases, could I consider the frequency to be around 1?