ValueError: Invalid VariantRecord. Number of samples does not match header

YPGG1234 commented 3 years ago

Hello,

Thanks for your contribution to Paragraph, it's a nice software for population analysis. Recently I am using Paragraph to genotyping some animals Illumina data, but when I run the multigrmpy.py, I got the mistake:

  vcf_out.write(record)
  File "pysam/libcbcf.pyx", line 4400, in pysam.libcbcf.VariantFile.write
  File "pysam/libcbcf.pyx", line 4426, in pysam.libcbcf.VariantFile.write
ValueError: Invalid VariantRecord.  Number of samples does not match header (0 vs 1)

My VCF file like this, it was generated from Sniffles and processed by custom script in order to be better used by Paragraph:

##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
##INFO=<ID=RE,Number=1,Type=Integer,Description="read support">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=UNRESOLVED,Number=0,Type=Flag,Description="An insertion that is longer than the read and thus we cannot predict the full size.">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of the SV">
##INFO=<ID=REF_strand,Number=2,Type=Integer,Description="Length of the SV">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Extracted sequence from the best representative read.">
##INFO=<ID=STD_quant_start,Number=A,Type=Float,Description="STD of the start breakpoints across the reads.">
##INFO=<ID=STD_quant_stop,Number=A,Type=Float,Description="STD of the stop breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_start,Number=A,Type=Float,Description="Kurtosis value of the start breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_stop,Number=A,Type=Float,Description="Kurtosis value of the stop breakpoints across the reads.">
##INFO=<ID=SUPTYPE,Number=A,Type=String,Description="Type by which the variant is supported.(SR,ALN,NR)">
##INFO=<ID=STRANDS,Number=A,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency.">
##INFO=<ID=ZMW,Number=A,Type=Integer,Description="Number of ZMWs (Pacbio) supporting SV.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference reads">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant reads">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ./frisen.md.bam
1       4157    0       AGTTGTT A       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=4163;ZMW=8;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=-nan;SVTY
1       4227    1       T       ATTGTGTGTGTGTGTGTGTG    .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=4227;ZMW=7;STD_quant_start=0.000000;STD_quant_stop=1.195229;Kurtosis_quant_start=-nan;Kurtosis_quan
1       4255    2       T       TTGG    .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=4255;ZMW=8;STD_quant_start=0.000000;STD_quant_stop=0.500000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=1.000000;
1       5183    3       T       CCA     .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=5183;ZMW=9;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=-nan;SVTY
1       5859    4       TCCCCCCCGCCC    T       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=5870;ZMW=9;STD_quant_start=0.000000;STD_quant_stop=0.666667;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=-
1       7699    5       G       GGC     .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=7699;ZMW=10;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=-nan;SVT
1       8644    6       AGTTGTT A       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=8650;ZMW=13;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=-nan;SVT
1       8714    7       T       ATTGTGTGTGTGTGTGTGTGTGTG        .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=8714;ZMW=13;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurt
1       9675    8       T       CCA     .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=9675;ZMW=14;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=-nan;SVT
1       10354   9       TCCCCCCCC       T       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=10362;ZMW=16;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop
1       29452   10      A       CCCGCAGCCCCCGCAGCC      .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=29452;ZMW=17;STD_quant_start=0.000000;STD_quant_stop=0.632456;Kurtosis_quant_start=0.145636;Kurtosi
1       38059   12      G       GGA     .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=38059;ZMW=19;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=7.000000;Kurtosis_quant_stop=-na
1       62806   14      TAACACA T       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=62812;ZMW=28;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=0.500000;Kurtosis_quant_stop=0.6
1       72807   16      GGGA    G       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=72810;ZMW=28;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=4.000000;Kurtosis_quant_stop=-na

I have no idea about this case cause I can't find any information about the "Number of samples" either in VCF body or header, could you please help me?

I have one other question: Suppose I have ten samples PacBio HiFi data, should I first use Sniffles to call SVs of each of them following by the "Population-scale genotyping" pipeline, after all of them are finished, merge all VCF files into one big VCF file; Or first merge all Sniffles VCFs into one (Using SURVIVOR), then follow the "Population-scale genotyping" pipeline. Which one is better?

Looking froward to your kind reply.

traxexx commented 3 years ago

In your VCF, the "#CHROM..." line has 10 fields but your VCF lines only have 8 fields.

If SURVIVOR doesn't require population genotypes (or other population metrics) for merging your records, then merge first before doing genotyping. If it does require population information...it depends on your computational resource. I may still recommend merging first, because, with many different records for one SV, you'll need to perform genotyping on each sample multiple times. This can be computationally expensive.

YPGG1234 commented 3 years ago

Thank you for your prompt reply!

Illumina / paragraph

ValueError: Invalid VariantRecord. Number of samples does not match header #63