Closed: WashingtondaSilva closed this issue 4 years ago
Hi Washington,
Yeah, that is because LoFreq doesn't have a sample concept, hence the lack of a SAMPLES column and corresponding FORMAT column. You will have to add that manually I'm afraid, but it's actually not hard:
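One possible way to do this is sketched below in Python; it is a minimal illustration under stated assumptions, not necessarily the approach originally suggested in this reply. It assumes each record carries `DP` and `DP4` in INFO (as in the LoFreq output quoted below), that `DP4` holds ref-forward, ref-reverse, alt-forward and alt-reverse counts, and that each record has a single ALT allele. The function names `add_format_column` and `convert_file` and the sample name `sample1` are made up for this example. The summed `DP4` counts can be lower than `DP`, so check whether that matters for your analysis.

```python
def add_format_column(lines, sample="sample1"):
    """Yield VCF lines with a FORMAT column and one sample column (AD:DP) added.

    `sample` is a placeholder column name chosen for this sketch.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Existing meta-information lines pass through unchanged.
            yield line
        elif line.startswith("#CHROM"):
            # Declare the two new FORMAT keys just before the column header.
            yield ('##FORMAT=<ID=AD,Number=.,Type=Integer,'
                   'Description="Ref and alt allele depths (summed from INFO/DP4)">')
            yield ('##FORMAT=<ID=DP,Number=1,Type=Integer,'
                   'Description="Total read depth (copied from INFO/DP)">')
            yield line + "\tFORMAT\t" + sample
        elif line:
            fields = line.split("\t")
            # Parse key=value INFO entries; flag entries such as INDEL are skipped.
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info["DP4"].split(","))
            ad = f"{ref_fwd + ref_rev},{alt_fwd + alt_rev}"
            # Keep the key order (AD, then DP) identical in FORMAT and sample columns.
            yield "\t".join(fields[:8] + ["AD:DP", f"{ad}:{info['DP']}"])


def convert_file(in_path, out_path, sample="sample1"):
    """Read a LoFreq VCF from in_path and write the augmented VCF to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for out_line in add_format_column(src, sample):
            dst.write(out_line + "\n")
```

For example, `convert_file("SNVs.vcf", "SNVs_with_format.vcf")` would write a copy of the file in which the first record quoted below ends with `AD:DP` and `1555,44:1620`. The input must be TAB-delimited, as real VCF files are.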
Andreas
On 26 January 2018 at 01:14, Washington Luis da Silva <notifications@github.com> wrote:
Hi there,
I am trying to run SNPGenie to calculate nucleotide diversity on my VCF file from a virus population study. SNPGenie has the following requirement for VCF format 4 files:
"FORMAT (4): --vcfformat=4. Like formats 2 and 3, variants have been called from a pooled deep-sequencing sample containing genomes from multiple individuals. For this format, SNPGenie will require AD and DP data in the FORMAT and <SAMPLE> columns, where FORMAT is the final column header before the <SAMPLE> column(s) begins. The order of the data keys in the FORMAT column and the data values in the <SAMPLE> columns must be preserved:
- For reference allele depth, include the AD tag in the FORMAT column, which refers to the read depth value for each allele in the <SAMPLE> column(s), with values for variant allele(s) in the same order as listed in the ALT column (e.g., "AD" in the FORMAT column and "75,77" in the <SAMPLE> column);
- For coverage (total read depth), include DP in the FORMAT column, which refers to the total read depth in the <SAMPLE> column(s) (e.g., "DP" in the FORMAT column and "152" in the <SAMPLE> column).
As usual, you will want to make sure to maintain the VCF file's features, such as TAB(\t)-delimited columns. Unlike some other formats, the allele frequency in VCF is a decimal."
However, the VCF file from LoFreq is missing the FORMAT column. Please see the head of the LoFreq output below.
##fileformat=VCFv4.0
##fileDate=20180118
##source=lofreq call -f /home/dasilva/Desktop/NGS_files/cns_source_Mont.fa -o SNVs.vcf GATK.bam
##reference=/home/dasilva/Desktop/NGS_files/cns_source_Mont.fa
##INFO=
##INFO=
##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position">
##INFO=
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=
##INFO=
##FILTER=
##FILTER=<ID=sb_fdr,Description="Strand-Bias Multiple Testing Correction: fdr corr. pvalue > 0.001000">
##FILTER=
##FILTER=<ID=min_indelqual_20,Description="Minimum Indel Quality (Phred) 20">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
cns_source_Mont	188	.	C	T	302	PASS	DP=1620;AF=0.027160;SB=0;DP4=1552,3,44,0
cns_source_Mont	189	.	A	G	122	PASS	DP=1624;AF=0.012315;SB=0;DP4=1517,3,20,0
Is there anything I can do to change the LoFreq VCF output to match the requirements of SNPGenie?
Thanks a bunch,
-Washington
Awesome,
Thanks a bunch Andreas ;)
-Washington