alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 506 forks source link

waspOutputMode: could not find any SNPs in VCF file #1243

Open hadasbi opened 3 years ago

hadasbi commented 3 years ago

Hi,

I'm trying to use the --waspOutputMode option on a dataset. I'm using a COSMIC vcf file, filtered for SNPs only. I added the prefix "chr" to the first column to make it match the chromosome names in my genome (Before I did that, I got the same error).

This is how I run star: STAR --runMode alignReads --genomeDir ../../.. --readFilesCommand zcat --readFilesIn ../fastq_files/R1.fastq.gz ../fastq_files/R2.fastq.gz --outFileNamePrefix init --outSAMtype BAM Unsorted SortedByCoordinate --outSAMattributes NH HI AS nM NM MD jM jI XS MC ch vA vG vW --varVCFfile CosmicCodingMuts_SNPs.vcf --waspOutputMode SAMtag

I get the error: EXITING because of FATAL INPUT FILE ERROR: could not find any SNPs in VCF file: CosmicCodingMuts_SNPs.vcf SOLUTION: check formatting of the VCF file; unzip VCF file or use process substitution.

This is how the vcf looks like: chr1 65797 COSV58737189 T C . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN23957695;SNP;CDS=c.9+224T>C;AA=p.?;HGVSC=ENST00000641515.2:c.9+224T>C;HGVSG=1:g.65797T>C;CNT=1 chr1 66041 COSV58737025 A G . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN24406961;CDS=c.9+468A>G;AA=p.?;HGVSC=ENST00000641515.2:c.9+468A>G;HGVSG=1:g.66041A>G;CNT=1 chr1 66131 COSV58737120 C G . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN17493387;CDS=c.9+558C>G;AA=p.?;HGVSC=ENST00000641515.2:c.9+558C>G;HGVSG=1:g.66131C>G;CNT=1 chr1 66312 COSV58736797 A T . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN7277204;CDS=c.9+739A>T;AA=p.?;HGVSC=ENST00000641515.2:c.9+739A>T;HGVSG=1:g.66312A>T;CNT=1 chr1 66371 COSV58737143 A T . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN17943257;CDS=c.9+798A>T;AA=p.?;HGVSC=ENST00000641515.2:c.9+798A>T;HGVSG=1:g.66371A>T;CNT=1 chr1 66373 COSV58736978 A T . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN4902613;CDS=c.9+800A>T;AA=p.?;HGVSC=ENST00000641515.2:c.9+800A>T;HGVSG=1:g.66373A>T;CNT=2 chr1 66605 COSV58736995 T A . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN28548666;CDS=c.9+1032T>A;AA=p.?;HGVSC=ENST00000641515.2:c.9+1032T>A;HGVSG=1:g.66605T>A;CNT=1 chr1 66795 COSV58737166 T A . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN28504759;CDS=c.9+1222T>A;AA=p.?;HGVSC=ENST00000641515.2:c.9+1222T>A;HGVSG=1:g.66795T>A;CNT=1 chr1 66814 COSV58737064 A G . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN22517283;SNP;CDS=c.9+1241A>G;AA=p.?;HGVSC=ENST00000641515.2:c.9+1241A>G;HGVSG=1:g.66814A>G;CNT=2 chr1 66864 COSV58736928 C T . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN21650154;CDS=c.9+1291C>T;AA=p.?;HGVSC=ENST00000641515.2:c.9+1291C>T;HGVSG=1:g.66864C>T;CNT=1

Any idea what's wrong? Thanks, Hadas

lyj95618 commented 3 years ago

I think you need exactly 10 columns in the vcf file and the last column should be the genotype information (e.g. 1/0, 1/1)? This is the vcf format when I use for the STAR waspoutputMode

1 52727 . C G . PASS AC=1;AN=2 GT 1/0 1 53206 . G C . PASS AC=1;AN=2 GT 1/0

hadasbi commented 3 years ago

Thanks, I added these two columns (an entire column of "GT" and an entire column of "1/0") and the error did not appear. Can you please explain how does the WASP algorithm use this information? for example, if I put 1/0, but there's actually more than one possible alternative allele, would it report on both alleles (if found in the sequencing data) or would it ignore the second one?

alexdobin commented 3 years ago

Hi @hadasbi

the algorithm is designed for personal genomes, so it will only work with biallelic variants. The GT column (10th) should contain only two alleles.

Cheers Alex