bmvdgeijn / WASP

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery
Apache License 2.0
102 stars 51 forks source link

find_intersecting_snps.py #107

Open LunaPatton opened 3 years ago

LunaPatton commented 3 years ago

Hi, Recently I plan to perform ASE analysis from RNA-Seq data, after finished alignment by STAR, I start to run WASP for removing the mapping bias, while met with many problem. Based on the instruction, I prepared VCF file (from ensembl, i.e., bos_taurus_incl_consequences.vcf.gz) and chromsome size file (from ucsc) for cattle like below: chromInfo.txt chr1 158534110 chrX 139009144 chr2 136231102 chr3 121005158 ......

VCF file

CHROM POS ID REF ALT QUAL FILTER INFO

1 207381 rs719849645 C T . . dbSNP_150;TSA=SNV;VE=intergenic_variant;CSQ=T|intergenic_variant|||| 1 207419 rs517357290 G A . . dbSNP_150;TSA=SNV;VE=intergenic_variant;CSQ=A|intergenic_variant|||| ...... I created hdf5 files with below codes:

$WASP/snp2h5/snp2h5 \ --chrom ./ucsc_cattle_chromInfo.txt \ --chrom $outdir/ref/cattle_ase/cattle_chromInfo.txt \
--format vcf \ --snp_index cattle_snp_index.h5 \ --snp_tab cattle_snp_tab.h5 \ ./bos_taurus_incl_consequences.vcf.gz --haplotype cattle_haplotypes.h5 \

and here is output log file

Wed Apr 21 21:38:23 CST 2021 long alleles will be truncated to 100bp writing haplotypes to: cattle_haplotypes.h5 writing SNP index to: cattle_snp_index.h5 writing SNP table to: cattle_snp_tab.h5 WARNING: VCF contains fewer header columns in #CHROM line than expected missing headers: FORMAT WARNING: VCF header contains no sample identifiers in #CHROM line WARNING: did not find chromosome matching name '1', trying again with name 'chr1' chromosome: chr1, length: 158534110bp reading from file ./bos_taurus_incl_consequences.vcf.gz counting lines in file total lines: 10639414 reading VCF header WARNING: VCF contains fewer header columns in #CHROM line than expected missing headers: FORMAT WARNING: VCF header contains no sample identifiers in #CHROM line VCF header lines: 22 number of samples: 0 WARNING: VCF file has 0 samples, cannot write haplotypes parsing file and writing to HDF5 files .......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ERROR: expected at least 9 tokens per line (line: 10639415)

I was so confused to understand the instruction, maybe I got wrong VCF file? @gmcvicker

Best regards Zoe

gmcvicker commented 3 years ago

Hi Zoe,

Normally the VCF header contains more columns. Specifically your header is missing a FORMAT column, as well as columns corresponding to the sample names. If your VCF only has a single sample, it should still have a sample name. For example a typical header line might look like this, where GM12878 is the sample name:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GM12878 In addition, the rows of the VCF would typically contain at least 9 columns, with the column preceding the sample information being the FORMAT column that describes what information provided for each sample. Here is an example row from a VCF that I have on hand:

22 16101526 . A G 23.79 PASS AC=1;AF=0.500;AN=2;AS_BaseQRankSum=2.300;AS_FS=0.000;AS_MQ=60.00;AS_MQRankSum=1.600;AS_QD=2.16;AS_ReadPosRankSum=-1.000;AS_SOR=0.086;BaseQRankSum=2.36;ClippingRankSum=0.00;DP=11;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=54.58;MQRankSum=1.60;QD=2.16;ReadPosRankSum=-9.430e-01;SOR=0.086;VQSLOD=1.24;culprit=QD GT:AD:DP:GQ:PL 0/1:9,2:11:52:52,0,319

Hope this helps,

Graham

LunaPatton commented 3 years ago

Thanks a lot Graham ! I will try again and give you the feedback.

By the way, I noticed there is --waspOutputMode in STAR software, what's the difference bewteen WASP and STAR if I run this function with STAR?

Best regards Zoe

LunaPatton commented 3 years ago

Hi Graham, I followed your suggestion and modified my VCF file, then I ran below code to create HDF5 file:

$WASP/snp2h5/snp2h5 \ --chrom ./cattle_chromInfo.txt.gz \ --format vcf \ --snp_index cattle_snp_index.h5 \ --snp_tab cattle_snp_tab.h5 \ --haplotype cattle_haplotypes.h5 \ ./cattle_anno.vcf.gz

The log file like this: Loading WASP/0.3.4   Loading requirement: HDF5/1.8.21 long alleles will be truncated to 100bp writing haplotypes to: cattle_haplotypes.h5 writing SNP index to: cattle_snp_index.h5 writing SNP table to: cattle_snp_tab.h5 chromosome: 1, length: 158534110bp reading from file ./cattle_anno.vcf.gz counting lines in file   total lines: 97127202 reading VCF header   VCF header lines: 1   number of samples: 1 initializing HDF5 matrix with dimension: (97127201, 2) initializing HDF5 matrix with dimension: (97127201, 1) parsing file and writing to HDF5 files WARNING: vcf.c:266: some genotypes are unphased (delimited with '/' instead of '|') ...................................................................................................................................... chromosome: 2, length: 136231102bp   no data for this chromosome chromosome: 3, length: 121005158bp   no data for this chromosome

.........

Here is my VCF file:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE

1       207381  rs719849645     C       T       .       .       dbSNP_150;TSA=SNV;VE=intergenic_variant;CSQ=T|intergenic_variant||||      GT      0/1


Could you please point out where I go wrong? 

Best wishes Zoe

------------------ 原始邮件 ------------------ 发件人: "bmvdgeijn/WASP" @.>; 发送时间: 2021年4月23日(星期五) 中午12:08 @.>; @.**@.>; 主题: Re: [bmvdgeijn/WASP] find_intersecting_snps.py (#107)

Hi Zoe,

Normally the VCF header contains more columns. Specifically your header is missing a FORMAT column, as well as columns corresponding to the sample names. If your VCF only has a single sample, it should still have a sample name. For example a typical header line might look like this, where GM12878 is the sample name:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GM12878

In addition, the rows of the VCF would typically contain at least 9 columns, with the column preceding the sample information being the FORMAT column that describes what information provided for each sample. Here is an example row from a VCF that I have on hand:

22 16101526 . A G 23.79 PASS AC=1;AF=0.500;AN=2;AS_BaseQRankSum=2.300;AS_FS=0.000;AS_MQ=60.00;AS_MQRankSum=1.600;AS_QD=2.16;AS_ReadPosRankSum=-1.000;AS_SOR=0.086;BaseQRankSum=2.36;ClippingRankSum=0.00;DP=11;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=54.58;MQRankSum=1.60;QD=2.16;ReadPosRankSum=-9.430e-01;SOR=0.086;VQSLOD=1.24;culprit=QD GT:AD:DP:GQ:PL 0/1:9,2:11:52:52,0,319

Hope this helps,

Graham

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.