AlexandrovLab / SigProfilerMatrixGenerator

SigProfilerMatrixGenerator creates mutational matrices for all types of somatic mutations. It allows downsizing the generated mutations to only parts of the genome (e.g., exome or a custom BED file). The tool seamlessly integrates with other SigProfiler tools.
BSD 2-Clause "Simplified" License

How to get ICGC-like VCF Format #56

Closed: sahuno closed this issue 3 years ago

sahuno commented 3 years ago

Please help: sigProfilerExtractor will not run with my VCF file. How can I convert my VCF file to the ICGC-like VCF format mentioned in the manual? I have searched online but haven't found any hints yet.

INPUT FILE FORMAT: This tool currently supports maf, vcf, simple text file, and ICGC formats. The user must provide variant data adhering to one of these four formats. If the user’s files are in vcf format, each sample must be saved as a separate file.

Below is my error message:


>>> sig.sigProfilerExtractor("vcf","mutect2_withFiltersTerra_vcf",data, minimum_signatures=1, maximum_signatures=4,reference_genome="GRCh38",opportunity_genome="GRCh38", exome= True)

************** Reported Current Memory Use: 0.17 GB *****************
File format not supported
>>> 

Here is a snippet of my VCF (without the ## meta-information header lines):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SC173874    SC173909
chr1    35816826    .   G   A   .   PASS    ALIGN_DIFF=63;AS_FilterStatus=SITE;AS_SB_TABLE=57,27|3,1;DP=95;ECNT=1;FUNCOTATION=[AGO4|hg38|chr1|35816826|35816826|INTRON||SNP|G|G|A|g.chr1:35816826G>A|ENST00000373210.3|+|||c.e2-56G>A|||0.42643391521197005|AAAAAAGAAAGAAAGAAAGAA|||||||||||||||||||||||||||||CH471059|NM_017629.3|NP_060099|HGNC:18424|argonaute_%20_4_%2C__%20_RISC_%20_catalytic_%20_component|Approved|gene_%20_with_%20_protein_%20_product|protein-coding_%20_gene|EIF2C4|"eukaryotic_%20_translation_%20_initiation_%20_factor_%20_2C_%2C__%20_4"_%2C__%20_"argonaute_%20_RISC_%20_catalytic_%20_component_%20_4"|hAGO4_%2C__%20_KIAA1567_%2C__%20_FLJ20033|"argonaute_%20_4"|1p34.3|2016-10-05|2013-02-15|2015-11-27|AB046787||192670|ENSG00000134698|12906857|NM_017629|408|Argonaute/PIWI_%20_family|CCDS397|OTTHUMG00000004243|192670|607356|NM_017629|Q9HCK5|ENSG00000134698|uc001bzj.3||||AGO4_HUMAN||A7MD27|Q9HCK5|epidermal_%20_growth_%20_factor_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0007173)_%7C_Fc-epsilon_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0038095)_%7C_fibroblast_%20_growth_%20_factor_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0008543)_%7C_gene_%20_expression_%20_(GO:0010467)_%7C_innate_%20_immune_%20_response_%20_(GO:0045087)_%7C_mRNA_%20_catabolic_%20_process_%20_(GO:0006402)_%7C_negative_%20_regulation_%20_of_%20_translation_%20_involved_%20_in_%20_gene_%20_silencing_%20_by_%20_miRNA_%20_(GO:0035278)_%7C_neurotrophin_%20_TRK_%20_receptor_%20_signaling_%20_pathway_%20_(GO:0048011)_%7C_Notch_%20_signaling_%20_pathway_%20_(GO:0007219)_%7C_phosphatidylinositol-mediated_%20_signaling_%20_(GO:0048015)|cytoplasmic_%20_mRNA_%20_processing_%20_body_%20_(GO:0000932)_%7C_cytosol_%20_(GO:0005829)_%7C_membrane_%20_(GO:0016020)_%7C_micro-ribonucleoprotein_%20_complex_%20_(GO:0035068)_%7C_RISC_%20_complex_%20_(GO:0016442)|miRNA_%20_binding_%20_(GO:0035198)|||||||||||||||||||false|false||false|false||false|false|false||false|false|false|false|false|false|false|false|f
alse|false|false|false|false|false|false|false|false|false|false|false|||false|false||false||false||false|false|false||false|||false||||||||||||||||||||||||||||||||||||||||||||3.60985e-04|5.05433e-04|5.98086e-04|4.37637e-04|0.00000e+00|0.00000e+00|0.00000e+00|3.81679e-03|0.00000e+00|5.15464e-03|0.00000e+00|0.00000e+00|0.00000e+00|3.28192e-04|4.04531e-04|7.66871e-04|0.00000e+00|3.86747e-04|2.88060e-04|0.00000e+00|1.63666e-04|3.85802e-04|2.48942e-04|5.29101e-04|1.61290e-02|0.00000e+00|0.00000e+00|0.00000e+00|5.05433e-04|1.55971e-03||1|36282427|false|false|rs1296156478|RF|SC173909|SC173874|Unknown|CRC_PR];GERMQ=93;MBQ=30,20;MFRL=233,315;MMQ=60,50;MPOS=27;NALIGNS=53;NALOD=1.21;NLOD=12.72;POPAF=2.71;ROQ=12;TLOD=6.51;UNITIGS=124    GT:AD:AF:DP:F1R2:F2R1:SB    0/1:36,4:0.109:40:9,0:21,4:26,10,3,1    0/0:48,0:0.021:48:23,0:23,0:31,17,0,0
chr1    48471987    .   G   T   .   PASS    AS_FilterStatus=SITE;AS_SB_TABLE=61,82|3,2;DP=155;ECNT=1;FUNCOTATION=[SPATA6|hg38|chr1|48471987|48471987|MISSENSE||SNP|G|G|T|g.chr1:48471987G>T|ENST00000371847.7|-|1|187|c.22C>A|c.(22-24)Cag>Aag|p.Q8K|0.71571072319202|AGGGCGCACTGCAGCGCCTTC|SPATA6_ENST00000371843.7_MISSENSE_p.Q8K/SPATA6_ENST00000396199.7_MISSENSE_p.Q8K||||||||||||||||||||91|biliary_tract(2)_%7C_breast(12)_%7C_central_nervous_system(44)_%7C_large_intestine(11)_%7C_pancreas(22)|||||||HM005491|NM_019073.3|NP_061946|HGNC:18309|spermatogenesis_%20_associated_%20_6|Approved|gene_%20_with_%20_protein_%20_product|protein-coding_%20_gene|||SRF1_%2C__%20_FLJ10007_%2C__%20_SRF-1|"spermatogenesis-related_%20_factor-1"|1p33|2016-10-05|||AK000869||54558|ENSG00000132122||NM_019073|||CCDS551_%2C__%20_CCDS65535_%2C__%20_CCDS72787|OTTHUMG00000007794|54558|613947|NM_001286238|Q9NWH7|ENSG00000132122|uc001crr.4||||SPAT6_HUMAN||Q5T3N7_%7C_Q8WUE6|Q9NWH7|cell_%20_differentiation_%20_(GO:0030154)_%7C_multicellular_%20_organismal_%20_development_%20_(GO:0007275)_%7C_spermatogenesis_%20_(GO:0007283)|extracellular_%20_region_%20_(GO:0005576)||||||||||||||||||||false|false||false|false||false|false|false||false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|||false|false||false||false||false|false|false||false|||false|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||false|false|||SC173909|SC173874|Unknown|CRC_PR];GERMQ=93;MBQ=30,20;MFRL=248,175;MMQ=60,60;MPOS=60;NALIGNS=1;NALOD=1.89;NLOD=22.87;POPAF=6.00;ROQ=24;TLOD=7.51;UNITIGS=205    GT:AD:AF:DP:F1R2:F2R1:SB    0/1:51,5:0.087:56:27,3:24,2:23,28,3,2   0/0:92,0:0.013:92:36,0:56,0:38,54,0,0

ebergstr commented 3 years ago

This message arises when a file in the input folder has an unexpected extension. Your file is in VCF format, so its extension needs to be ".vcf". If you still experience an issue, please make sure that no file has a hidden extension. You will also want to delete the input folder before rerunning with the new extensions.
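The extension check described above can be sketched in Python. This is only an illustration: the helper names `find_unexpected_files` and `move_unexpected_files` and the backup folder are hypothetical and not part of any SigProfiler tool, and the sketch assumes that every file in the input folder should end in exactly ".vcf".

```python
from pathlib import Path
import shutil


def find_unexpected_files(input_dir, expected_suffix=".vcf"):
    """Return the files in input_dir whose final extension is not expected_suffix.

    Hypothetical helper, not part of SigProfiler. A file such as
    'sample.vcf.idx' is reported because its final extension is '.idx'.
    """
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix != expected_suffix
    )


def move_unexpected_files(input_dir, backup_dir, expected_suffix=".vcf"):
    """Move non-matching files out of input_dir and return their new paths.

    backup_dir is an assumed location outside the input folder; it is
    created if it does not exist.
    """
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in find_unexpected_files(input_dir, expected_suffix):
        target = backup / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling `find_unexpected_files("mutect2_withFiltersTerra_vcf")` before running the extractor would list index files or other strays without changing anything. `move_unexpected_files` relocates them instead of deleting them, so they can be restored later.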

sahuno commented 3 years ago

@ebergstr, yes, you are right! My input folder had both .vcf and .vcf.idx files. I had to remove the .vcf.idx files from the input folder. Thanks, it works now!