Open wangpenhok opened 1 year ago
I went through the Python script step by step to prepare cosmic.vcf.gz for hg38 manually. The problem arises at the command `bcftools norm --check-ref s --do-not-normalize -f {ref_file}`, which fails because COSMIC.pre.vcf does not contain contig information in its header, similar to the problem described here: https://github.com/samtools/bcftools/issues/766. So I bgzipped COSMIC.pre.vcf and indexed it with tabix, and the following steps went well until `bcftools view -e 'SNP=1'`. If you skip this command, all the remaining commands run through.
So my question is: is the step `bcftools view -e 'SNP=1'` important for annotation? And how could we fix this problem? I didn't see any info regarding "SNP" in the header of the original COSMIC VCF files, possibly because of the updated version.
That command excludes the known SNPs (non-tumor) from the COSMIC data. I believe they don't ship these variants anymore, though.
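The filter could be made conditional on the header actually declaring the tag, since a bcftools filter expression that references an undeclared tag is consistent with the failure described above. The following is a minimal sketch, not the bcbio prep script: `has_info_tag` is a hypothetical helper, and it assumes an uncompressed VCF on disk.

```shell
# Hypothetical helper: succeed if the VCF header declares the given INFO tag.
# Assumes a plain-text (uncompressed) VCF; only header lines are inspected.
has_info_tag() {
    local vcf="$1" tag="$2"
    awk -v tag="$tag" '
        # Header lines start with "#"; stop at the first record.
        !/^#/ { exit }
        # Match a declaration such as ##INFO=<ID=SNP,Number=0,Type=Flag,...>
        index($0, "##INFO=<ID=" tag ",") == 1 { found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$vcf"
}
```

With a guard like this, the `bcftools view -e 'SNP=1'` step would run only when `has_info_tag COSMIC.pre.vcf SNP` succeeds, and would be skipped for releases that no longer carry the tag.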
Yeah, so I skipped this step as well. I manually prepared the VCF files for CodingVariants and NonCodingVariants separately. This time, the problem arises when I try to merge the two VCF files: the error indicates that their contigs are not compatible. It is strange that, even though I normalized both against the same reference genome, the contigs of the NonCoding VCF are ordered as "chr1, chr2, chr3, chr4, ..." while those of the Coding VCF are ordered as "chr1, chr10, chr11, chr12, ..., chr2, chr22, ...".
I cannot figure out what is wrong.
Did you try reordering the contigs so they all have the same order?
Thanks for your advice, I may try this later. But it still confuses me why different orders result from the same operation (exactly the same gunzip-normalize-gsort process, shown below):
gunzip -c CosmicNonCodingVariants.vcf.gz | sed "s/^\([0-9]\+\)\t/chr\1\t/g" | sed "s/^MT/chrM/g" | sed "s/^X/chrX/g" | sed "s/^Y/chrY/g" > CosmicNonCodingVariants-fix-chrom.vcf.gz
bgzip -c CosmicNonCodingVariants-fix-chrom.vcf.gz > CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz
tabix CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz
bcftools norm --check-ref s --do-not-normalize -f /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz > CosmicNonCodingVariants-fix-chrom-bgzip-norm.vcf.gz
gsort --memory 20000 CosmicNonCodingVariants-fix-chrom-bgzip-norm.vcf.gz /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.fai > CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz
grep -v ^##contig CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz | bcftools annotate -h ./v97/GRCh38/CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort-contig_header.txt
bgzip -c CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz > CosmicNonCodingVariants-prep.vcf.gz
tabix CosmicNonCodingVariants-prep.vcf.gz
mv CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort-contig_header.txt CosmicNonCodingVariants-prep-contig_header.txt
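Before merging, the ordering mismatch can be confirmed by comparing each file's `##contig` lines against the reference index. Note that "chr1, chr10, chr11, ..." is plain lexicographic order, while "chr1, chr2, chr3, ..." is numeric order. The sketch below uses hypothetical helper names and assumes the header has been dumped to a plain-text file first (for example with `bcftools view -h`); it only diagnoses the order and does not explain why the two runs diverged.

```shell
# Hypothetical helper: print contig IDs in the order they appear in a VCF header.
contig_order() {
    # Matches lines such as: ##contig=<ID=chr1,length=248956422>
    sed -n 's/^##contig=<ID=\([^,>]*\).*/\1/p' "$1"
}

# Hypothetical helper: succeed if the header's contigs all exist in the .fai
# and appear in the same relative order as in the .fai.
same_order_as_fai() {
    local header="$1" fai="$2"
    contig_order "$header" | awk '
        # First input (the .fai): remember the rank of each sequence name.
        NR == FNR { rank[$1] = FNR; next }
        # Second input (stdin): ranks must be strictly increasing.
        {
            if (!($1 in rank) || rank[$1] <= last) { bad = 1; exit }
            last = rank[$1]
        }
        END { exit (bad ? 1 : 0) }
    ' "$fai" -
}
```

If `same_order_as_fai` fails for one of the two headers, rewriting that file's `##contig` lines in `.fai` order before merging, as suggested above, is one option.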
Version info
To Reproduce Exact bcbio command you have used:
Log files (could be found in work/log)
2023-02-13 14:22:35,700 [I] Beginning COSMIC v94 prep for GRCh37.
2023-02-13 14:22:35,700 [I] Beginning COSMIC v94 prep for GRCh38.
2023-02-13 14:22:35,700 [I] Downloading COSMIC VCF files.
2023-02-13 14:22:35,700 [I] Downloading https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz
2023-02-13 14:25:13,610 [I] Downloading https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/VCF/CosmicNonCodingVariants.vcf.gz
2023-02-13 14:26:18,927 [I] Sorting v94/GRCh38/CosmicCodingMuts.vcf.gz to match the order of /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.
I have also tried COSMIC v93 and earlier versions such as v90, but it seems the VCF files are no longer available for these versions. Furthermore, for versions from v94 onward, although the VCF files download successfully, the subsequent processing always halts with this error:
Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
I assume the format of these newer VCF files has changed. Could you please update the bcbio_nextgen.py script to avoid this error? Thanks ~
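The error above indicates that the records reference `chr1` while the header has no matching `##contig` line. One way to supply those lines is to derive them from the reference `.fai`. The sketch below is an assumption about how that could be done with standard tools, not the fix adopted in bcbio; `fai_to_contig_header` and `add_contigs` are hypothetical names, and the input is assumed to be an uncompressed VCF. Recent bcftools releases also provide `bcftools reheader --fai`, which may be simpler where available.

```shell
# Hypothetical helper: emit one ##contig header line per sequence in a .fai index.
# Columns 1 and 2 of a .fai are the sequence name and its length.
fai_to_contig_header() {
    awk -F'\t' '{ printf "##contig=<ID=%s,length=%s>\n", $1, $2 }' "$1"
}

# Hypothetical helper: write the VCF to stdout with its ##contig lines
# replaced by ones derived from the .fai, placed just before #CHROM.
add_contigs() {
    local vcf="$1" fai="$2"
    awk -v fai="$fai" '
        /^##contig=/ { next }          # drop any existing contig lines
        /^#CHROM/ {                    # insert fresh ones before the column header
            while ((getline line < fai) > 0) {
                split(line, f, "\t")
                printf "##contig=<ID=%s,length=%s>\n", f[1], f[2]
            }
        }
        { print }
    ' "$vcf"
}
```

The output of `add_contigs COSMIC.pre.vcf hg38.fa.fai` could then be bgzipped and indexed before the `bcftools norm` step, so that the header and the reference agree on both contig names and order.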