bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
988 stars 354 forks source link

COSMIC NEWER VERSION SUPPORT(v94 and later) #3698

Open wangpenhok opened 1 year ago

wangpenhok commented 1 year ago

Version info

To Reproduce Exact bcbio command you have used:

bcbio_python  prepare_cosmic.py 94  ~/

Log files (could be found in work/log)

`2023-02-13 14:22:35,700 [I] Beginning COSMIC v94 prep for GRCh37. 2023-02-13 14:22:35,700 [I] Beginning COSMIC v94 prep for GRCh38. 2023-02-13 14:22:35,700 [I] Downloading COSMIC VCF files. 2023-02-13 14:22:35,700 [I] Downloading https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz 2023-02-13 14:25:13,610 [I] Downloading https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/VCF/CosmicNonCodingVariants.vcf.gz 2023-02-13 14:26:18,927 [I] Sorting v94/GRCh38/CosmicCodingMuts.vcf.gz to match the order of /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.

gsort version 0.1.4 [W::vcf_parse] Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.) [E::vcf_format] Invalid BCF, CONTIG id=0 not present in the header [flush_buffer] Error: cannot write to - Failed to read from standard input: unknown file type panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x546d43] goroutine 1 [running]: bufio.NewReaderSize(...) /home/brentp/go/src/bufio/bufio.go:49 bufio.NewReader(...) /home/brentp/go/src/bufio/bufio.go:62 github.com/brentp/gsort.Sort(0x7ad380, 0x0, 0x7ad3a0, 0xc00061bb00, 0xc00061bac0, 0xaf0, 0x0, 0x0, 0x0) /home/brentp/go/go/src/github.com/brentp/gsort/gsort.go:116 +0x63 main.main() /home/brentp/go/go/src/github.com/brentp/gsort/cmd/gsort/gsort.go:373 +0x513 Failed to read from standard input: unknown file type 2023-02-13 14:26:19,012 [I] bgzipping and indexing v94/GRCh38/CosmicCodingMuts-prep.vcf.gz. 2023-02-13 14:26:19,058 [I] Sorting v94/GRCh38/CosmicNonCodingVariants.vcf.gz to match the order of /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa. gsort version 0.1.4 [W::vcf_parse] Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.) [E::vcf_format] Invalid BCF, CONTIG id=0 not present in the header [flush_buffer] Error: cannot write to - Failed to read from standard input: unknown file type panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x546d43] goroutine 1 [running]: bufio.NewReaderSize(...) /home/brentp/go/src/bufio/bufio.go:49 bufio.NewReader(...) /home/brentp/go/src/bufio/bufio.go:62 github.com/brentp/gsort.Sort(0x7ad380, 0x0, 0x7ad3a0, 0xc000629d40, 0xc000629d00, 0xaf0, 0x0, 0x0, 0x0) /home/brentp/go/go/src/github.com/brentp/gsort/gsort.go:116 +0x63 main.main() /home/brentp/go/go/src/github.com/brentp/gsort/cmd/gsort/gsort.go:373 +0x513 Failed to read from standard input: unknown file type 2023-02-13 14:26:19,137 [I] bgzipping and indexing v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz. 2023-02-13 14:26:19,175 [I] Combining COSMIC files to v94/bcbio_ready/hg38/cosmic.vcf.gz. INFO 2023-02-13 14:26:19 MergeVcfs
** NOTE: Picard's command line syntax is changing.


** For more information, please see:


https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


** The command line looks like this in the new syntax:


** MergeVcfs -O v94/bcbio_ready/hg38/cosmic.vcf.gz -D /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.dict -I v94/GRCh38/CosmicCodingMuts-prep.vcf.gz -I v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz -USE_JDK_DEFLATER true -USE_JDK_INFLATER true -CREATE_INDEX false


[Mon Feb 13 14:26:19 CST 2023] MergeVcfs INPUT=[v94/GRCh38/CosmicCodingMuts-prep.vcf.gz, v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz] OUTPUT=v94/bcbio_ready/hg38/cosmic.vcf.gz SEQUENCE_DICTIONARY=/home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.dict CREATE_INDEX=false USE_JDK_DEFLATER=true USE_JDK_INFLATER=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json [Mon Feb 13 14:26:19 CST 2023] Executing as bcbio@dell-PowerEdge-R7525 on Linux 5.15.0-46-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_332-b09; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.27.5 [Mon Feb 13 14:26:19 CST 2023] picard.vcf.MergeVcfs done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=514850816 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: file:///home/data/bcbio/cosmic-prep/v94/GRCh38/CosmicCodingMuts-prep.vcf.gz at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:97) at htsjdk.tribble.TabixFeatureReader.(TabixFeatureReader.java:82) at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:117) at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:81) at htsjdk.variant.vcf.VCFFileReader.(VCFFileReader.java:145) at picard.vcf.MergeVcfs.doWork(MergeVcfs.java:182) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113) Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119) at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79) at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37) at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:95) ... 8 more Traceback (most recent call last): File "prepare_cosmic.py", line 246, in main(args.cosmic_version, args.bcbio_directory, args.overwrite, args.clean) File "prepare_cosmic.py", line 62, in main ready_cosmic = combine_cosmic(sorted_inputs, bcbio_ref, out_file) File "prepare_cosmic.py", line 153, in combine_cosmic subprocess.check_call(cmd) File "/home/data/bcbio/anaconda/lib/python3.7/subprocess.py", line 363, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['picard', 'MergeVcfs', 'O=v94/bcbio_ready/hg38/cosmic.vcf.gz', 'D=/home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.dict', 'I=v94/GRCh38/CosmicCodingMuts-prep.vcf.gz', 'I=v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz', 'USE_JDK_DEFLATER=true', 'USE_JDK_INFLATER=true', 'CREATE_INDEX=false']' returned non-zero exit status 1. `

I have also tried COSMIC version 93 and earlier versions such as v90 , but it seems the VCF files are not available anymore for these versions. Furthermore, for version from v94 on, though the vcf files could be downloaded successfully, the following process always halted because of error: Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.) I assume format changes occurred to these newer versions of vcf files. Could you please update the bcbio_nextgen.py script so as to avoid such bugs? Thanks ~

wangpenhok commented 1 year ago

I looked through the python script to prepare cosmic.vcf.gz for hg38 manually. Interestingly, I found the problem arises at the CMD bcftools norm --check-ref s --do-not-normalize -f {ref_file}, which fails because the COSMIC.pre.vcf does not contain contig information in its header, similar to the problem described here: https://github.com/samtools/bcftools/issues/766. So, I tried bgip the COSMIC.pre.vcf and tabix it, and then the following steps went well until bcftools view -e 'SNP=1'. But if you skip this cmd, you can finally run through all commands.

So my queston is the step bcftools view -e 'SNP=1' important for annotation? And how could we fix this problem? I didn't see any info regarding "SNP" in the header of the original COSMIC.vcf files, possibly due to the updated version.

lbeltrame commented 1 year ago

That command excludes the known SNPs (non-tumor) from the COSMIC data. I believe they don't ship these variants anymore, though.

wangpenhok commented 1 year ago

Yeah , so I skipped this step as well. I manually prepared vcf files of CodingVariants and NonCodingVariants separately. This time, the problem arises when I tried to merge the above two vcf files with error indicating that the contigs of them are not compatible. It is strange that even if I normalized according to the same reference genome, the contigs of NonCodingVcf is ordered as "chr1, chr2, chr3, chr 4....." while the CodingVcf is ordered as "chr1, chr10, chr11, chr12....chr2, chr22..."

wangpenhok commented 1 year ago

I cannot figure out what is wrong

lbeltrame commented 1 year ago

Did you try reordering the contigs so they all have the same order?

wangpenhok commented 1 year ago

Thanks for your advice, I may try this later. But it still confused me how come different orders are given after the same operation(exactly the same gunzip-normalize-gsort process as is shown below)

gunzip -c CosmicNonCodingVariants.vcf.gz | sed "s/^\([0-9]\+\)\t/chr\1\t/g" | sed "s/^MT/chrM/g" | sed "s/^X/chrX/g" | sed "s/^Y/chrY/g" > CosmicNonCodingVariants-fix-chrom.vcf.gz

bgzip -c CosmicNonCodingVariants-fix-chrom.vcf.gz > CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz

tabix CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz

bcftools norm --check-ref s --do-not-normalize -f /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz > CosmicNonCodingVariants-fix-chrom-bgzip-norm.vcf.gz

Lines total/split/realigned/skipped: 32580266/0/0/0

REF/ALT total/modified/added: 32580266/0/318

gsort --memory 20000 CosmicNonCodingVariants-fix-chrom-bgzip-norm.vcf.gz /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.fai > CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz

grep -v ^##contig CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz | bcftools annotate -h ./v97/GRCh38/CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort-contig_header.txt

bgzip -c CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz > CosmicNonCodingVariants-prep.vcf.gz

tabix CosmicNonCodingVariants-prep.vcf.gz

mv CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort-contig_header.txt CosmicNonCodingVariants-prep-contig_header.txt