broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

IndexFeatureFile using GencodeGtfCodec fails for GENCODE v38 #7385

Open robby81 opened 3 years ago

robby81 commented 3 years ago

As stated in the title: I tried the new GATK version 4.2.1.0 to update the GENCODE data for Funcotator, and IndexFeatureFile fails on the v38 GTF.

Log:

/home/robby/Tools/NGS/gatk-4.2.1.0/gatk IndexFeatureFile -I /home/robby/Tools/NGS/gencode/hg19/gencode.v38lift37.annotation.REORDERED.gtf
Using GATK jar /home/robby/Tools/NGS/gatk-4.2.1.0/gatk-package-4.2.1.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/robby/Tools/NGS/gatk-4.2.1.0/gatk-package-4.2.1.0-local.jar IndexFeatureFile -I /home/robby/Tools/NGS/gencode/hg19/gencode.v38lift37.annotation.REORDERED.gtf
14:34:51.448 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/robby/Tools/NGS/gatk-4.2.1.0/gatk-package-4.2.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Aug 02, 2021 2:34:51 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
14:34:51.566 INFO IndexFeatureFile - ------------------------------------------------------------
14:34:51.566 INFO IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.2.1.0
14:34:51.566 INFO IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/
14:34:51.572 INFO IndexFeatureFile - Initializing engine
14:34:51.572 INFO IndexFeatureFile - Done initializing engine
14:34:51.674 WARN GencodeGtfCodec - GENCODE GTF Header line 1 has a version number that is above maximum tested version (v 34) (given: 38): ##description: evidence-based annotation of the human genome (GRCh38), version 38 (Ensembl 104), mapped to GRCh37 with gencode-backmap Continuing, but errors may occur.
14:34:51.676 WARN GencodeGtfCodec - GENCODE GTF Header line 1 has a version number that is above maximum tested version (v 34) (given: 38): ##description: evidence-based annotation of the human genome (GRCh38), version 38 (Ensembl 104), mapped to GRCh37 with gencode-backmap Continuing, but errors may occur.
14:34:51.679 INFO FeatureManager - Using codec EnsemblGtfCodec to read file file:///home/robby/Tools/NGS/gencode/hg19/gencode.v38lift37.annotation.REORDERED.gtf
14:34:51.684 INFO ProgressMeter - Starting traversal
14:34:51.684 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
14:34:51.694 INFO IndexFeatureFile - Shutting down engine
[August 2, 2021 at 2:34:51 PM CEST] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=113246208
java.lang.IllegalArgumentException: Unexpected value: Ensembl_canonical
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature$FeatureTag.getEnum(GencodeGtfFeature.java:1391)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature.<init>(GencodeGtfFeature.java:197)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfTranscriptFeature.<init>(GencodeGtfTranscriptFeature.java:19)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfTranscriptFeature.create(GencodeGtfTranscriptFeature.java:23)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature$FeatureType$2.create(GencodeGtfFeature.java:768)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature.create(GencodeGtfFeature.java:327)
    at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:138)
    at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:23)
    at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:43)
    at org.broadinstitute.hellbender.utils.codecs.ProgressReportingDelegatingCodec.decodeLoc(ProgressReportingDelegatingCodec.java:46)
    at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:689)
    at htsjdk.tribble.index.IndexFactory$FeatureIterator.<init>(IndexFactory.java:606)
    at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:446)
    at org.broadinstitute.hellbender.tools.IndexFeatureFile.createAppropriateIndexInMemory(IndexFeatureFile.java:118)
    at org.broadinstitute.hellbender.tools.IndexFeatureFile.doWork(IndexFeatureFile.java:75)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

If more information is needed, I can provide it.
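
Since the exception names a single tag value (Ensembl_canonical), it can help to list every tag value a GTF file contains before indexing it. The following is a minimal Python sketch, not part of GATK: the function name count_gtf_tags is hypothetical, and it assumes the usual GTF layout of nine tab-separated columns, with attributes such as tag "Ensembl_canonical"; in the last column.

```python
import re
from collections import Counter

# Matches attribute entries of the form: tag "some_value";
TAG_PATTERN = re.compile(r'\btag "([^"]+)"')


def count_gtf_tags(lines):
    """Tally the tag values found in the attribute column of GTF lines.

    Hypothetical helper for inspection only; it does not validate the file.
    """
    counts = Counter()
    for line in lines:
        # Header and comment lines start with '#'.
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip anything that does not have the nine GTF columns.
        if len(fields) < 9:
            continue
        counts.update(TAG_PATTERN.findall(fields[8]))
    return counts


# Example use on a file (path is a placeholder):
#   with open("annotation.gtf") as handle:
#       for tag, n in count_gtf_tags(handle).most_common():
#           print(n, tag)
```

Comparing the resulting tag names with the ones the installed GATK version recognizes shows which values are likely to trigger the same exception.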

robby81 commented 3 years ago

I used gatk-4.2.1.0-src/gatk-4.2.1.0/scripts/funcotator/data_sources/getGencode.sh to get v38 data.

droazen commented 3 years ago

@jonn-smith Could you comment on this one? The tool output clearly states that we don't support this version of Gencode, and that errors may occur:

 GENCODE GTF Header line 1 has a version number that is above maximum tested version (v 34) (given: 38): ##description: evidence-based annotation of the human genome (GRCh38), version 38 (Ensembl 104), mapped to GRCh37 with gencode-backmap Continuing, but errors may occur.

Do we claim to support v38 anywhere (e.g., in the documentation)?

jonn-smith commented 3 years ago

@robby81 - you did all the right things to create a new gencode datasource for Funcotator (though I should mention that the script to download the gencode version is unsupported).

However, we do not currently support Gencode v38. This is a technical limitation that I would like to remove in the near future. The error you've encountered is an example of it: Gencode can contain arbitrary fields (here, the Ensembl_canonical tag), so the parser needs to be updated, as well as some of the output maps.
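
To illustrate the limitation in general terms, here is a minimal Python sketch contrasting a strict lookup against a fixed set of known tags with a tolerant one that sets unknown values aside. It is not GATK's code (GATK is written in Java), and the names FeatureTag, parse_tag_strict and parse_tags_tolerant, as well as the two-entry tag list, are assumptions made for the example.

```python
from enum import Enum


class FeatureTag(Enum):
    # Hypothetical, deliberately tiny set of known tags for illustration.
    BASIC = "basic"
    CCDS = "CCDS"


def parse_tag_strict(value):
    """Look up a tag in the fixed set; raises ValueError for unknown values."""
    return FeatureTag(value)


def parse_tags_tolerant(values):
    """Split tags into recognized enum members and unrecognized raw strings."""
    known, unknown = [], []
    for value in values:
        try:
            known.append(FeatureTag(value))
        except ValueError:
            # Keep the raw string instead of aborting the whole parse.
            unknown.append(value)
    return known, unknown
```

With the strict variant, a single new tag in a newer release stops the whole run; the tolerant variant keeps going and leaves it to the caller to decide what to do with the unrecognized values.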

dheiman commented 1 year ago

Has there been any progress on this? The CPTAC NCI consortium is going with GENCODE 42; it would be great if it were more straightforward to update the GENCODE version in our local data source.

jonn-smith commented 1 year ago

@dheiman No updates yet. I've been working on another project that is time-sensitive. When that project is complete I'm planning on spending a couple of weeks fixing this and several other related funcotator issues.

To my chagrin, I do not have an estimate on when this time-sensitive project will be complete (I am working to get it done ASAP - it has had a series of continuous deadlines for a long time now).

zhanyinx commented 1 year ago

Hey there,

Is there any news on this? I would also like to update the GENCODE database for Funcotator.

Thanks,
Best,
Zhan

droazen commented 1 year ago

@zhanyinx Some changes need to be made to the tool to support the newer releases of Gencode. We're planning on addressing that this quarter, after which we'll do a new data source release.

zhanyinx commented 1 year ago

@droazen Thanks