Closed ARChakravarthy closed 3 years ago
For additional context - mutations were called with MuTect 2 on hg19, then lifted over to hg38 using Picard , and the samples were annotated using SnpEff with their cancer specific annotation (with a pedigree file).
It looks like the MHC alleles are malformed so the error is running because malformed MHC rows are removed from the table and this leads to attempting a data.table by operation on the vector 1:0 on an empty data.table. Please add an asterisk after A B or C to all of your MHC alleles. For example, HLA-A*03:01
The HLA alleles of the objects I called dput on are correctly formed, and the error persists despite this. Please see attached screenshot
Looks like it was github markdown messing up the MHC alleles my apologies!
I took a look with your SNV3 table using the 2.1.1 docker image. The issue appears to be that no matches are found in the metadata file because your transcript IDs do not have version numbers appended to them unfortunately. This results in the empty data table at the extraction step throwing the error. You will need to get all possible version numbers for transcripts derived from your variants.
@andrewrech Can you update readme example table and documentation to reflect need for transcript version numbers?
Hmm, so since SnpEff tends to not always append the version numbers to the output, would it be worth me creating a little function to basically do a left join on ENST identifiers with versioned ENST identifiers for inclusion in the package? I haven't contributed code to github repositories but I am happy to send along any code I develop by email so it can be a helper function.
Simply combining the transcript_id with all possible transcripts produced would not work because the cDNA change can vary based on the transcript. This must be done at the variant annotation level because the same alternate allele can have disparate effects on transcript coding sequences in different transcript versions. I'm not sure if the lack of versions as something to do with the hg19 liftover. @andrewrech might have some insight.
That is down to the output of snpEff rather than the variant caller per se because transcript information is not present in the VCFs, merely the coordinates and the sequences at them ; I will have to check how to get snpEff to kick out versioned annotations with the preconfigured database they have for hg38 ; since antigen.garnish depends on SnpEff annotated VCFs it may be a critical issue. My understanding is that different transcripts for a gene (ENSG) have different numbers (ENST) , and Ensembl annotation, at least as implemented in SnpEff, gives you unversioned output.
Right, I'm referring to snpeff, the variant annotator not a variant caller. We have not historically had a problem with this, calling variants against hg38 and annotating with snpEff but we usually work with hg38 aligned data from the start.
SnpEff gives versioned output. This is why we use it. The version is critical because ~5% on transcript cDNAs change across versions. This may or may not be an acceptable potential rate of error
Whoops sorry did not mean to close this
Can you tell me what versions of snpEff and the snpEff databases you've been using, and also if you go with the default standard vcf annotation (which prioritises canonical transcripts when it reports variants) , or with the traditional Eff format (-formatEff).
I've been using 4.1 with the reference being the pre-built snpEff database for GRCh38.103
Maybe that is too old? We test against this version from a few years ago. Think anything newer also gives versioned output AFAIK.
Yup, tested with 4.3 and I can confirm that output is now versioned by default , unlike 4.1. Maybe documentation would benefit from a version requirement explicitly specified for snpEff (>/= 4.3)
garnish_affinity keeps breaking with the following error message for both a vcf parsed using garnish_variants, and a sample_id, transcript, cDNA change, and HLA annotations. The pipeline works fine for the test data included with the package, for context.
Dput calls for the input objects follow
Tabular data was then generated from the VCFs
SNV3 <- SNV2%>%mutate(sample_id = "S1")%>%select(sample_id,transcript_id = feature_id, cDNA_change , MHC )
Dput result for SNV3
Calls
x <- garnish_affinity(SNV2) y <- garnish_affinity(dt = data.table(SNV3))
Both these calls produce the error message I posted at the start