andrewrech / antigen.garnish

Enhanced Documentation. GRCh37 reference genome. #136

Closed nickhir closed 3 years ago

nickhir commented 3 years ago

Hello,

I was wondering if you plan on extending the documentation for antigen.garnish 2.0.0. If I am not mistaken, the documentation on the homepage is a little outdated. For example, the command antigen.summary() does not exist in my version (packageVersion("antigen.garnish") returns '2.0.0'). It would be extremely helpful to see the preprocessing steps that you performed, e.g. annotation with vep or SnpEff, and also how to properly save the final result, for example as a .tsv file.

Furthermore, I was wondering if I can also use your tool if my vcf file was created using the GRCh37 reference genome.

Cheers!

andrewrech commented 3 years ago

Hi @nickhir

Thanks for your interest in the software and your questions.

Sorry about the homepage being out of date - we messed up. I will fix that. The reference manual contains the same material and is up-to-date. The function documentation in R itself using ? is also always up-to-date.

e.g. annotation with vep or SnpEff

Nothing fancy - here is an example. We use default parameters for SnpEff.

save the final result as a .tsv file for example

garnish_affinity returns a data frame that can be saved from R via your favorite method. rio is nice.
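
As a minimal sketch of the "favorite method" part, the snippet below writes a data frame to a tab-separated file with base R and reads it back. The `result` object is a small stand-in, not real garnish_affinity output, and its column names and the output file name are assumptions made for illustration.

```r
# Stand-in for the data frame returned by garnish_affinity();
# these columns and values are illustrative only.
result <- data.frame(
  sample_id = c("test", "test"),
  MHC = c("HLA-A*02:01", "HLA-E*01:03"),
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temporary directory keeps the sketch side-effect free
out_file <- file.path(tempdir(), "garnish_result.tsv")

# Tab-separated, no row names, no quoting
write.table(result, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.delim(out_file, stringsAsFactors = FALSE)
```

If rio or data.table is installed, `rio::export(result, out_file)` or `data.table::fwrite(result, out_file, sep = "\t")` should give an equivalent file.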

Furthermore, I was wondering if I can also use your tool if my vcf file was created using the GRCh37 reference genome.

GRCh37 will work if the transcript IDs you used for annotations (e.g. via SnpEff) are in the custom transcript DB antigen.garnish uses to determine amino acid sequences for predictions. The transcript DB file (GRChm38_meta.RDS) is in the data directory antigen.garnish downloads during installation. The default location is "$HOME/antigen.garnish". You can re-download the data files if needed.

Could you load the transcript DB into R and check if your annotations are included?

db <- readRDS("GRChm38_meta.RDS")
str(db)

If they are not, re-annotating should not be a major hurdle.
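
The check above can be sketched as a simple membership test. This assumes the DB has a transcript_id column (as reported later in this thread); the `db` object here is a two-row stand-in so the snippet runs without the data directory, and `my_ids` is a hypothetical vector of transcript IDs taken from your annotated VCF.

```r
# Stand-in for: db <- readRDS("GRChm38_meta.RDS")
# The rows below are illustrative, not a claim about the real DB contents.
db <- data.frame(
  transcript_id = c("ENST00000128119.1", "ENST00000376887.4"),
  stringsAsFactors = FALSE
)

# Hypothetical transcript IDs extracted from your annotations
my_ids <- c("ENST00000128119.1", "ENST00000376887")

# TRUE where the annotated ID is present in the DB exactly as written
found <- my_ids %in% db$transcript_id

# IDs that would need re-annotation
missing_ids <- my_ids[!found]
```

Note that the comparison is exact, so an unversioned ID does not match its versioned counterpart.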

Ping me here if you run into trouble, I'm happy to help further.

Thanks

Andrew

nickhir commented 3 years ago

Thank you so much for the fast and detailed reply. It is much appreciated! Currently I have annotated my VCF files with vep but now I will use SnpEff instead and then check if the transcript IDs are present in the GRChm38_meta.RDS file.

On a different note: does garnish_variants expect additional information on top of the SnpEff annotation? By that I mean information such as the GT, AF, DP, ... in the FORMAT field of the vcf file.

And lastly: Am I correct in assuming that the "usual" workflow of antigen.garnish is: garnish_variants -> garnish_affinity -> garnish_antigens?

Thank you very much! Nick

andrewrech commented 3 years ago

garnish_variants does not need additional information, but the VCF must conform to the VCF specification or vcfR will likely fail to parse it.

That is the standard workflow.
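
Regarding the VCF needing to meet spec: a rough pre-check in base R might look like the sketch below. It only tests two things the VCF specification requires (a `##fileformat` first line and the eight fixed header columns) and is not a substitute for actually parsing the file with vcfR. The example VCF content and the file name are made up for illustration.

```r
# Write a tiny illustrative VCF to a temporary file
vcf_file <- file.path(tempdir(), "example.vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR",
  "1\t12345\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1"
), vcf_file)

vcf_lines <- readLines(vcf_file)

# The first line must declare the file format
has_fileformat <- grepl("^##fileformat=VCF", vcf_lines[1])

# There must be exactly one column header line starting with the eight fixed fields
header <- vcf_lines[grepl("^#CHROM", vcf_lines)]
required <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")
has_columns <- length(header) == 1 &&
  identical(strsplit(header, "\t", fixed = TRUE)[[1]][1:8], required)

# The real test is whether the parser accepts the file (requires vcfR, not run here):
# vcf <- vcfR::read.vcfR(vcf_file)
```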

nickhir commented 3 years ago

If garnish_variants does not need additional information, why does the vcf have to be a paired tumor-normal vcf file? Wouldn't it be enough to simply record the tumor mutations in the "TUMOR" column?

andrewrech commented 3 years ago

Wouldn't it be enough to simply record the tumor mutations in the "TUMOR" column.

This is correct. I just briefly reviewed the code and I am not certain that a paired VCF file is actually required. However, our test VCFs are all paired, so I am not certain a non-paired VCF will work. I can look into this further if it is a work-stopping issue for you.

Maybe it is simpler for you to use a table as input? You can pass a data frame directly to garnish_affinity. Four columns are required. Here is an example (also pasted below).

> str(dt)
Classes ‘data.table’ and 'data.frame':  2 obs. of  4 variables:
 $ sample_id    : chr  "test" "test"
 $ transcript_id: chr  "ENST00000128119.1" "ENST00000128119.1"
 $ cDNA_change  : chr  "c.4988C>T" "c.4988C>T"
 $ MHC          : chr  "HLA-A*02:01 HLA-E*01:03" "HLA-DQA10402-DQB10511"
> dt
   sample_id        transcript_id cDNA_change                     MHC
1:      test ENST00000128119.1   c.4988C>T HLA-A*02:01 HLA-E*01:03
2:      test ENST00000128119.1   c.4988C>T   HLA-DQA10402-DQB10511
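
A minimal sketch of building the same four-column table by hand, using a base R data frame so it runs without extra packages. The values are copied from the printed example. Whether garnish_affinity needs a data.table rather than a plain data frame is not confirmed here, so the conversion and the call are left as commented-out assumptions.

```r
# The four required columns, with values taken from the example above
dt <- data.frame(
  sample_id     = c("test", "test"),
  transcript_id = c("ENST00000128119.1", "ENST00000128119.1"),
  cDNA_change   = c("c.4988C>T", "c.4988C>T"),
  MHC           = c("HLA-A*02:01 HLA-E*01:03", "HLA-DQA10402-DQB10511"),
  stringsAsFactors = FALSE
)

# Fail early if a required column is missing
required <- c("sample_id", "transcript_id", "cDNA_change", "MHC")
stopifnot(all(required %in% names(dt)))

# Assumed next steps (need data.table and antigen.garnish, not run here):
# dt <- data.table::as.data.table(dt)
# result <- antigen.garnish::garnish_affinity(dt)
```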

nickhir commented 3 years ago

Sorry to ask yet another somewhat unrelated question, but it seems like curl -fsSL "http://get.rech.io/antigen.garnish-2.0.0.tar.gz" | tar -xvz doesn't work anymore, because the URL is returning an error. I changed some things and wanted to re-download the test data, and ran into this error:

curl: (22) The requested URL returned error: 403

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Regarding the transcript IDs of GRCh37: If I use the default SnpEff command, my variants only get annotated with the transcript ID (ENST00000376887), but not with the exact version (ENST00000376887.4). I saw that the transcript_id column of GRChm38_meta.RDS uses transcripts with exact version numbers. Furthermore, in the example VCF files, the transcripts were also annotated with exact version numbers. I guess that this will cause problems in the actual analysis.

Do you have an idea what I can do to prevent this? Using vep I can specify that I want the versioned transcripts (--transcript_version), but I didn't find a similar option for SnpEff.

andrewrech commented 3 years ago

I think you tried downloading this file as I was re-uploading it... can you try again?

I guess that this will cause problems in the actual analysis.

Correct. cDNA sequences change over time -- without knowing which versions were used, we can't know the sequences.

Not sure how to get versioned transcripts on GRCh37, sorry.
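
To see how far off the annotations are, one diagnostic is to list which versioned DB entries share a stable ID with each unversioned annotation. The sketch below only reports candidates. It does not tell you whether the DB version matches the transcript version your annotation was based on, which is the point made above. The `db_ids` vector is a stand-in for the transcript_id column of the DB, and the second ID in `my_ids` is invented to show the no-match case.

```r
# Stand-in for the transcript_id column of the transcript DB (illustrative values)
db_ids <- c("ENST00000376887.4", "ENST00000128119.1")

# Unversioned IDs as produced by the annotation; the second one is made up
my_ids <- c("ENST00000376887", "ENST00000000001")

# Strip a trailing ".<digits>" version suffix
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

# For each unversioned ID, collect the versioned DB entries with the same stable ID
candidates <- lapply(my_ids, function(id) db_ids[strip_version(db_ids) == id])
names(candidates) <- my_ids
```

An empty entry in `candidates` means the transcript is absent from the DB under any version.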