andrewrech / antigen.garnish

Other
45 stars 13 forks source link

Antigen Garnish output and Expression file #151

Closed SofiaOtero closed 2 years ago

SofiaOtero commented 2 years ago

Hello Andrew, I have a couple of questions regarding the antigen garnish output file:

  1. There is a column called pep_type with three different types (wt, mutnfs and mut_other), does this mean that the rows containing mutnfs and mut_other are the predicted neoepitopes and the rows containing wt are the normal peptides? So to find normal and mutant match one has to compare gene_id, position etc.?


  2. As I have only been able to run for MHC I molecules I get a column with affinity from netMHC and score_EL from netMHCpan and of course the %rank from both. As I can read from your paper the ensemble score is the average MHC binding affinity from all tools, but how do you calculate it?

    Pasted Graphic

    Here you can see that the same value is in netMHC affinity, best_netMHC and in ensemble score. This is the case for when running both your test example and my own data. Is this correct?


  3. In addition to previous question I do not get a column called mhcflurry_ even though I get following when running the tool:
Running mhcflurry in parallel. Reading mhcflurry output. Do you know what could be wrong since I get no output from mhcflurry?

  4. The output file I get is enormous, and I get a lot of empty columns especially for the last 7 columns in the data, but also in some cases I get peptides that do not have any transcript ID. Do you know why this could be the case, and do you have a better way to sort the data?


  5. Lastly I now noticed that there is a possibility to add an rna expression file to the prediction and I also have these files. Though I get following error:

 Done loading variants.
 Generating metadata.
 Reading local transcript metadata. 
Checking netMHC scripts in antigen.garnish data directory. 

Error in garnish_affinity(., counts = opt$EXP, peptide_length = 10:8) : 
 

Transcript expression matrix does not contain columns for all samples in input data.
 

Calls: %>% -> garnish_affinity
Removing temporary files.
 

Execution halted

 

 

My expression file has following format:

    ENST00000456328 2

    Is this the correct format? From the error I guess that it is because not all the transcripts in my VCF file are present in the expression file, is that correct?

Sorry for all the questions, if some of them are too difficult to answer you can just skip them, and thank you very much for your help.

Kind regards Sofia

leeprichman commented 2 years ago

Hi Sofia!

I can take a crack at answering these questions.

There is a column called pep_type with three different types (wt, mutnfs and mut_other), does this mean that the rows containing mutnfs and mut_other are the predicted neoepitopes and the rows containing wt are the normal peptides? So to find normal and mutant match one has to compare gene_id, position etc.?


The raw output table from garnish affinity contains every peptide-MHC combo passed to prediction, which includes wild-type peptides ("wt") for differential agretopicity calculation. "mut_nfs" indicates a non-frameshift mutation (point mutation) derived neopeptide. "mut_other" indicates frameshift OR an input format such as direct peptide input that does not make it clear what the peptide source is. If you want to match up a mutant peptide with all wild-type peptides that were used for differential agretopicity calculation (the lowest possible DA is reported to be conservative), you can use the DAI_uuid column.

As I have only been able to run for MHC I molecules I get a column with affinity from netMHC and score_EL from netMHCpan and of course the %rank from both. As I can read from your paper the ensemble score is the average MHC binding affinity from all tools, but how do you calculate it?

This value is just the arithmetic mean of all prediction tools run on the peptide that successfully returned a value. If both netMHC and netMHCpan were run, the superior prediction from netMHC is used only for the calculation. In your example, the reason all of the numbers line up most likely is because the allele was only supported by netMHC and not the other tools and so only one prediction was returned. For more detail, check out the cell systems paper.

In addition to previous question I do not get a column called mhcflurry_ even though I get following when running the tool: Do you know what could be wrong since I get no output from mhcflurry?

Not every allele is supported by mhcflurry. I'm not sure which data this is referring to, but if the allele is not supported by mhcflurry, that particular peptide will not return a value in the mhcflurry prediction column and will not be used for ensemble_score calculation.

The output file I get is enormous, and I get a lot of empty columns especially for the last 7 columns in the data, but also in some cases I get peptides that do not have any transcript ID. Do you know why this could be the case, and do you have a better way to sort the data?


The output from garnish_affinity can be enormous! We wanted to give back the raw data without a filter, but this can be dramatically filtered down using the approach from the TESLA consortium with the garnish_antigens function. Re: no transcript ID, these are likely wild-type peptides used for DA calculation.

Lastly I now noticed that there is a possibility to add an rna expression file to the prediction and I also have these files. Though I get following error:



This looks like a malformed expression file. Check out ?garnish_affinity for the correct format. It should be transcripts without version IDs in column 1, and each column should have a sample ID with count values like below.

image

Hope this helps!

SofiaOtero commented 2 years ago

Thank you very much for the fast reply.

The raw output table from garnish affinity contains every peptide-MHC combo passed to prediction, which includes wild-type peptides ("wt") for differential agretopicity calculation. "mut_nfs" indicates a non-frameshift mutation (point mutation) derived neopeptide. "mut_other" indicates frameshift OR an input format such as direct peptide input that does not make it clear what the peptide source is. If you want to match up a mutant peptide with all wild-type peptides that were used for differential agretopicity calculation (the lowest possible DA is reported to be conservative), you can use the DAI_uuid column.

I found the matches between wild type and mutation with the DAI_uuid, but I can't figure out how the ratio is calculated. If I have wt affinity: 13600.87 and mut affinity: 13479.71 I get a DAI on 0.198609544653864 how is this calculated?

Not every allele is supported by mhcflurry. I'm not sure which data this is referring to, but if the allele is not supported by mhcflurry, that particular peptide will not return a value in the mhcflurry prediction column and will not be used for ensemble_score calculation.

I am using the antigen.garnish test data from the GitHub: dir <- system.file(package = "antigen.garnish") %>% file.path(., "extdata/testdata") file <- file.path(dir, "TUMOR.vcf")

I double checked if the mhcflurry-predict command works, and it works fine. Then I looked into which alleles they support and tried to run with dt[, MHC := c("HLA-A*02:01")]. I still get this in STDOUT: Running mhcflurry in parallel. Input file 1 of 1 Merging output. Reading mhcflurry output.

but there is still no column with mhcflurry predictions in the antigen garnish output file.

This looks like a malformed expression file. Check out ?garnish_affinity for the correct format. It should be transcripts without version IDs in column 1, and each column should have a sample ID with count values like below.

Thank you, I will try with this format!

Kind regards Sofia

leeprichman commented 2 years ago

I found the matches between wild type and mutation with the DAI_uuid, but I can't figure out how the ratio is calculated. If I have wt affinity: 13600.87 and mut affinity: 13479.71 I get a DAI on 0.198609544653864 how is this calculated?

This is calculated from the ensemble score for each peptide as the ratio of wild-type to mutant affinity in nM, with a correction for large affinities taken from Luksza et al. Nature 2017. Every wt peptide with the same DAI_uuid has this ratio caluclated and the lowest is reported for the mutant peptide. Was there only one wild type peptide paired to that mutant in the final output? My guess is there is more than one and that wild type has an ensemble score around 2500nM.

Re: mhcflurry, can you save the raw mhcflurry output and post it here? I am wondering if it is malformed or empty. Is this running on docker or your local environment?

SofiaOtero commented 2 years ago

This is calculated from the ensemble score for each peptide as the ratio of wild-type to mutant affinity in nM, with a correction for large affinities taken from Luksza et al. Nature 2017. Every wt peptide with the same DAI_uuid has this ratio caluclated and the lowest is reported for the mutant peptide. Was there only one wild type peptide paired to that mutant in the final output? My guess is there is more than one and that wild type has an ensemble score around 2500nM.

If you look at this file where I have only run the test data on one allele, there is only two and two matches in the DAI_uuid and multiple of them do not even have a DAI_uuid, are you sure it is not another column? testexample.txt.zip

Re: mhcflurry, can you save the raw mhcflurry output and post it here? I am wondering if it is malformed or empty. Is this running on docker or your local environment?

How can I print the mhcflurry file, it gets deleted during the run so I only get a final output file. I am running it through my terminal on a supercomputer from my university.

Kind regards Sofia

leeprichman commented 2 years ago

So I plugged in to do the correction and this DAI is accurate for those values. Large affinity predictions are usually quite skewed, and so Ben Greenbaum implemented a correction that we deployed using the formula to prevent really high DAI values from occuring when both peptides bind weakly but the mutant is still lower nanomolar affinity:

(Wt / mut) (1 / (1 + (0.0003 wt)))

Which gives:

(13600.87 / 13479.71) (1 / (1 + (0.0003 13600.87))) = 0.1986095

To recover the mhcflurry file, first try running it from the command line with AHLFCLLACDCDL and HLA-A*02:01. Make a table with column names "allele" and "peptide" and then run it with something like mhcflurry-predict myfile.txt. If that works, we can see whats happening when it gets called from R.

SofiaOtero commented 2 years ago

Thanks!!

This looks like a malformed expression file. Check out ?garnish_affinity for the correct format. It should be transcripts without version IDs in column 1, and each column should have a sample ID with count values like below.

I tried to make my expression file correct but I still get the same error: "Transcript expression matrix does not contain columns for all samples in input data.", is this because the expression file does not contain all the transcripts in the vcf file?

To recover the mhcflurry file, first try running it from the command line with AHLFCLLACDCDL and HLA-A*02:01. Make a table with column names "allele" and "peptide" and then run it with something like mhcflurry-predict myfile.txt. If that works, we can see whats happening when it gets called from R.

I ran: mhcflurry-predict --alleles HLA-A0201 --peptides AHLFCLLACDCDL and got following output: allele,peptide,mhcflurry_affinity,mhcflurry_affinity_percentile,mhcflurry_processing_score,mhcflurry_presentation_score,mhcflurry_presentation_percentile HLA-A0201,AHLFCLLACDCDL,32158.363933979257,66.34875000000001,0.0005446150898933411,0.0031560329393376103,99.28660326086957 So it seems like it works fine?

Kind regards Sofia

leeprichman commented 2 years ago

Let's try running it from within R and writing to and from files so we can see if its a permissions issue or something.

library(data.table)
library(magrittr)
library(antigen.garnish)

dt <- data.table(allele = "HLA-A0201", peptide = "AHLFCLLACDCDL")

dt %>% data.table::fwrite("myfile.csv", quote = FALSE, sep = ",")

# check writing worked
file.exists("myfile.csv")

system("mhcflurry-predict myfile.csv --output myfileout.csv")

# check output
data.table::fread("myfileout.csv") %>% print

Let me know what the output looks like from this.

SofiaOtero commented 2 years ago

The output from this looks like following:

Wrote: myfileout.csv allele peptide mhcflurry_affinity mhcflurry_affinity_percentile 1: HLA-A0201 AHLFCLLACDCDL 32158.36 66.34875 mhcflurry_processing_score mhcflurry_presentation_score 1: 0.0005446151 0.003156033 mhcflurry_presentation_percentile 1: 99.2866

leeprichman commented 2 years ago

I forgot to answer your question about the count matrix. That error is thrown when there is not one column in the count matrix for each unique sample_id in the table. The names must match perfectly and you cannot have samples without count data listed in the input table.

Hmm looks like mhcflurry is working. Try running predictions on just this one peptide.

library(data.table)
library(magrittr)
library(antigen.garnish)

dt <- data.table(nmer = "AHLFCLLACDCDL", MHC = "HLA-A*02:01", mutant_index = 1, sample_id = "test")

dto <- garnish_predictions(dt)

# see if the columns showed up
print(names(dto)[names(dto) %like% "mhcflurry"]

The next step I can think of would be debugging and checking the interim output which is a lot more involved. I would consider using the docker if this is giving you problems.

SofiaOtero commented 2 years ago

I forgot to answer your question about the count matrix. That error is thrown when there is not one column in the count matrix for each unique sample_id in the table. The names must match perfectly and you cannot have samples without count data listed in the input table.

Thank you, I will try to remove the transcripts with no expression before running then.

Hmm looks like mhcflurry is working. Try running predictions on just this one peptide.

Is there a function called garnish_predictions? Because I get following output: Error in garnish_predictions(dt) : could not find function "garnish_predictions" Execution halted

leeprichman commented 2 years ago

garnish_affinity Sorry!

Sorry if it's not clear! What I mean to say is that It's not about the transcripts with expression of 0, its the columns with sample names. The matrix is formatted with rows as transcripts (named in the first column), and each subsequent column representing the counts for each sample. Every sample in the antigen.garnish input sample_id column must have a corresponding column with the same name in the count matrix.

SofiaOtero commented 2 years ago

Yes I understand it like all transcripts in my input vcf file must be pressent in the expression file.

I get this output from running the above code: Error in if (!(input_type == "transcript" || input_type == "peptide")) { : missing value where TRUE/FALSE needed Calls: garnish_affinity Removing temporary files. Execution halted

leeprichman commented 2 years ago

That's my fault for trying to do this without spinning up an instance and remembering the correct minimal column names. below should work.


library(data.table)
library(magrittr)
library(antigen.garnish)

dt <- data.table(pep_mut = "AHLFCLLACDCDL", MHC = "HLA-A*02:01", mutant_index = 1, sample_id = "test")

dto <- garnish_affinity(dt)

# see if the columns showed up
print(names(dto)[names(dto) %like% "mhcflurry"]
SofiaOtero commented 2 years ago

It says: Error: unexpected end of input in the end:

Running mhcflurry in parallel. Input file 1 of 1 Merging output. Reading mhcflurry output. Calculating netMHC consensus score. Calculating overall consensus affinity score. Calculating differential agretopicity. Determining peptide pairs for BLAST. Setting up dissimilarity calculation. Removing temporary files. Error: unexpected end of input Execution halted

and the output file does unfortunately not contain the mhcflurry prediction

leeprichman commented 2 years ago

That's an unexpected error and the run shouldn't have produced output. i suspect that might be from a prior run that didn't error. Can you try again with these additional arguments passed? We will first focus on troubleshooting the mhcflurry predictions.

library(data.table)
library(magrittr)
library(antigen.garnish)

dt <- data.table(pep_mut = "AHLFCLLACDCDL", MHC = "HLA-A*02:01", mutant_index = 1, sample_id = "test")

dto <- garnish_affinity(dt, save = FALSE, blast = FALSE)

# see if the columns showed up
print(names(dto)[names(dto) %like% "mhcflurry"])

# to trouble shoot blast just in case
system("which blastp")
SofiaOtero commented 2 years ago

It looks like blast does not run:

Reading mhcflurry output. Calculating netMHC consensus score. Calculating overall consensus affinity score. Calculating differential agretopicity. Determining peptide pairs for BLAST. BLAST did not run. Removing temporary files. Error: unexpected symbol in: "# to trouble shoot blast just in case system" Execution halted

leeprichman commented 2 years ago

Thats ok we told blast not to run with blast = FALSE. I was missing a parenthesis, can you rerun the lines print(names(dto)[names(dto) %like% "mhcflurry"]) and system("which blastp") if you are still in the environment with the output object dto saved?

SofiaOtero commented 2 years ago

From the run: Removing temporary files. character(0) /home/projects/SRHgroup/apps/antigen.garnish/ncbi-blast-2.10.1+-src/c++/ReleaseMT/bin/blastp

I got the mhcflurry-predict.log file containing: Traceback (most recent call last): File "/services/tools/anaconda3/2021.11/bin/mhcflurry-predict", line 8, in sys.exit(run()) File "/services/tools/anaconda3/2021.11/lib/python3.9/site-packages/mhcflurry/predict_command.py", line 200, in run models_dir = get_default_class1_presentation_models_dir(test_exists=True) File "/services/tools/anaconda3/2021.11/lib/python3.9/site-packages/mhcflurry/downloads.py", line 122, in get_default_class1_presentation_models_dir return get_path( File "/services/tools/anaconda3/2021.11/lib/python3.9/site-packages/mhcflurry/downloads.py", line 223, in get_path raise RuntimeError( RuntimeError: Missing MHCflurry downloadable file: /home/projects/SRHgroup/apps/.local/share/mhcflurry/4/2.0.0/models_class1_presentation/models. To download this data, run: mhcflurry-downloads fetch models_class1_presentation in a shell.

SofiaOtero commented 2 years ago

I can see that the path where I have the models_classI_presentation is: /home/people/sofote/.local/share/mhcflurry/4/2.0.0/models_class1_presentation/models

Can I force it to take this instead of the one it is trying to find that does not exist?

leeprichman commented 2 years ago

We let mhcflurry install and configure itself using their instructions. Any path issues would be related to an improper mhcflurry install, though I'm not sure why calling it from R is behaving differently from the command line where it seemed to work earlier. You can try reinstalling or trouble shooting on their page, but easiest solution is probably to just run the docker image and not have to worry about any of this.

SofiaOtero commented 2 years ago

Thank you so much for the help! I just made a directory with the path: /home/projects/SRHgroup/apps/.local/share/mhcflurry/4/2.0.0/models_class1_presentation/models and downloaded the models into it and now it works!