andrewrech / antigen.garnish


dai_uuid output column #152

Closed SofiaOtero closed 11 months ago

SofiaOtero commented 2 years ago

Hi,

I have a question about the dai_uuid column. I want to compare the mutated peptides with their wild-type (wt) peptides, and I was told that this is the column to compare on. However, several peptides in my output file have an empty dai_uuid column. Should I just filter these peptides out, or is there a reason for this?

Kind regards Sofia

leeprichman commented 2 years ago

Hi Sofia,

The dai_uuid and blast_uuid columns are used to pair peptides with their matches for the differential agretopicity (DAI) calculation. blast_uuid is derived from searching the peptide against a reference proteome with BLAST, while dai_uuid is present when the input peptide is derived from a point mutation with a known wild-type for comparison. The most conservative (lowest) value is used for the final DAI value. A mutant peptide with no value in the dai_uuid column comes from a mutation that was not input as a point mutation from a VCF (direct input or a frameshift). A peptide with no value in the blast_uuid column did not have a suitable match by BLAST. If your peptide has a DAI value but no dai_uuid, then it was matched via the blast_uuid. Does that help?
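For readers who want to do the mutant versus wild-type comparison themselves, the pairing described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not antigen.garnish code: the column names `nmer` and `pep_type`, the `"wt"` label, and the example rows are assumptions made for the sketch, so check them against the headers in your own output file. Rows without a dai_uuid are set aside rather than dropped, since they may still have been matched via blast_uuid.

```python
from collections import defaultdict

# Values treated as "no identifier" (assumption: empty cells may be exported as NA).
MISSING = {"", "NA", "NaN", None}


def pair_by_dai_uuid(rows):
    """Group rows that share a dai_uuid into (mutant, wild_type) pairs.

    Returns (pairs, unpaired): pairs maps dai_uuid -> (mutant_row, wt_row),
    unpaired lists every row that could not be paired this way.
    """
    groups = defaultdict(list)
    unpaired = []
    for row in rows:
        key = row.get("dai_uuid")
        if key in MISSING:
            unpaired.append(row)  # no dai_uuid: keep aside, do not discard
        else:
            groups[key].append(row)

    pairs = {}
    for key, members in groups.items():
        # "pep_type" and the "wt" label are hypothetical names for this sketch.
        wt = [r for r in members if r["pep_type"] == "wt"]
        mut = [r for r in members if r["pep_type"] != "wt"]
        if wt and mut:
            pairs[key] = (mut[0], wt[0])  # takes the first of each if several exist
        else:
            unpaired.extend(members)
    return pairs, unpaired


# Made-up example rows, for illustration only.
rows = [
    {"nmer": "SIINFEKL", "pep_type": "mut", "dai_uuid": "a1", "blast_uuid": "b1"},
    {"nmer": "SIINFEKQ", "pep_type": "wt", "dai_uuid": "a1", "blast_uuid": ""},
    {"nmer": "AAAWYLKKL", "pep_type": "mut", "dai_uuid": "", "blast_uuid": "b2"},
]
pairs, unpaired = pair_by_dai_uuid(rows)
for uuid, (mutant, wild_type) in pairs.items():
    print(uuid, mutant["nmer"], wild_type["nmer"])
```

Inspecting `unpaired` before filtering shows how many of those rows still carry a blast_uuid.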

SofiaOtero commented 2 years ago

Thank you very much. I was just wondering why so many of the peptides have neither a blast_uuid nor a dai_uuid. Is it because you only search for perfect matches in the wild-type and reference proteome, or do you allow for some nucleotide mismatches?

In reply to an earlier question, you told me that the ensemble score is calculated as the mean of the netMHC and mhcflurry predictions, and I see that this is the case in my data. However, if neither netMHC nor mhcflurry was run but netMHCpan was, the ensemble score column is empty. Shouldn't it contain the value from netMHCpan in that case?

Kind regards Sofia

leeprichman commented 2 years ago

Hi Sofia,

Sorry for the delayed reply, I've started residency so my time to work on this is limited.

dai_uuid is present if the antigen.garnish input is recognized as a missense mutation (cDNA_change-level input or from a VCF) and an exact cognate wild-type peptide can be determined. blast_uuid is generated when the peptide (regardless of input type) is blasted against the normal genome to find near matches. A match is not always found because the BLAST parameters ignore simple repetitive sequences. The BLAST parameters were taken from Benjamin Greenbaum's 2017 Nature paper on neoantigen fitness. All possible BLAST matches that meet these criteria are returned, and the best match is used for the DAI calculation. If both a dai_uuid and a blast_uuid exist, the lower of the two resulting DAI values is returned in the DAI column.
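The "lowest value wins" rule for the DAI column can be illustrated with a small sketch. The function name `final_dai`, its arguments, and the numbers below are hypothetical and exist only for this example; it is not how antigen.garnish implements the step internally, only a restatement of the rule described above.

```python
def final_dai(dai_from_wild_type, dai_from_blast):
    """Return the more conservative (lower) of the available DAI values.

    Each argument is a float, or None when that pairing route
    (dai_uuid or blast_uuid) produced no match for the peptide.
    """
    candidates = [v for v in (dai_from_wild_type, dai_from_blast) if v is not None]
    return min(candidates) if candidates else None


# Made-up values, for illustration only.
print(final_dai(2.5, 1.2))    # both routes matched: the lower value is kept
print(final_dai(None, 3.0))   # only a BLAST match exists
print(final_dai(None, None))  # no match by either route, so no DAI
```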

Re netMHCpan: Ensemble_score would only contain a value if netMHCpan returns a nanomolar binding affinity and not just a percentile rank. If only a percentile rank is returned, which I believe is the case for some alleles, then the column will be blank. You can search for a column named something like "netMHCpan_affinity (nM)" to see whether a netMHCpan nanomolar affinity value was returned for that peptide.
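As a rough illustration of why the column can end up blank, here is a sketch that averages whichever nanomolar affinities are present and returns nothing when none are. The column names in `AFFINITY_COLUMNS` are placeholders (the thread only says the real name is "something like" the netMHCpan one), and the rows are made up; the exact set of predictors antigen.garnish averages should be checked against its documentation.

```python
from statistics import mean

# Hypothetical column names; replace with the headers in your own output.
AFFINITY_COLUMNS = [
    "netMHC_affinity (nM)",
    "mhcflurry_affinity (nM)",
    "netMHCpan_affinity (nM)",
]


def to_nm(value):
    """Convert a cell to a float, or None if the cell is empty or NA."""
    if value is None:
        return None
    text = str(value).strip()
    if text in ("", "NA", "NaN"):
        return None
    return float(text)


def ensemble_score(row, columns=AFFINITY_COLUMNS):
    """Mean of the nanomolar affinities present in the row, else None."""
    values = [to_nm(row.get(col)) for col in columns]
    values = [v for v in values if v is not None]
    return mean(values) if values else None


# Made-up rows, for illustration only.
both_tools = {"netMHC_affinity (nM)": "500", "mhcflurry_affinity (nM)": "300"}
rank_only = {"netMHCpan_affinity (nM)": "NA"}  # only a percentile rank was returned

print(ensemble_score(both_tools))
print(ensemble_score(rank_only))
```

A row where only a percentile rank is available has no nanomolar value to average, which matches the blank Ensemble_score described above.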

Hope that answers your questions!