frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data, it further leverages both deep learning and hierarchical Bayesian models to prioritize certain candidates for experimental validation
MIT License
35 stars 8 forks source link

Longer epitope sequences #31

Open p-levy opened 5 months ago

p-levy commented 5 months ago

Hi Frank,

First of all congrats and thanks to share this great tool!

I've managed to run it on a small subset of our own patients. We usually test the immunogenicity of neoantigens as longer epitopes (in general 25-mers). Is it possible to get a longer peptide sequence than the outputted predicted minimal epitopes? Like 25- or even 40-mer around the junctions. Maybe that's already an option but I couldn't find it.

Thanks again! Pierre

frankligy commented 5 months ago

Hi @p-levy,

Glad you are able to run the tool, correct me if I am wrong, you are hoping to get longer sequence flanking the splicing junction (25mer or 40mer) compared to what is currently reported 9mer or 10mer.

This is not coded in the current program, as when I started this project, we were particularly interested in MHC-I presented peptides which typically ranges from 8-11mer, and 9mer and 10mer covers the majority. But I think I might have the codes (Another project I am working on) that might be able to help you:

Given each SNAF reported Junction chromosome coordinates (chrx:xxxxx-xxxxx(strand))

Step 1: Download hg38 fasta and build a dictionary as below

from Bio.SeqIO.FastaIO import SimpleFastaParser
dict_fa = {}  # {chrom:seq}
with open(path_to_hg38.fa,'r') as in_handle:
    for title,seq in SimpleFastaParser(in_handle):
        dict_fa[title] = seq

Step 2: Given SNAF coordinate, retrieve the flanking sequence as long as you want

from Bio.Seq import Seq
MAX_PEP_LEN=11
MIN_PEP_LEN=8
overhang = 50
# phase definition refer to https://www.biostars.org/p/63864/
item = 'chrx:xxxxx-xxxxx(+)'
chrom,coord = item.split(':')
coord,strand = coord.split('(')
strand = strand.rstrip(')')
start,end = coord.split('-')
start,end = int(start),int(end)
if strand == '+':
    first_section = dict_fa[chrom][start-overhang:start]
    second_section = dict_fa[chrom][end-1:end-1+overhang]
elif strand == '-':
    old_first_section = dict_fa[chrom][start-overhang:start]
    old_second_section = dict_fa[chrom][end-1:end-1+overhang]
    # reverse complement process
    first_section = Seq(old_second_section).reverse_complement()
    second_section = Seq(old_first_section).reverse_complement()
# consider phase = 0
defacto_first = first_section[-(3*(MAX_PEP_LEN-1)):]
defacto_second = second_section[:(3*(MAX_PEP_LEN-1))]
defacto_junction = defacto_first + defacto_second
defacto_peptide = str(Seq(defacto_junction).translate(to_stop=False))
# consider phase = 1
defacto_first = first_section[-(3*(MAX_PEP_LEN-1)+2):]
defacto_second = second_section[:(3*(MAX_PEP_LEN-1)+1)]
defacto_junction = defacto_first + defacto_second
defacto_peptide = str(Seq(defacto_junction).translate(to_stop=False))
# consider phase = 2
defacto_first = first_section[-(3*(MAX_PEP_LEN-1)+1):]
defacto_second = second_section[:(3*(MAX_PEP_LEN-1)+2)]
defacto_junction = defacto_first + defacto_second
defacto_peptide = str(Seq(defacto_junction).translate(to_stop=False))

if '*' not in defecto_peptide:
    print(defecto_peptide)

Step 3: run NetMHCpan4.1

def run_netMHCpan(software_path,peptides,hlas,length,cmd_num=1,tmp_dir=None,tmp_name=None):
    # set the default
    if tmp_dir is None:
        if not os.path.exists(os.path.join(os.path.dirname(os.path.abspath(__file__)),'scratch')):
            os.mkdir(os.path.join(os.path.dirname(os.path.abspath(__file__)),'scratch'))
        tmp_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)),'scratch')
    if tmp_name is None:
        tmp_name = 'input_{}.pep'.format(os.getpid())
    # reformat/create to the strings that we need
    peptides_path = os.path.join(tmp_dir,tmp_name)
    with open(peptides_path,'w') as f:
        for pep in peptides:
            f.write('{}\n'.format(pep))
    # netMHCpan has a hard limit for 1024 characters for -a, so to be safe, 90 HLA as max for one goal, each input format 11 character 'HLA-B57:01,'
    if len(hlas) <= 90:
        hla_strings = ','.join(hlas)
        if cmd_num == 1:
            reconstruct = ' || '.join(['$3 == ' + '"{}"'.format(pep) for pep in peptides])
            cmd = '{} -p {} -a {} -BA -l {} | awk \'BEGIN {{OFS = "\\t"}} {{if ({}) {{print $3, length($3),$2,$13,$16,$18}}}}\''.format(software_path,peptides_path,hla_strings,length,reconstruct)
            '''
            ../external/netMHCpan-4.1/netMHCpan -p ./test.pep -a HLA-A01:01,HLA-A02:01 -BA -l 9 | awk 'BEGIN {OFS = "\t"} {if ($3 == "AAAWYLWEV" || $3 == "AAGLQDCTM" || $3 == "AARNIVRRA") {print $3, length($3),$2,$13, $15}}'
            '''
        try:
            df = pd.read_csv(StringIO(subprocess.run(cmd,shell=True,capture_output=True).stdout.decode('utf-8')),sep='\t',header=None)
            df.columns = ['peptide','mer','hla','score','binding_nM','identity']
        except:   # no stdout, just no candidates
            df = pd.DataFrame(columns=['peptide','mer','hla','score','identity'])

def run_netMHCpan_wrapper(peptide,hlas):
    # hlas is a list, each hla allele follow that format[HLA-A02:01]
    netMHCpan_path = '/path/to/netMHCpan-4.1/netMHCpan'
    df = run_netMHCpan(software_path=netMHCpan_path,
                       peptides=[peptide],
                       hlas=hlas,
                       length=len(peptide),
                       cmd_num=1,
                       tmp_dir=None,
                       tmp_name=None)
    return df

Step 4: Immunogenicity Prediction

Currently DeepImmuno only supports 9 and 10mer, but there are other tools I am aware of that have more flexible length requirement such as DeepHLApan.

I hope this could help a bit!

Best, Frank

p-levy commented 5 months ago

Hi Frank,

Thanks for your prompt and detailed answer! I played a bit with the code you shared and that's already helpful. However, I realized that what I would really like is to focus on the junctions where at least the first section is within an annotated CDS. Then I'd like to get the AA-seq of custom length (e.g. 25-mer) around the junction but only using the phase of the annotated CDS. Is there an "easy" way to do that? That would be great!

All the best, Pierre

frankligy commented 5 months ago

Hi @p-levy,

I see, for my current project, I actually decided to take all three phases due to now increasing report on non-canonical ORFs (cryptic ORFs) and the well-characterized mechanisms illustrating why it is common in cancer.

Ref1: https://www.nature.com/articles/s41587-021-01021-3 Ref2: https://www.nature.com/articles/s41586-023-06500-y

But I understand your need, I actually implemented this in the SNAF, but it is kind of obsolete and I haven't really rigorous tested this function, but I can share.

In your download database folder, you will find two required input:

[1] df_start_codon.txt [2] Homo_sapiens.GRCh38.91.gtf

You followed this code block (https://github.com/frankligy/SNAF/blob/main/snaf/snaf.py#L62-L96) to get two variable dict_start_codon and phase_inferer_gtf_dict.

I have a function to get_support_phase here (https://github.com/frankligy/SNAF/blob/main/snaf/snaf.py#L98-L189), but the phase definition here is a bit different (probably only makes sense to me) and don't want to confuse you, so I just want to explain the logic and you can modify the code as you need.

Screenshot 2024-02-16 at 11 03 31 AM

Hoping this could help, Frank

p-levy commented 5 months ago

Thanks! I'll test this asap.

p-levy commented 3 months ago

Hi Frank,

It took me a while but I think that thanks to your help I'm almost getting what I want. I still have two questions:

1) Is there somewhere in the outputs a list of all cancer-specific junction coordinates, irrespective of if some of the resulting peptides are HLA binders. So far I've been using the coord values reported in the T_antigen_candidates.txt files but I feel like junctions with non binders are not included there as the maximum binding affinity seems to be 2.

2) You're strategy to get the defacto_peptide sequence given above works pretty well in general. However I was wondering what would happen if the size of the peptide I'd want to retrieve would be greater than the size of some whole exons. This strategy would then be translating from intronic sequences, right? Do you think of an alternative to work with cDNA/CDS coordinates?

Thanks again! Pierre

frankligy commented 3 months ago

Hi @p-levy,

[1] yes, the frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt has the coordinates for all cancer specific junctions, and NeoJunction_statistics_maxmin.txt can tell you how all the junctions are filtered based on their presence in normal database to arrive at these cancer-specific junctions. And you can also go more low-level and get the coordinates for any junction:

from snaf import surface
surface.initialize(db_dir=db_dir)
uid = 'ENSG00000198034:E8.4-E9.1'
junction_coord = surface.main.uid_to_coord(uid)

[2] I agree with you, there will are cases where exon is shorter than your desired length, so that flanking intron sequence will be included. That's the caveats for this simple makeshift solution. There are definitely ways to achieve what you are looking for here, something came to my mind is first take a normal hg38 gtf file in which you can probably download from Gencode or UCSC genome browser. Then index it using tabix so that you can retrieve certain region of the chromsome.

module load samtools
module load htslib
sort -t $'\t' -k1,1 -k4,4n hg38.gtf | bgzip > hg38.gtf.gz
tabix -p gff hg38.gtf.gz

Now you can use the coordinate, and write some custom script to do that:

import pysam
tbx_file = pysam.TabixFile('hg38.gtf.gz')
feature_rows = tbx_file.fetch(chromsome,start-1,end)
for feature in feature_rows:
    # here, you will have the coordinate of each exon, cds, trancript, UTR and other features that are overlapping with your junction coordinate, You can save these coordinate into a nested list or dict. So you will know when extending the flanking sequence will exceed the exon boundary so you can jump to the next exon.

I hope this helps a bit, Frank