Understanding output columns

spvensko commented 9 months ago

Hello! I was able to get the tool to work and have analyzed two Hugo et al., 2016 patients. I am reviewing the outputs from frequency_stage3_verbosity1_uid_gene_symbol_coord_mean_mle.txt and have a quick question regarding what each column is describing.

For reference, I am using GATK's Homo_sapiens.assembly38.fasta reference fasta for alignment and GENCODE's v37, v43, and v45 GTFs to determine exonic coordinates.

With that in mind, here is a line from the file:

EKDTPRYSF,ENSG00000004846:E15.1-E16.1   ['HugoLo_IPRES_2016-Pt01-ar-279.Aligned.sortedByCoord.out.bed', 'HugoLo_IPRES_2016-Pt02-ar-280.Aligned.sortedByCoord.out.bed']  2   ABCB5   chr7:20658676-20659066(+)   0.5102824568748474  0.9999946440084598

The peptide of interest is EKDTPRYSF which is from gene ENSG00000004846. The chr7:20658676-20659066(+) should be genomic coordinates containing the peptide of interest.
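For anyone else reading these rows, here is a minimal sketch of splitting one into named fields. It assumes the file is tab-separated and that the last five columns follow the order in the file name (uid, gene_symbol, coord, mean, mle). The field names `samples` and `n_samples` are my own labels, not taken from the tool, and the meaning of `mean` and `mle` is not something this sketch tries to interpret.

```python
import ast
from typing import List, NamedTuple


class Row(NamedTuple):
    peptide: str
    uid: str
    samples: List[str]   # hypothetical name: samples supporting the peptide
    n_samples: int       # hypothetical name: appears to equal len(samples)
    gene_symbol: str
    coord: str
    mean: float
    mle: float


def parse_row(line: str) -> Row:
    """Split one output line into fields, assuming tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    # First column looks like "<peptide>,<gene>:<event>"
    peptide, uid = fields[0].split(",", 1)
    # Second column is printed as a Python list literal of file names
    samples = ast.literal_eval(fields[1])
    return Row(peptide, uid, samples, int(fields[2]), fields[3],
               fields[4], float(fields[5]), float(fields[6]))


EXAMPLE = "\t".join([
    "EKDTPRYSF,ENSG00000004846:E15.1-E16.1",
    "['HugoLo_IPRES_2016-Pt01-ar-279.Aligned.sortedByCoord.out.bed', "
    "'HugoLo_IPRES_2016-Pt02-ar-280.Aligned.sortedByCoord.out.bed']",
    "2", "ABCB5", "chr7:20658676-20659066(+)",
    "0.5102824568748474", "0.9999946440084598",
])

row = parse_row(EXAMPLE)
print(row.peptide, row.uid, row.gene_symbol, row.coord)
```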

If I pull chr7:20658676-20659066 from Homo_sapiens.assembly38.fa and then translate it through ExPASy's web service, the peptide of interest, EKDTPRYSF, doesn't appear to be present:

samtools faidx Homo_sapiens.assembly38.no_ebv.fa chr7:20658676-20659066
>chr7:20658676-20659066
GGTAAGTGAGCAGAAACGTTTCTTATTTCCATACTCCTGGTTCATTATTGTTTTGAAGTA
CAAGAAAGTATAGATCTGTAATAGATTACTCAAGTTGAGAGCCCTCTTAAGGTATAAAGG
CAGGATGTTAATCCACTGAGAACTTACGTGATGGCTATAGGAAGTGGTTTAGAGGACAGA
AGGAGATGCTGTGGTTGGTTGGTGTAAAAATATATACATGAGGCTGATACACAAGCAATC
ATCCAGTCTATACCTCCATTCCAAGTGGTTTGCACTTTCCACCTCCCTAGAGTGGCCCAC
CACTATCATCACTATTATAACCATGCCCACCCTTTGCTTCTTCTACATACACCTGTGGGA
TTCTCTTCTCTGACCACTTTTCTTCTTTAGG

Can you please help me understand these columns so I may better understand my results?

Thanks, Steven P. Vensko II

frankligy commented 9 months ago

Hi @spvensko,

Glad the tool ran smoothly on your end. The chromosome coordinates are the 5' splice site and the 3' splice site, so the junction jumps from 20658676 to 20659066 rather than including the sequence in between, if that makes sense.

Please see the screenshots below for the peptide generated from this junction:

[Screenshots: Screenshot 2024-01-30 at 12 33 56 PM, Screenshot 2024-01-30 at 12 34 03 PM]
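To make the "jump" concrete, here is a rough sketch of joining the sequence upstream of the 5' site to the sequence downstream of the 3' site and translating the result in three frames. It runs on a made-up toy sequence, not the real chr7 locus. It assumes plus-strand events and that the two coordinates are 1-based and name the last base of the upstream exon and the first base of the downstream exon. That reading is consistent with the samtools output earlier in the thread (a `G`, then `GT...AG`, then a `G`), but it is an assumption, not something confirmed here. The helper names and the `flank` parameter are illustrative only.

```python
import re

# Standard genetic code, built from the usual T, C, A, G ordering
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def parse_coord(coord):
    """Parse a string like 'chr7:20658676-20659066(+)'."""
    m = re.fullmatch(r"(\S+):(\d+)-(\d+)\(([+-])\)", coord)
    if m is None:
        raise ValueError(f"unrecognised coordinate: {coord!r}")
    chrom, start, end, strand = m.groups()
    return chrom, int(start), int(end), strand


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def three_frames(dna):
    return [translate(dna[f:]) for f in range(3)]


def spliced_sequence(chrom_seq, donor, acceptor, flank=30):
    """Join up to `flank` bases ending at `donor` with up to `flank` bases
    starting at `acceptor` (both 1-based, plus strand assumed)."""
    upstream = chrom_seq[max(0, donor - flank):donor]
    downstream = chrom_seq[acceptor - 1:acceptor - 1 + flank]
    return upstream + downstream


# Toy locus (invented): two exon fragments separated by a short GT...AG intron
UP, INTRON, DOWN = "GAAAAAGATACTC", "GTAAGTTTTTAG", "CTCGTTATTCTTTT"
TOY = UP + INTRON + DOWN
donor, acceptor = len(UP), len(UP) + len(INTRON) + 1

spliced = spliced_sequence(TOY, donor, acceptor)
print(any("EKDTPRYSF" in p for p in three_frames(spliced)))  # True
print(any("EKDTPRYSF" in p for p in three_frames(TOY)))      # False
```

Translating the contiguous region, as in the samtools plus ExPASy check above, reads through the intron, which is why the peptide does not show up there.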

Best, Frank

spvensko commented 9 months ago

Thank you for the excellent explanation!

Another oddity I noticed:

AADVSGLPL,ENSG00000110427:E1.1-I1.1 ['HugoLo_IPRES_2016-Pt01-ar-279.Aligned.sortedByCoord.out.bed'] 1 KIAA1549L chr11:33542146-33542147(+) 0.03866191580891609 0.9999941348374569

In this case, the 5' and 3' splice sites are neighboring bases, correct?

I checked the SJ.out.tab file, but wasn't able to find any evidence for this junction:

chr11   33531115    33533544    2   2   1   1   0   43
chr11   33542147    33542920    1   1   1   2   1   22
chr11   33542147    33544766    1   1   1   11  0   43
chr11   33544337    33544766    1   1   1   18  0   47

Is this a false positive or is there a different explanation? Also, can you explain the EX1.Y1-EX2.Y2 and EX1.Y1-IX2.Y2 nomenclature (e.g. E1.1-I1.1)?

frankligy commented 9 months ago

Hi @spvensko,

Thanks for bringing this up. This is an intron retention event (intron 1), meaning intron 1 is not excised but retained in the transcript, resulting in a read-through. That is why the two coordinates differ by only one base.
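As a quick illustration, the two event IDs quoted in this thread can be split into their parts like this. The regular expression only covers the `E<id>.<segment>` and `I<id>.<segment>` forms seen above; other forms may exist and are not handled. Treating any ID with an `I` segment as intron-involving is a heuristic of this sketch, not the tool's own classification.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    kind: str        # "E" for exon, "I" for intron
    block_id: int    # exon or intron ID
    segment_id: int  # segment ID within that exon or intron


UID_RE = re.compile(r"(ENSG\d+):([EI])(\d+)\.(\d+)-([EI])(\d+)\.(\d+)")


def parse_uid(uid):
    """Split 'ENSG...:E1.1-I1.1' into (gene, left segment, right segment)."""
    m = UID_RE.fullmatch(uid)
    if m is None:
        raise ValueError(f"unsupported event ID: {uid!r}")
    gene = m.group(1)
    left = Segment(m.group(2), int(m.group(3)), int(m.group(4)))
    right = Segment(m.group(5), int(m.group(6)), int(m.group(7)))
    return gene, left, right


def involves_intron(uid):
    """Heuristic: True if either side of the event is an intron segment."""
    _, left, right = parse_uid(uid)
    return "I" in (left.kind, right.kind)


for uid in ("ENSG00000004846:E15.1-E16.1", "ENSG00000110427:E1.1-I1.1"):
    print(uid, parse_uid(uid)[1:], involves_intron(uid))
```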

It won't be reported in STAR's SJ.out.tab because, as far as I understand, that file only reports splice junctions, not intron retention. In our Supplementary Figure 1, we illustrate how we define the exon ID and segment ID (the nomenclature you asked about). I have also pasted it below, hoping it clears up some of the confusion:

In our Supplementary Figure 2, we show a benchmark that my lab mate previously ran on simulated data, comparing intron retention prediction against other tools.
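On the SJ.out.tab point, a small sketch of looking up an event in that file may help. STAR writes the first and last base of each intron (1-based) in columns 2 and 3. The sketch assumes the event coordinates name the flanking exon bases, so it shifts them inward by one before comparing. That offset is an assumption, although it matches the rows quoted above, where intron 1 starts at 33542147. An event whose two coordinates are adjacent leaves no intron in between, so no junction row can match it. The second query below uses made-up coordinates purely to show a hit.

```python
from typing import Iterable, NamedTuple, Optional


class SJ(NamedTuple):
    chrom: str
    intron_start: int   # first base of the intron, 1-based
    intron_end: int     # last base of the intron, 1-based
    strand: int         # 0 undefined, 1 plus, 2 minus
    motif: int
    annotated: int
    unique_reads: int
    multi_reads: int
    max_overhang: int


def parse_sj(lines: Iterable[str]):
    """Parse SJ.out.tab rows (whitespace-separated, nine columns)."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 9:
            continue  # skip blank or malformed lines
        rows.append(SJ(parts[0], *map(int, parts[1:])))
    return rows


def find_junction(rows, chrom, donor, acceptor) -> Optional[SJ]:
    """Look up an event given as flanking exon bases (assumed convention)."""
    start, end = donor + 1, acceptor - 1
    if start > end:
        return None  # adjacent bases: nothing is spliced out
    for sj in rows:
        if (sj.chrom, sj.intron_start, sj.intron_end) == (chrom, start, end):
            return sj
    return None


SJ_TEXT = """\
chr11   33531115    33533544    2   2   1   1   0   43
chr11   33542147    33542920    1   1   1   2   1   22
chr11   33542147    33544766    1   1   1   11  0   43
chr11   33544337    33544766    1   1   1   18  0   47
"""
rows = parse_sj(SJ_TEXT.splitlines())

# The intron retention event from this thread: no junction row can match
print(find_junction(rows, "chr11", 33542146, 33542147))
# Hypothetical query, constructed to match the second row
print(find_junction(rows, "chr11", 33542146, 33542921))
```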

Let me know if I can help answer any other questions!

Best, Frank

frankligy / SNAF

Understanding output columns #26