Xinglab / espresso

Other
48 stars 4 forks source link

Create sam matching espresso annotations #14

Open joegeorgeson opened 1 year ago

joegeorgeson commented 1 year ago

Dear espresso,

Thank you for this amazing tool!

I am hoping you can recommend how I might create an updated sam file that reflects the annotation assignment provided from --tsv_compt of ESPRESSO_Q.pl. Is this something you're willing to implement in the future?

Best, Joe

edit I also now realized that isoforms annotated as 'novel_isoform' in the updated.gtf are not reflected in the --tsv_compt, where I only see original annotations + those annotated as FSM, ISM, NCD, NIC/NNC, however with the later the compatible_isoforms is NA. So I guess what I'm after is to have an updated sam that is an accurate representation of mapping to the transcript_id in the updated.gtf ...if I create a new reference genome from the updated.gtf, is the output sam from minimap2 expected to be in agreement with espresso?

EricKutschera commented 1 year ago

It should be possible to update ESPRESSO to output a sam file that shows the final alignment used by ESPRESSO. Currently the only read level final output is the --tsv_compt output file that you mentioned. That file says which of the detected isoforms (from the updated.gtf) each read was finally assigned to. ESPRESSO can make adjustments to the input alignments before assigning reads to isoforms and those adjustments are not currently part of the final output files. There are some intermediate files like *_read_final.txt from the C step which have information about the adjustments that were made for each read. Those intermediate files could be used to create updated sam files as final outputs

The --tsv_compt file can have rows like NIC/NNC NA which mean that the read had a novel splice junction (NIC/NNC) and then the read was not assigned to any compatible isoform from update.gtf (NA). The file could also have rows like NIC/NNC ESPRESSO:chr... which mean that the read had a novel splice junction but was able to be assigned to some compatible isoform. Did all of the NIC/NNC rows in the --tsv_compt file have NA?

If you use the updated.gtf to realign with minimap2 then I think the output sam would be closer to the final alignments that ESPRESSO used, but not necessarily the same. ESPRESSO and minimap2 might make some different choices in cases where there is not a clear best choice of alignment

joegeorgeson commented 1 year ago

It should be possible to update ESPRESSO to output a sam file that shows the final alignment used by ESPRESSO.

If you can implement this it will be super helpful! This is with downstream applications in mind.

Did all of the NIC/NNC rows in the --tsv_compt file have NA?

I've confused myself a bit now - when I run against a reference of "hg38 + GENCODE_v43" I get do ESPRESSO:chrN:NNNN compatible_isoforms for all read_classification along with 'NA' in the --tsv_compt output, but if I run with "hg19 + UCSC knownGene" it's only 'NA'. Is there any requirement of the input annotation by ESPRESSO?

EricKutschera commented 1 year ago

I'll plan to work on the sam output when I have some time

I'm not sure why ESPRESSO is giving no compatible isoforms for NIC/NNC reads when run with hg19 and UCSC knownGene ESPRESSO. I think for ESPRESSO to find a compatible isoform it just needs to meet the --read_num_cutoff which defaults to 2. As long as there were at least --read_num_cutoff reads to support each junction then ESPRESSO should be able to define a novel isoform to assign that read to. The reads supporting each junction need to have "perfect" alignments around the junction they support where "perfect" means no insertions deletions or substitutions within 10nt of the junction. If all the NIC/NNC have NA for compatible isoforms then it seems like there is always some kind of sequence mismatch around the novel junctions

Was the same reference data used to align the reads and run ESPRESSO?