Open joegeorgeson opened 1 year ago
It should be possible to update ESPRESSO to output a sam file that shows the final alignment used by ESPRESSO. Currently the only read level final output is the --tsv_compt
output file that you mentioned. That file says which of the detected isoforms (from the updated.gtf) each read was finally assigned to. ESPRESSO can make adjustments to the input alignments before assigning reads to isoforms and those adjustments are not currently part of the final output files. There are some intermediate files like *_read_final.txt
from the C step which have information about the adjustments that were made for each read. Those intermediate files could be used to create updated sam files as final outputs
The --tsv_compt
file can have rows like NIC/NNC NA
which mean that the read had a novel splice junction (NIC/NNC) and then the read was not assigned to any compatible isoform from update.gtf (NA). The file could also have rows like NIC/NNC ESPRESSO:chr...
which mean that the read had a novel splice junction but was able to be assigned to some compatible isoform. Did all of the NIC/NNC rows in the --tsv_compt
file have NA?
If you use the updated.gtf to realign with minimap2 then I think the output sam would be closer to the final alignments that ESPRESSO used, but not necessarily the same. ESPRESSO and minimap2 might make some different choices in cases where there is not a clear best choice of alignment
It should be possible to update ESPRESSO to output a sam file that shows the final alignment used by ESPRESSO.
If you can implement this it will be super helpful! This is with downstream applications in mind.
Did all of the NIC/NNC rows in the --tsv_compt file have NA?
I've confused myself a bit now - when I run against a reference of "hg38 + GENCODE_v43" I get do ESPRESSO:chrN:NNNN
compatible_isoforms for all read_classification along with 'NA' in the --tsv_compt
output, but if I run with "hg19 + UCSC knownGene" it's only 'NA'. Is there any requirement of the input annotation by ESPRESSO?
I'll plan to work on the sam output when I have some time
I'm not sure why ESPRESSO is giving no compatible isoforms for NIC/NNC reads when run with hg19 and UCSC knownGene ESPRESSO. I think for ESPRESSO to find a compatible isoform it just needs to meet the --read_num_cutoff which defaults to 2. As long as there were at least --read_num_cutoff reads to support each junction then ESPRESSO should be able to define a novel isoform to assign that read to. The reads supporting each junction need to have "perfect" alignments around the junction they support where "perfect" means no insertions deletions or substitutions within 10nt of the junction. If all the NIC/NNC have NA for compatible isoforms then it seems like there is always some kind of sequence mismatch around the novel junctions
Was the same reference data used to align the reads and run ESPRESSO?
Dear espresso,
Thank you for this amazing tool!
I am hoping you can recommend how I might create an updated sam file that reflects the annotation assignment provided from
--tsv_compt
ofESPRESSO_Q.pl
. Is this something you're willing to implement in the future?Best, Joe
edit I also now realized that isoforms annotated as 'novel_isoform' in the
updated.gtf
are not reflected in the--tsv_compt
, where I only see original annotations + those annotated asFSM, ISM, NCD, NIC/NNC
, however with the later thecompatible_isoforms
is NA. So I guess what I'm after is to have an updated sam that is an accurate representation of mapping to the transcript_id in theupdated.gtf
...if I create a new reference genome from theupdated.gtf
, is the output sam fromminimap2
expected to be in agreement with espresso?