Xinglab / espresso


Understand the meaning of Espresso SJ output #37

Closed junjiemama closed 8 months ago

junjiemama commented 9 months ago

Among the output files, there are a couple of types of splice junction (SJ) files. Could you please explain the columns of the files listed below? If possible, could you also tell me a little about how these files are generated and what they could be used for? I am sorry for asking such basic questions, but I am trying to make full use of the ESPRESSO output. Thank you!

e.g. chr1_SJ_simplified_list

SJ_cluster 11475 0 0 chr1 3492124 3740774
11475 chr1:3492124:3740774:1 3492124 3740774 1 0 0 TBD TBD 2 no yes 1 0
SJ_cluster 11476 0 0 chr1 3492124 3740774
11476 chr1:3492124:3740774:1 3492124 3740774 1 0 0 TBD TBD 2 no yes 1 0
SJ_cluster 11477 0 0 chr1 3492124 3740774
11477 chr1:3492124:3740774:1 3492124 3740774 1 0 0 TBD TBD 2 no yes 1 0
SJ_cluster 11478 0 0 chr1 4562891 4563322
11478 chr1:4562891:4563322:1 4562891 4563322 1 1 1 CT AC 2 yes yes 1 0
SJ_cluster 11478 1 1 chr1 4562891 4563994
11478 chr1:4562891:4563994:1 4562891 4563994 1 0 0 TBD TBD 2 no yes 1 1

SJ_group_all.fa

chrUn_JH584304v1:7010:14083:0 SJclst:0: group:0:
AGGTTCCGAATAGCTGAGCATCATGATACGAAGCAGAAGATGTGCCAAGC
chrUn_JH584304v1:19433:20156:1 SJclst:1: group:1:
GGGAGTGCAGCCCGGGGGTCTGGGATGTGTGGCTTTGAATGATGTTGATG
chrUn_JH584304v1:19345:20219:0 SJclst:0: group:1:
CAGGGCCCTGAGCCTCCAGCTGCAGGGTTGGCTGCGATGGCAAGAACAGC
chrUn_JH584304v1:20376:24796:0 SJclst:2: group:1:
TGCAGGGTGAAGAGATGGCAGAATGAGATGGCTGTACAATTCCACCATGG
chrUn_JH584304v1:24958:26983:0 SJclst:3: group:1:
AGGGCCTTTACACACTGGAAGCACTACATGTTGCTACAGGCAGAAGAGGC

Then in each sample folder (if I have multiple samples), there is sj.list

1 chr12:72831310:72833445 chr12 72831310 72833445 1 1 m64060_200922_102352/3/ccs, m64060_200922_102352/3/ccs,
1 chr12:72837515:72839551 chr12 72837515 72839551 1 1 m64060_200922_102352/3/ccs, m64060_200922_102352/3/ccs,
1 chr12:72808405:72830456 chr12 72808405 72830456 1 1 m64060_200922_102352/3/ccs, m64060_200922_102352/3/ccs,
1 chr12:72839609:72840485 chr12 72839609 72840485 1 1 m64060_200922_102352/3/ccs, m64060_200922_102352/3/ccs,
1 chr12:72833563:72837406 chr12 72833563 72837406 1 1 m64060_200922_102352/3/ccs, m64060_200922_102352/3/ccs,

EricKutschera commented 9 months ago

Those files are only intended as intermediate files for ESPRESSO itself to use, but if you find them useful, that's great.

{chr}_SJ_simplified_list is written here: https://github.com/Xinglab/espresso/blob/v1.3.2/src/ESPRESSO_S.pl#L547

The format is the SJ_cluster line:

SJ_cluster {group_number} {sort_index} {other_sort_index} {chr} {cluster_start_coord} {cluster_end_coord}

And then 1 line per SJ in that cluster:

{group_number} {chr}:{SJ_start_coord}:{SJ_end_coord}:{strand} {SJ_start_coord} {SJ_end_coord} {strand} {number_of_perfect_read} {number_of_reads} {1st_2_nt_in_intron} {last_2_nt_in_intron} {enum} {is_putative} {is_annotated} {is_high_confidence} {sort_index}

A perfect read for a splice junction has no mismatches, insertions, or deletions around the SJ.

The {enum} is:

2 -> annotated
1 -> strand determined based on 1st and last 2 nt
0 -> strand not determined

is_putative is 1 if the SJ was seen in the input alignments.
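As an illustration of the layout described above, here is a minimal Python sketch that groups the per-SJ lines under their SJ_cluster line. It is not part of ESPRESSO: the class and function names are hypothetical, and it assumes the fields are separated by whitespace (tabs or spaces) and that no field contains a space.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SpliceJunction:
    group_number: str
    sj_id: str               # {chr}:{SJ_start_coord}:{SJ_end_coord}:{strand}
    start: int
    end: int
    strand: str
    perfect_reads: int
    total_reads: int
    intron_first_2nt: str
    intron_last_2nt: str
    strand_enum: int         # 2 annotated, 1 from intron ends, 0 undetermined
    is_putative: str
    is_annotated: str
    is_high_confidence: str
    sort_index: int


@dataclass
class SJCluster:
    group_number: str
    sort_index: str
    other_sort_index: str
    chrom: str
    start: int
    end: int
    junctions: List[SpliceJunction] = field(default_factory=list)


def parse_sj_simplified_list(lines: Iterable[str]) -> List[SJCluster]:
    """Group SJ lines under the SJ_cluster line that precedes them."""
    clusters: List[SJCluster] = []
    for raw in lines:
        f = raw.split()  # assumption: whitespace-delimited fields
        if not f:
            continue
        if f[0] == "SJ_cluster":
            if len(f) != 7:
                raise ValueError(f"unexpected SJ_cluster line: {raw!r}")
            clusters.append(SJCluster(f[1], f[2], f[3], f[4], int(f[5]), int(f[6])))
        else:
            if len(f) != 14 or not clusters:
                raise ValueError(f"unexpected SJ line: {raw!r}")
            clusters[-1].junctions.append(SpliceJunction(
                f[0], f[1], int(f[2]), int(f[3]), f[4], int(f[5]), int(f[6]),
                f[7], f[8], int(f[9]), f[10], f[11], f[12], int(f[13])))
    return clusters
```

Typical use would be `parse_sj_simplified_list(open(path))` on one `{chr}_SJ_simplified_list` file; the flag columns are kept as strings because their exact encoding is not assumed here.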

SJ_group_all.fa is written here: https://github.com/Xinglab/espresso/blob/v1.3.2/src/ESPRESSO_S.pl#L554

The format is 1 line to describe the SJ:

>{chr}:{SJ_start_coord}:{SJ_end_coord}:{strand} SJclst:{sort_index}: group:{group_number}:

and the next line is the genomic sequence 25nt leading up to the SJ and 25nt after the SJ.
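A small Python sketch of reading these header/sequence pairs might look like the following. The names are hypothetical, and two things are assumed rather than taken from the ESPRESSO source: each sequence sits on a single line, and the two flanks are of equal length, so the sequence is split at its midpoint.

```python
from typing import Iterable, Iterator, NamedTuple


class SJFlank(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    sort_index: str
    group_number: str
    upstream: str     # sequence leading up to the SJ
    downstream: str   # sequence after the SJ


def parse_sj_group_fasta(lines: Iterable[str]) -> Iterator[SJFlank]:
    """Yield one record per header/sequence pair."""
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            continue
        if header is None:
            raise ValueError("sequence line without a preceding header")
        sj_id, clst, group = header.split()
        # rsplit keeps any ':' inside the chromosome name intact
        chrom, start, end, strand = sj_id.rsplit(":", 3)
        half = len(line) // 2  # assumption: equal-length flanks
        yield SJFlank(chrom, int(start), int(end), strand,
                      clst.split(":")[1], group.split(":")[1],
                      line[:half], line[half:])
        header = None
```

For example, `list(parse_sj_group_fasta(open("SJ_group_all.fa")))` would give one record per junction, with `upstream + downstream` reproducing the original sequence line.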

sj.list is written here: https://github.com/Xinglab/espresso/blob/v1.3.2/src/ESPRESSO_S.pl#L880

The format is:

{group_number} {chr}:{SJ_start_coord}:{SJ_end_coord} {chr} {SJ_start_coord} {SJ_end_coord} {number_of_perfect_reads} {number_of_total_reads} {comma_separated_list_of_perfect_read_IDs_for_this_SJ} {comma_separated_list_of_all_read_IDs_for_this_SJ}
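To make the sj.list columns concrete, here is a hedged Python sketch that parses one line into a record. The names are hypothetical. The delimiter is an assumption: the sketch splits on tabs when the line contains any, so that an empty read-ID column would be preserved, and otherwise falls back to whitespace, which only works when both read-ID columns are non-empty.

```python
from typing import List, NamedTuple


class SJListRecord(NamedTuple):
    group_number: str
    sj_id: str                    # {chr}:{SJ_start_coord}:{SJ_end_coord}
    chrom: str
    start: int
    end: int
    n_perfect_reads: int
    n_total_reads: int
    perfect_read_ids: List[str]
    all_read_ids: List[str]


def _split_ids(column: str) -> List[str]:
    # The lists carry a trailing comma, so drop empty entries.
    return [read_id for read_id in column.split(",") if read_id]


def parse_sj_list_line(line: str) -> SJListRecord:
    text = line.rstrip("\n")
    # Assumption: tab-delimited if tabs are present, else whitespace-delimited.
    f = text.split("\t") if "\t" in text else text.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 fields, got {len(f)}: {line!r}")
    return SJListRecord(f[0], f[1], f[2], int(f[3]), int(f[4]),
                        int(f[5]), int(f[6]),
                        _split_ids(f[7]), _split_ids(f[8]))
```

One possible use is to sum `n_total_reads` per `sj_id` across sample folders to compare junction support between samples.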

junjiemama commented 8 months ago

Thank you very much Eric for your answers! It is very helpful.

Best regards, Junjie
