Exon is not exactly recognized

Hi,

Now I'm using portcullis for a published data as trial. Mapping with hisat2 is successful as shown below on IGV.

Then, I'm running portcullis, but exon seems not exactly recognized. According to the closed issue #44, I tried --threshold 1 and 0.5 and compared their results as shown below.

$ singularity exec ~/008_Singularity/maplesond-portcullis-master-latest.simg portcullis prep -o ./portcullis_out_6_SRR2313075/1-prep -t 8 ~/ genome.fa ~/006_Portcullis_1/hisat2KM/SRR2313075.bam

$ singularity exec ~/008_Singularity/maplesond-portcullis-master-latest.simg portcullis junc -t 4 ./portcullis_out_6_SRR2313075/1-prep -o ./portcullis_out_6_SRR2313075/2-junc

$ singularity exec ~/008_Singularity/maplesond-portcullis-master-latest.simg portcullis filt -t 4 --exon_gff --intron_gff --threshold 0.5 ./portcullis_out_6_SRR2313075/1-prep ./portcullis_out_6_SRR2313075/2-junc.junctions.tab -o ./portcullis_out_6_SRR2313075/3-filt_thr0.5

$ singularity exec ~/008_Singularity/maplesond-portcullis-master-latest.simg portcullis filter -t 4 --exon_gff --intron_gff --threshold 1 ./portcullis_out_6_SRR2313075/1-prep ./portcullis_out_6_SRR2313075/2-junc.junctions.tab -o ./portcullis_out_6_SRR2313075/3-filt_thr1.0

I found that 1.0 is better than 0.5, but 1.0 is still not the best, I think.

For example, chr1:154570304-154570452 is a true exon of ADAR as shown above on IGV. However, when I checked the pass.junctions.exon.gff3, the exon is not correctly recognized.

$ grep ^chr1.*$'\t'154570 3-filt_thr1.0.pass.junctions.exon.gff3 

chr1    portcullis_pass match   154569656   154570392   0.0 -   .   ID=junc_23583;Name=junc_23583;Note=cov:575|rel:575|ent:5.666|maxmmes:45|ham:7;mult=575;grp=junc_23583;src=E;Strand: NEGATIVE;Canonical?=Canonical;Score=1;NbAlignments=575;NbDistinct=136;NbReliable=575;Entropy=5.66573;MaxMMES=45;HammingDistance5=7;HammingDistance3=8;UniqueJunction=true;PrimaryJunction=true;
chr1    portcullis_pass match_part  154570304   154570392   0.0 -   .   ID=junc_23583_right;Parent=junc_23583
chr1    portcullis_pass match   154570364   154570966   0.0 -   .   ID=junc_23585;Name=junc_23585;Note=cov:547|rel:516|ent:5.807|maxmmes:45|ham:7;mult=547;grp=junc_23585;src=E;Strand: NEGATIVE;Canonical?=Canonical;Score=1;NbAlignments=547;NbDistinct=131;NbReliable=516;Entropy=5.80729;MaxMMES=45;HammingDistance5=7;HammingDistance3=8;UniqueJunction=true;PrimaryJunction=true;
chr1    portcullis_pass match_part  154570364   154570452   0.0 -   .   ID=junc_23585_left;Parent=junc_23585
chr1    portcullis_pass match_part  154570878   154570966   0.0 -   .   ID=junc_23585_right;Parent=junc_23585
chr1    portcullis_pass match   154570974   154573605   0.0 -   .   ID=junc_23587;Name=junc_23587;Note=cov:424|rel:424|ent:5.815|maxmmes:44|ham:7;mult=424;grp=junc_23587;src=E;Strand: NEGATIVE;Canonical?=Canonical;Score=1;NbAlignments=424;NbDistinct=114;NbReliable=424;Entropy=5.81478;MaxMMES=44;HammingDistance5=7;HammingDistance3=7;UniqueJunction=false;PrimaryJunction=true;
chr1    portcullis_pass match_part  154570974   154571061   0.0 -   .   ID=junc_23587_left;Parent=junc_23587
chr1    portcullis_pass match   154570987   154573782   0.0 -   .   ID=junc_23588;Name=junc_23588;Note=cov:3|rel:3|ent:0.9183|maxmmes:44|ham:7;mult=3;grp=junc_23588;src=E;Strand: NEGATIVE;Canonical?=Canonical;Score=1;NbAlignments=3;NbDistinct=3;NbReliable=3;Entropy=0.918296;MaxMMES=44;HammingDistance5=7;HammingDistance3=8;UniqueJunction=false;PrimaryJunction=false;
chr1    portcullis_pass match_part  154570987   154571061   0.0 -   .   ID=junc_23588_left;Parent=junc_23588

I think "--threshold 1" may still be too sensitive. Any help would be greatly appreciated,

Thanks,

Dear @KMgoo , thank you for reporting. The first step here is to verify whether the exon you are looking for has been mistakenly tagged as one of the "true negatives" in the first pass of portcullis. If that is the case, then that would explain why you are missing the intron.

Inside the filtering folder, there are some .tab files which list which junctions have been selected as "true negatives" (neg_layer files) and "true positives" (pos_layer).

It would be important to verify that the missing exon has not been incorrectly recognised as a false positive by the first pass. Would you please be able to check that, or even better, maybe send us the layer files within the filtering folder?

Another possibility is that the intron is recognised as potentially valid, but with a low probability. This could be verified by filtering with a lower threshold, e.g. 0.1 (a ludicrously low threshold, which I only recommend for this limited type of testing).

Thank you again for reporting!

Hi, @lucventurini, apologies for the late reply. Although I tried portcullis with several options, I could not obtain neg_layer/ pos_layer files in the filtering folder. The filtering folder is shown below. Which files should I send?

$ ls -l 3-filt_thr1.0/
合計 350012
-rw-r--r-- 1 km-nig km-medou 16827738  8月 13 09:35 3-filt_thr1.0.fail.junctions.bed
-rw-r--r-- 1 km-nig km-medou 96083847  8月 13 09:35 3-filt_thr1.0.fail.junctions.exon.gff3
-rw-r--r-- 1 km-nig km-medou 14366452  8月 13 09:35 3-filt_thr1.0.fail.junctions.intron.gff3
-rw-r--r-- 1 km-nig km-medou 40655391  8月 13 09:35 3-filt_thr1.0.fail.junctions.tab
-rw-r--r-- 1 km-nig km-medou 15421883  8月 13 09:35 3-filt_thr1.0.pass.junctions.bed
-rw-r--r-- 1 km-nig km-medou 89236184  8月 13 09:35 3-filt_thr1.0.pass.junctions.exon.gff3
-rw-r--r-- 1 km-nig km-medou 13504913  8月 13 09:35 3-filt_thr1.0.pass.junctions.intron.gff3
-rw-r--r-- 1 km-nig km-medou 41912249  8月 13 09:34 3-filt_thr1.0.pass.junctions.tab
-rw-r--r-- 1 km-nig km-medou  1270361  8月 13 09:34 3-filt_thr1.0.selftrain.forest
-rw-r--r-- 1 km-nig km-medou       42  8月 13 09:32 3-filt_thr1.0.selftrain.initialset.L95_intron_size.txt
-rw-r--r-- 1 km-nig km-medou  1360877  8月 13 09:32 3-filt_thr1.0.selftrain.initialset.neg.junctions.tab
-rw-r--r-- 1 km-nig km-medou 27728445  8月 13 09:32 3-filt_thr1.0.selftrain.initialset.pos.junctions.tab

Thank you for your kind support!!

Dear @KMgoo , thank you for the list. We would need the log, and the initialset set. We can start working from there.

Thank you!

Dear @lucventurini, thank you for your comment.

The log is shown below. I find that some initial pos/neg. layers appeared.

1 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json 2 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json 3 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json 4 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json 5 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json 6 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json 7 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json

These are the files you need? Thank you!

Portcullis V1.2.2

Running portcullis in prepare mode
----------------------------------

Created symlink from "/home/km-nig/Materials201907/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.genome.fa"
Created symlink from "/home/km-nig/Materials201907/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.genome.fa.fai"
Created symlink from "/home/km-nig/006_Portcullis_1/hisat2KM/SRR2313075.bam" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.unsorted.alignments.bam"
Existing BAM index not found.  Will create later.
Provided BAM appears to be sorted already, just creating symlink instead.
Created symlink from "/home/km-nig/006_Portcullis_1/hisat2KM/SRR2313075.bam" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam"
Indexing BAM using command "samtools index ./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam" ... done.
BAM index created at: "./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam.bai"
 - BAM Index - Wall time taken: 173.6s

Portcullis prep completed.
Total runtime: 173.7s

Portcullis V1.2.2

Running portcullis in junction builder mode
------------------------------------------

Settings:
 - BAM Strandedness: UNKNOWN
 - BAM Read Orientation: UNKNOWN
 - BAM Indexing mode: BAI
 - Threads: 16
 - Separate BAMs: false

BAM details:
 - File path: "./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam"
 - # Target sequences: 25
 - Format version: 1.0
 - Sort order: coordinate
 - Program 1: 
  - ID: hisat2
  - Name: hisat2
  - Version: 2.1.0
  - Command line: "/home/km-nig/anaconda3/envs/py3.5/bin/hisat2-align-s --wrapper basic-0 -p 64 -x /home/km-nig/Materials201908/hisat2Index_hg19_UCSC/genomeHT2KM -S SRR2313075.sam -1 SRR2313075_1.fastq -2 SRR2313075_2.fastq"

Creating 16 threads, each with BAM and genome indicies loaded ... done.
Finding junctions and calculating basic metrics:
 - Queueing 25 target sequences for processing in the thread pool
 - Processing: 
   - chrM
   - chr1
   - chr2
   - chr3
   - chr4
   - chr5
   - chr6
   - chr7
   - chr8
   - chr9
   - chr10
   - chr11
   - chr12
   - chr13
   - chr14
   - chr15
   - chr16
   - chr17
   - chr18
   - chr19
   - chr20
   - chr21
   - chr22
   - chrX
   - chrY
 - All threads completed.
 - Combining results from threads.

Sequence           unspliced         spliced           total
chrM                 8772108          127251         8899359
chr1                 9427095         3056477        12483572
chr2                 5300271         2139952         7440223
chr3                 4119539         1432539         5552078
chr4                 2131837          805519         2937356
chr5                 3719125         1613508         5332633
chr6                 5806729         2421709         8228438
chr7                 4155033         1207015         5362048
chr8                 2444822         1029636         3474458
chr9                 3232340         1190075         4422415
chr10                2661696          925370         3587066
chr11                5106894         2075143         7182037
chr12                4786878         2059713         6846591
chr13                1264791          510520         1775311
chr14                3215404         1122615         4338019
chr15                3045267         1317546         4362813
chr16                4272016         1657508         5929524
chr17                4584042         1793207         6377249
chr18                 991597          298220         1289817
chr19                5502688         3176728         8679416
chr20                1902412          657754         2560166
chr21                 731177          225426          956603
chr22                2351539          964510         3316049
chrX                 2735969         1146361         3882330
chrY                  141290           30716          172006

Sorting and reindexing merged junctions... done.

Final stats:
 - Processed 125387577 alignments.
 - Alignment query length statistics: min: 90; mean: 90; max: 90;
 - Found 336989 junctions from 32985018 spliced alignments.
 - Found 92402559 unspliced alignments.
 - Calculating junctions stats that require comparisons with other junctions... done.
 = Wall time taken: 1616.2s

Saving junctions: 
 - Saving junction table to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.tab ... done.
 - Saving junction GFF file to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.exon.gff3 ... done.
 - Saving intron GFF file to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.intron.gff3 ... done.
 - Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.bed ... done.
 = Wall time taken: 29.1s

Strand Analysis
---------------

Total Alignments:
 - R1:17431899
 - R2:17416171
Alignment counts when splice site suggests +ve strand:
 - R1+: 4454948
 - R1-: 4179655
 - R2+: 4426031
 - R2-: 4199256
Alignment counts when splice site suggests -ve strand:
 - R1+: 4254132
 - R1-: 4543164
 - R2+: 4281548
 - R2-: 4509336
Correlation of read strand to splice site strand (1.0 = complete agreement, -1.0 = complete disagreement):
 - R1+: 0.0318825
 - R1-: 0.0328546
 - R2+: 0.0262919
 - R2-: 0.0259118

Determined sequence orientation to be: Paired-End (FR): Forward Reverse (-> <-)
Determined RNAseq strandedness to be: Unstranded - can't determine transcript strand from read strand

Portcullis junc completed.
Total runtime: 1646.3s

Portcullis V1.2.2

Running portcullis in junction filter mode
------------------------------------------

Loading junctions from ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.tab ... done.
Found 336989 junctions.

Self training mode activated.

Executing python script.
Loading input junctions ... done. 336989 junctions loaded.

Creating initial positive set for training
------------------------------------------

Applying the following set of rule-based filters to create initial positive set.
1   //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer1.json
2   //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer2.json
3   //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer3.json

LAYER   PASS    FAIL
1   252666  84323
2   145242  191747
3   108150  228839
Intron size at L95 = 20130  positive set maximum intron size limit set to L95 x 1.2: 24156
4   103788  233201

Positive set contains: 103788 junctions

Saving positive set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.selftrain.initialset.pos.junctions.tab
233201 remaining for consideration as negative set

Creating initial negative set for training
------------------------------------------
Applying a set of rule-based filters to create initial negative set.
1   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json
2   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json
3   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json
4   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json
5   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json
6   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json
7   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json

LAYER   PASS    FAIL
1   2635    230566
2   11  230555
3   21  230534
4   415 230119
5   682 229437
6   0   229437
7   2686    226751
Intron size L95 = 20130 negative set will use junctions with intron size over L95 x 8: 161040 and with maxmmes < 12
8   2   226751

Negative set contains: 6452 junctions
Saving negative set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.selftrain.initialset.neg.junctions.tab

Final train set stats:
 - Positive set: 103788 junctions.
 - Negative set: 6452 junctions.
 - Others: 226749 junctions.

Executed Python script

Initial training set consists of 103788 positive and 6452 negative junctions.

Pos to neg ratio: 0.0585269

Confirming intron length L95 is: 20130
Feature learning from training set ... done.

Training Random Forest
----------------------

Oversampling negative set to balance with positive set using SMOTE
Starting Synthetic Minority Oversampling Technique (SMOTE)
Performing K Nearest Neighbour (KNN) ...  Time taken: 0.1s
SMOTE Time taken: 0.2s

Number of synthesized entries: 96780

Combining positive, negative and synthetic negative datasets.
Initialising random forest
Training
Out of box Error (OOBE): 0

Predicting valid junctions using random forest model
----------------------------------------------------

Creating feature vector
Initialising random forest
Making predictions
Threshold set at 0.5

Random Forest filtering results
-------------------------
Input contained 336989 junctions.
Output contains 281154 junctions.
Filtered out 55835 junctions.

Recalculating junction grouping and distance stats based on new junction list that passed filters ... done.

Overall results
-------------------------
Input contained 336989 junctions.
Output contains 281154 junctions.
Filtered out 55835 junctions.

Saving junctions passing filter to disk:
 - Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.tab ... done.
 - Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.exon.gff3 ... done.
 - Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.intron.gff3 ... done.
 - Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.bed ... done.
 = Wall time taken: 22.5s

Saving junctions failing filter to disk:
 - Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.tab ... done.
 - Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.exon.gff3 ... done.
 - Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.intron.gff3 ... done.
 - Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.bed ... done.
 = Wall time taken: 4.5s

Portcullis junction filter completed.
Total runtime: 196.4s

Portcullis V1.2.2

Running portcullis in junction filter mode
------------------------------------------

Loading junctions from ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.tab ... done.
Found 336989 junctions.

Self training mode activated.

Executing python script.
Loading input junctions ... done. 336989 junctions loaded.

Creating initial positive set for training
------------------------------------------

Applying the following set of rule-based filters to create initial positive set.
1   //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer1.json
2   //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer2.json
3   //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer3.json

LAYER   PASS    FAIL
1   252666  84323
2   145242  191747
3   108150  228839
Intron size at L95 = 20130  positive set maximum intron size limit set to L95 x 1.2: 24156
4   103788  233201

Positive set contains: 103788 junctions

Saving positive set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.selftrain.initialset.pos.junctions.tab
233201 remaining for consideration as negative set

Creating initial negative set for training
------------------------------------------
Applying a set of rule-based filters to create initial negative set.
1   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json
2   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json
3   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json
4   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json
5   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json
6   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json
7   //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json

LAYER   PASS    FAIL
1   2635    230566
2   11  230555
3   21  230534
4   415 230119
5   682 229437
6   0   229437
7   2686    226751
Intron size L95 = 20130 negative set will use junctions with intron size over L95 x 8: 161040 and with maxmmes < 12
8   2   226751

Negative set contains: 6452 junctions
Saving negative set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.selftrain.initialset.neg.junctions.tab

Final train set stats:
 - Positive set: 103788 junctions.
 - Negative set: 6452 junctions.
 - Others: 226749 junctions.

Executed Python script

Initial training set consists of 103788 positive and 6452 negative junctions.

Pos to neg ratio: 0.0585269

Confirming intron length L95 is: 20130
Feature learning from training set ... done.

Training Random Forest
----------------------

Oversampling negative set to balance with positive set using SMOTE
Starting Synthetic Minority Oversampling Technique (SMOTE)
Performing K Nearest Neighbour (KNN) ...  Time taken: 0.1s
SMOTE Time taken: 0.2s

Number of synthesized entries: 96780

Combining positive, negative and synthetic negative datasets.
Initialising random forest
Training
Out of box Error (OOBE): 0

Predicting valid junctions using random forest model
----------------------------------------------------

Creating feature vector
Initialising random forest
Making predictions
Threshold set at 1

Random Forest filtering results
-------------------------
Input contained 336989 junctions.
Output contains 161314 junctions.
Filtered out 175675 junctions.

Recalculating junction grouping and distance stats based on new junction list that passed filters ... done.

Overall results
-------------------------
Input contained 336989 junctions.
Output contains 161314 junctions.
Filtered out 175675 junctions.

Saving junctions passing filter to disk:
 - Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.tab ... done.
 - Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.exon.gff3 ... done.
 - Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.intron.gff3 ... done.
 - Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.bed ... done.
 = Wall time taken: 13.1s

Saving junctions failing filter to disk:
 - Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.tab ... done.
 - Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.exon.gff3 ... done.
 - Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.intron.gff3 ... done.
 - Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.bed ... done.
 = Wall time taken: 14.3s

Portcullis junction filter completed.
Total runtime: 195.7s

Dear @KMgoo , yes, the layers are exactly what we need! If you could send them over, we can check where the problem starts from.

Dear @lucventurini , here is the zipped file. Could you download it from my dropbox and analyze it? Thank you so much!! https://www.dropbox.com/s/62kro78pybvjn0y/balanced.zip?dl=0

1 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer1.json 2 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer2.json 3 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer3.json 1 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json 2 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json 3 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json 4 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json 5 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json 6 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json 7 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json

I compared the results of junction analysis and junction filtering again. I understand that the filtering of portcullis can filter exons from the list of junction analysis, but it does not add new exons to the list. I expect that portcullis identify true exons, such as chr1: 154570304-154570452 and 154570878-154571061, but these are not included in the list of the results of junction analysis. So I think the cause is probably due to junction analysis, not filtering.

Then I manually drew "exons" which were included in the result of junction analysis. I find that these true exons are recognized in small pieces. In addition, "match_part (blue arrow)" is part of true exons, but "match (red arrow)" is not.

Next, I manually drew "introns" of the result of the junction filtering. The threshold was 0.5. I think these results were excellent!

Any comments would be highly appreciated.

Dear @KMgoo , I think that there has been definitely a misunderstanding ... Portcullis indeed is a tool to identify correct introns from the data, not exons. I was operating under the misunderstanding that you were concerned that a real intron was missing. I would have probably realised that after looking at the data, but you did quite a good job at analysing the results yourself!

Many apologies for the confusion, I should have caught it out earlier.

If your interest is in the transcripts, then the usual path is to assemble transcripts and check those results. Portcullis can help by flagging transcripts that incorporate spurious junctions.

Our tool Mikado was written for this purpose.

Dear @lucventurini , thank you for your comment. I fully understand what portcullis can do.

In my project, I want to know the change in RNA splicing, such as alternative 3'ss, alternative 5'ss, exon inclusion, and exon skipping, by comparing bam files. For this purpose, which is better, MIkado or Portcullis?

Thank you a lot!!

Dear @KMgoo , unfortunately I do not think that either is perfect for the purpose.

Portcullis is capable of determining which junctions are correct and which are not, so it should provide a method to throw out spurious splice junctions. It will not be able to categorize them, but (through junctools compare) it should be able to tell you which junctions are novel in the dataset.

Mikado is capable of giving a more fine grained analysis ... of transcript models. If you were to assemble your reads into transcripts, you could use mikado compare to verify which transcripts are identical and which are not. However, most transcript assemblies are questionable - that is why Mikado exists in the first place.

Neither tool is, unfortunately, capable of characterising the splice junctions directly, in the way that you are describing.

Dear @lucventurini , thank you for your comment. I will use junctools compare and will consider how to categorize the novel junctions later. Portcullis's ability to detect junctions is reliable!!

I'm trying junctools compare now, but I have some questions. I may ask you again next time.

I will close this ticket now. I'm really thankful to you.

Sincerely, KMgoo from Japan

EI-CoreBioinformatics / portcullis

Exon is not exactly recognized #47