Closed KMgoo closed 5 years ago
Dear @KMgoo , thank you for reporting. The first step here is to verify whether the exon you are looking for has been mistakenly tagged as one of the "true negatives" in the first pass of portcullis. If that is the case, then that would explain why you are missing the intron.
Inside the filtering folder, there are some .tab
files which list which junctions have been selected as "true negatives" (neg_layer
files) and "true positives" (pos_layer
).
It would be important to verify that the missing exon has not been incorrectly recognised as a false positive by the first pass. Would you please be able to check that, or even better, maybe send us the layer files within the filtering folder?
Another possibility is that the intron is recognised as potentially valid, but with a low probability. This could be verified by filtering with a lower threshold, e.g. 0.1 (a ludicrously low threshold, which I only recommend for this limited type of testing).
Thank you again for reporting!
Hi, @lucventurini, apologies for the late reply. Although I tried portcullis with several options, I could not obtain neg_layer/ pos_layer files in the filtering folder. The filtering folder is shown below. Which files should I send?
$ ls -l 3-filt_thr1.0/
合計 350012
-rw-r--r-- 1 km-nig km-medou 16827738 8月 13 09:35 3-filt_thr1.0.fail.junctions.bed
-rw-r--r-- 1 km-nig km-medou 96083847 8月 13 09:35 3-filt_thr1.0.fail.junctions.exon.gff3
-rw-r--r-- 1 km-nig km-medou 14366452 8月 13 09:35 3-filt_thr1.0.fail.junctions.intron.gff3
-rw-r--r-- 1 km-nig km-medou 40655391 8月 13 09:35 3-filt_thr1.0.fail.junctions.tab
-rw-r--r-- 1 km-nig km-medou 15421883 8月 13 09:35 3-filt_thr1.0.pass.junctions.bed
-rw-r--r-- 1 km-nig km-medou 89236184 8月 13 09:35 3-filt_thr1.0.pass.junctions.exon.gff3
-rw-r--r-- 1 km-nig km-medou 13504913 8月 13 09:35 3-filt_thr1.0.pass.junctions.intron.gff3
-rw-r--r-- 1 km-nig km-medou 41912249 8月 13 09:34 3-filt_thr1.0.pass.junctions.tab
-rw-r--r-- 1 km-nig km-medou 1270361 8月 13 09:34 3-filt_thr1.0.selftrain.forest
-rw-r--r-- 1 km-nig km-medou 42 8月 13 09:32 3-filt_thr1.0.selftrain.initialset.L95_intron_size.txt
-rw-r--r-- 1 km-nig km-medou 1360877 8月 13 09:32 3-filt_thr1.0.selftrain.initialset.neg.junctions.tab
-rw-r--r-- 1 km-nig km-medou 27728445 8月 13 09:32 3-filt_thr1.0.selftrain.initialset.pos.junctions.tab
Thank you for your kind support!!
Dear @KMgoo , thank you for the list. We would need the log, and the initialset
set. We can start working from there.
Thank you!
Dear @lucventurini, thank you for your comment.
The log is shown below. I find that some initial pos/neg. layers appeared.
1 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer1.json 2 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer2.json 3 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer3.json
1 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json 2 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json 3 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json 4 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json 5 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json 6 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json 7 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json
These are the files you need? Thank you!
Portcullis V1.2.2
Running portcullis in prepare mode
----------------------------------
Created symlink from "/home/km-nig/Materials201907/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.genome.fa"
Created symlink from "/home/km-nig/Materials201907/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.genome.fa.fai"
Created symlink from "/home/km-nig/006_Portcullis_1/hisat2KM/SRR2313075.bam" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.unsorted.alignments.bam"
Existing BAM index not found. Will create later.
Provided BAM appears to be sorted already, just creating symlink instead.
Created symlink from "/home/km-nig/006_Portcullis_1/hisat2KM/SRR2313075.bam" to "./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam"
Indexing BAM using command "samtools index ./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam" ... done.
BAM index created at: "./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam.bai"
- BAM Index - Wall time taken: 173.6s
Portcullis prep completed.
Total runtime: 173.7s
Portcullis V1.2.2
Running portcullis in junction builder mode
------------------------------------------
Settings:
- BAM Strandedness: UNKNOWN
- BAM Read Orientation: UNKNOWN
- BAM Indexing mode: BAI
- Threads: 16
- Separate BAMs: false
BAM details:
- File path: "./portcullis_out_7_SRR2313075/1-prep/portcullis.sorted.alignments.bam"
- # Target sequences: 25
- Format version: 1.0
- Sort order: coordinate
- Program 1:
- ID: hisat2
- Name: hisat2
- Version: 2.1.0
- Command line: "/home/km-nig/anaconda3/envs/py3.5/bin/hisat2-align-s --wrapper basic-0 -p 64 -x /home/km-nig/Materials201908/hisat2Index_hg19_UCSC/genomeHT2KM -S SRR2313075.sam -1 SRR2313075_1.fastq -2 SRR2313075_2.fastq"
Creating 16 threads, each with BAM and genome indicies loaded ... done.
Finding junctions and calculating basic metrics:
- Queueing 25 target sequences for processing in the thread pool
- Processing:
- chrM
- chr1
- chr2
- chr3
- chr4
- chr5
- chr6
- chr7
- chr8
- chr9
- chr10
- chr11
- chr12
- chr13
- chr14
- chr15
- chr16
- chr17
- chr18
- chr19
- chr20
- chr21
- chr22
- chrX
- chrY
- All threads completed.
- Combining results from threads.
Sequence unspliced spliced total
chrM 8772108 127251 8899359
chr1 9427095 3056477 12483572
chr2 5300271 2139952 7440223
chr3 4119539 1432539 5552078
chr4 2131837 805519 2937356
chr5 3719125 1613508 5332633
chr6 5806729 2421709 8228438
chr7 4155033 1207015 5362048
chr8 2444822 1029636 3474458
chr9 3232340 1190075 4422415
chr10 2661696 925370 3587066
chr11 5106894 2075143 7182037
chr12 4786878 2059713 6846591
chr13 1264791 510520 1775311
chr14 3215404 1122615 4338019
chr15 3045267 1317546 4362813
chr16 4272016 1657508 5929524
chr17 4584042 1793207 6377249
chr18 991597 298220 1289817
chr19 5502688 3176728 8679416
chr20 1902412 657754 2560166
chr21 731177 225426 956603
chr22 2351539 964510 3316049
chrX 2735969 1146361 3882330
chrY 141290 30716 172006
Sorting and reindexing merged junctions... done.
Final stats:
- Processed 125387577 alignments.
- Alignment query length statistics: min: 90; mean: 90; max: 90;
- Found 336989 junctions from 32985018 spliced alignments.
- Found 92402559 unspliced alignments.
- Calculating junctions stats that require comparisons with other junctions... done.
= Wall time taken: 1616.2s
Saving junctions:
- Saving junction table to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.tab ... done.
- Saving junction GFF file to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.exon.gff3 ... done.
- Saving intron GFF file to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.intron.gff3 ... done.
- Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.bed ... done.
= Wall time taken: 29.1s
Strand Analysis
---------------
Total Alignments:
- R1:17431899
- R2:17416171
Alignment counts when splice site suggests +ve strand:
- R1+: 4454948
- R1-: 4179655
- R2+: 4426031
- R2-: 4199256
Alignment counts when splice site suggests -ve strand:
- R1+: 4254132
- R1-: 4543164
- R2+: 4281548
- R2-: 4509336
Correlation of read strand to splice site strand (1.0 = complete agreement, -1.0 = complete disagreement):
- R1+: 0.0318825
- R1-: 0.0328546
- R2+: 0.0262919
- R2-: 0.0259118
Determined sequence orientation to be: Paired-End (FR): Forward Reverse (-> <-)
Determined RNAseq strandedness to be: Unstranded - can't determine transcript strand from read strand
Portcullis junc completed.
Total runtime: 1646.3s
Portcullis V1.2.2
Running portcullis in junction filter mode
------------------------------------------
Loading junctions from ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.tab ... done.
Found 336989 junctions.
Self training mode activated.
Executing python script.
Loading input junctions ... done. 336989 junctions loaded.
Creating initial positive set for training
------------------------------------------
Applying the following set of rule-based filters to create initial positive set.
1 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer1.json
2 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer2.json
3 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer3.json
LAYER PASS FAIL
1 252666 84323
2 145242 191747
3 108150 228839
Intron size at L95 = 20130 positive set maximum intron size limit set to L95 x 1.2: 24156
4 103788 233201
Positive set contains: 103788 junctions
Saving positive set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.selftrain.initialset.pos.junctions.tab
233201 remaining for consideration as negative set
Creating initial negative set for training
------------------------------------------
Applying a set of rule-based filters to create initial negative set.
1 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json
2 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json
3 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json
4 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json
5 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json
6 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json
7 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json
LAYER PASS FAIL
1 2635 230566
2 11 230555
3 21 230534
4 415 230119
5 682 229437
6 0 229437
7 2686 226751
Intron size L95 = 20130 negative set will use junctions with intron size over L95 x 8: 161040 and with maxmmes < 12
8 2 226751
Negative set contains: 6452 junctions
Saving negative set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.selftrain.initialset.neg.junctions.tab
Final train set stats:
- Positive set: 103788 junctions.
- Negative set: 6452 junctions.
- Others: 226749 junctions.
Executed Python script
Initial training set consists of 103788 positive and 6452 negative junctions.
Pos to neg ratio: 0.0585269
Confirming intron length L95 is: 20130
Feature learning from training set ... done.
Training Random Forest
----------------------
Oversampling negative set to balance with positive set using SMOTE
Starting Synthetic Minority Oversampling Technique (SMOTE)
Performing K Nearest Neighbour (KNN) ... Time taken: 0.1s
SMOTE Time taken: 0.2s
Number of synthesized entries: 96780
Combining positive, negative and synthetic negative datasets.
Initialising random forest
Training
Out of box Error (OOBE): 0
Predicting valid junctions using random forest model
----------------------------------------------------
Creating feature vector
Initialising random forest
Making predictions
Threshold set at 0.5
Random Forest filtering results
-------------------------
Input contained 336989 junctions.
Output contains 281154 junctions.
Filtered out 55835 junctions.
Recalculating junction grouping and distance stats based on new junction list that passed filters ... done.
Overall results
-------------------------
Input contained 336989 junctions.
Output contains 281154 junctions.
Filtered out 55835 junctions.
Saving junctions passing filter to disk:
- Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.tab ... done.
- Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.exon.gff3 ... done.
- Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.intron.gff3 ... done.
- Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.pass.junctions.bed ... done.
= Wall time taken: 22.5s
Saving junctions failing filter to disk:
- Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.tab ... done.
- Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.exon.gff3 ... done.
- Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.intron.gff3 ... done.
- Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr0.5/3-filt_thr0.5.fail.junctions.bed ... done.
= Wall time taken: 4.5s
Portcullis junction filter completed.
Total runtime: 196.4s
Portcullis V1.2.2
Running portcullis in junction filter mode
------------------------------------------
Loading junctions from ./portcullis_out_7_SRR2313075/2-junc/2-junc.junctions.tab ... done.
Found 336989 junctions.
Self training mode activated.
Executing python script.
Loading input junctions ... done. 336989 junctions loaded.
Creating initial positive set for training
------------------------------------------
Applying the following set of rule-based filters to create initial positive set.
1 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer1.json
2 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer2.json
3 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer3.json
LAYER PASS FAIL
1 252666 84323
2 145242 191747
3 108150 228839
Intron size at L95 = 20130 positive set maximum intron size limit set to L95 x 1.2: 24156
4 103788 233201
Positive set contains: 103788 junctions
Saving positive set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.selftrain.initialset.pos.junctions.tab
233201 remaining for consideration as negative set
Creating initial negative set for training
------------------------------------------
Applying a set of rule-based filters to create initial negative set.
1 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json
2 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json
3 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json
4 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json
5 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json
6 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json
7 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json
LAYER PASS FAIL
1 2635 230566
2 11 230555
3 21 230534
4 415 230119
5 682 229437
6 0 229437
7 2686 226751
Intron size L95 = 20130 negative set will use junctions with intron size over L95 x 8: 161040 and with maxmmes < 12
8 2 226751
Negative set contains: 6452 junctions
Saving negative set to disk ... done. File saved to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.selftrain.initialset.neg.junctions.tab
Final train set stats:
- Positive set: 103788 junctions.
- Negative set: 6452 junctions.
- Others: 226749 junctions.
Executed Python script
Initial training set consists of 103788 positive and 6452 negative junctions.
Pos to neg ratio: 0.0585269
Confirming intron length L95 is: 20130
Feature learning from training set ... done.
Training Random Forest
----------------------
Oversampling negative set to balance with positive set using SMOTE
Starting Synthetic Minority Oversampling Technique (SMOTE)
Performing K Nearest Neighbour (KNN) ... Time taken: 0.1s
SMOTE Time taken: 0.2s
Number of synthesized entries: 96780
Combining positive, negative and synthetic negative datasets.
Initialising random forest
Training
Out of box Error (OOBE): 0
Predicting valid junctions using random forest model
----------------------------------------------------
Creating feature vector
Initialising random forest
Making predictions
Threshold set at 1
Random Forest filtering results
-------------------------
Input contained 336989 junctions.
Output contains 161314 junctions.
Filtered out 175675 junctions.
Recalculating junction grouping and distance stats based on new junction list that passed filters ... done.
Overall results
-------------------------
Input contained 336989 junctions.
Output contains 161314 junctions.
Filtered out 175675 junctions.
Saving junctions passing filter to disk:
- Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.tab ... done.
- Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.exon.gff3 ... done.
- Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.intron.gff3 ... done.
- Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.pass.junctions.bed ... done.
= Wall time taken: 13.1s
Saving junctions failing filter to disk:
- Saving junction table to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.tab ... done.
- Saving junction GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.exon.gff3 ... done.
- Saving intron GFF file to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.intron.gff3 ... done.
- Saving BED file with all junctions to: ./portcullis_out_7_SRR2313075/3-filt_thr1.0/3-filt_thr1.0.fail.junctions.bed ... done.
= Wall time taken: 14.3s
Portcullis junction filter completed.
Total runtime: 195.7s
Dear @KMgoo , yes, the layers are exactly what we need! If you could send them over, we can check where the problem starts from.
Dear @lucventurini , here is the zipped file. Could you download it from my dropbox and analyze it? Thank you so much!! https://www.dropbox.com/s/62kro78pybvjn0y/balanced.zip?dl=0
1 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer1.json 2 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer2.json 3 //portcullis/share/portcullis/balanced/selftrain_initial_pos.layer3.json 1 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer1.json 2 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer2.json 3 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer3.json 4 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer4.json 5 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer5.json 6 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer6.json 7 //portcullis/share/portcullis/balanced/selftrain_initial_neg.layer7.json
I compared the results of junction analysis and junction filtering again. I understand that the filtering of portcullis can filter exons from the list of junction analysis, but it does not add new exons to the list. I expect that portcullis identify true exons, such as chr1: 154570304-154570452 and 154570878-154571061, but these are not included in the list of the results of junction analysis. So I think the cause is probably due to junction analysis, not filtering.
Then I manually drew "exons" which were included in the result of junction analysis. I find that these true exons are recognized in small pieces. In addition, "match_part (blue arrow)" is part of true exons, but "match (red arrow)" is not.
Next, I manually drew "introns" of the result of the junction filtering. The threshold was 0.5. I think these results were excellent!
Any comments would be highly appreciated.
Dear @KMgoo , I think that there has been definitely a misunderstanding ... Portcullis indeed is a tool to identify correct introns from the data, not exons. I was operating under the misunderstanding that you were concerned that a real intron was missing. I would have probably realised that after looking at the data, but you did quite a good job at analysing the results yourself!
Many apologies for the confusion, I should have caught it out earlier.
If your interest is in the transcripts, then the usual path is to assemble transcripts and check those results. Portcullis can help by flagging transcripts that incorporate spurious junctions.
Our tool Mikado was written for this purpose.
Dear @lucventurini , thank you for your comment. I fully understand what portcullis can do.
In my project, I want to know the change in RNA splicing, such as alternative 3'ss, alternative 5'ss, exon inclusion, and exon skipping, by comparing bam files. For this purpose, which is better, MIkado or Portcullis?
Thank you a lot!!
Dear @KMgoo , unfortunately I do not think that either is perfect for the purpose.
Portcullis is capable of determining which junctions are correct and which are not, so it should provide a method to throw out spurious splice junctions. It will not be able to categorize them, but (through junctools compare
) it should be able to tell you which junctions are novel in the dataset.
Mikado is capable of giving a more fine grained analysis ... of transcript models. If you were to assemble your reads into transcripts, you could use mikado compare
to verify which transcripts are identical and which are not. However, most transcript assemblies are questionable - that is why Mikado exists in the first place.
Neither tool is, unfortunately, capable of characterising the splice junctions directly, in the way that you are describing.
Dear @lucventurini , thank you for your comment. I will use junctools compare
and will consider how to categorize the novel junctions later. Portcullis's ability to detect junctions is reliable!!
I'm trying junctools compare
now, but I have some questions. I may ask you again next time.
I will close this ticket now. I'm really thankful to you.
Sincerely, KMgoo from Japan
Hi,
Now I'm using portcullis for a published data as trial. Mapping with hisat2 is successful as shown below on IGV.
Then, I'm running portcullis, but exon seems not exactly recognized. According to the closed issue #44, I tried --threshold 1 and 0.5 and compared their results as shown below.
I found that 1.0 is better than 0.5, but 1.0 is still not the best, I think.
For example, chr1:154570304-154570452 is a true exon of ADAR as shown above on IGV. However, when I checked the pass.junctions.exon.gff3, the exon is not correctly recognized.
I think "--threshold 1" may still be too sensitive. Any help would be greatly appreciated,
Thanks,
KM