Open lucventurini opened 5 years ago
Hey @lucventurini, good to know portcullis is working on long reads. At the moment the --balanced option is disabled (it always uses balanced by default so strictly speaking it's the precise option that is disabled), so to activate it you either need to recompile, changing the last argument of line 756 https://github.com/maplesond/portcullis/blob/master/src/junction_filter.cc to true. Or if you don't want to recompile go into the data directory backup the current balanced directory and replace with the new rules.
What you suggest is easily doable though providing we have suitable rules to work with. If you have a set that works well let me know and I'll try and find some time to add the extra command line options in and necessary logic.
By the way, I suspect the reason the strand calling doesn't work well for long reads is that we currently need 95% of alignments to agree to call the read strand column, which feeds into the consensus strand column. Possibly the depth of the longs reads is meaning these can easily get set to unknown?
Or if you don't want to recompile go into the data directory backup the current balanced directory and replace with the new rules.
For the time being that is what I am doing, for the purposes of testing. Having a command line switch (and maybe even the possibility of providing a custom folder yourself) will be more elegant in the long run, but for it will be enough.
What you suggest is easily doable though providing we have suitable rules to work with. If you have a set that works well let me know and I'll try and find some time to add the extra command line options in and necessary logic.
Not yet, working on it. Not surprisingly, the "mean_mismatches" parameter is quite harmful though - practically all long junctions have at least 30 or so mismatches per read on average. Maxmmes is also quite useless for the opposite reason: practically all junctions have on both sides an anchor as long as an exon, making this a very weak filter.
By the way, I suspect the reason the strand calling doesn't work well for long reads is that we currently need 95% of alignments to agree to call the read strand column, which feeds into the consensus strand column. Possibly the depth of the longs reads is meaning these can easily get set to unknown?
I am not so sure about that. The dataset I am analysing (here: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md) is extremely well covered (each BAM has ~19 million reads). Although I am analysing only chr22 for the time being, I daresay that coverage should not be an issue for the strand detection algorithm.
Could it be that Heng Li himself broke the BAM compatibility in MiniMap2, and instead of using the XS tag for the strand, he uses the ts tag instead?
Hi @lucventurini,
That sounds logical, makes sense to reduce or eliminate the use of MaxMMES and mean mismatches for nanopore. Interesting that you have that many reads, I kind of assumed you'd have a lot less for nanopore, what kind of average depth do you have on your genome?
Regarding XS tag, that makes life easier for portcullis if present but isn't absolutely required, at least for Illumina paired end. The relevant logic is here: https://github.com/maplesond/portcullis/blob/master/lib/src/bam_alignment.cc#L93.
Dan
Hi @maplesond ,
Interesting that you have that many reads, I kind of assumed you'd have a lot less for nanopore, what kind of average depth do you have on your genome?
This is not my experiment, so I do not know whether the yield is representative. However, for this particular example, on chromosome 22 I see that junctions have an average coverage of ~50X. I am not sure I would use the nb_* metrics for this instance, though, until I verify that this high coverage is not something obtained through stacking multiple runs together.
Regarding XS tag, that makes life easier for portcullis if present but isn't absolutely required, at least for Illumina paired end. The relevant logic is here: https://github.com/maplesond/portcullis/blob/master/lib/src/bam_alignment.cc#L93.
I see. It's strange then that the read strand is not inferred correctly - maybe it's a more generic bug with single-end reads?
More in general, I have been playing around with the data, and I am seeing that the JAD metrics seem to be very predictive of the correctness of a particular junction, for long reads. E.g.:
$ j_index.correct.value_counts()
False 15471
True 3915
$ j_index[(j_index.JAD18 > 0)].correct.value_counts()
True 1985
False 275
$ j_index[(j_index.JAD18 > 0) & (j_index.suspicious == 0) & (j_index.entropy > 1)].correct.value_counts()
True 1920
False 139
$ j_index[(j_index.JAD18 > 0) & (j_index.suspicious == 0) & (j_index.entropy > 1) & (j_index.primary_junc == 1)].correct.value_counts()
True 1791
False 43
# Without the JAD metric, the precision is much worse
$ j_index[(j_index.suspicious == 0) & (j_index.entropy > 1) & (j_index.primary_junc == 1) ].correct.value_counts()
True 2676
False 537
I have been reading the documentation and the definition in the original FineSplice article, but I am still not completely clear on what these metrics represent and - more crucially - whether it would be foolhardy to use them for the positive layer. The predictive power associated with JAD values over ~10 is, very, very high though, so this could really be an easy win!
Hi @maplesond, So reading the code in junction.cc, I think I understand that to calculate the JAD score, portcullis checks the minimum distance between one of the junction borders and the first mismatch on the read. So a read with a mismatch 10 bps from the left junction dude and another 5 bps from the right junction side will increase the JAD scores from 1 to 5, but not any else. So the JAD scores are a measure of goodness of the alignment between the reads and the genomic string surrounding the junction.
Is my understanding correct?
Hi @lucventurini, yes that's basically correct. The logic essentially trims the alignment to the closest mismatch to the junctions site (either upstream or downstream). It also takes into account deviation of expected coverage around the junction. So with short reads it would be expected to drop off quick the further you are away from the junction site. With longer reads you'd expect a slower drop off.
OK, thanks for clarifying! Would you think that basing an initial filter on, say, JAD15>=3, would be sensible?
I suspect it will depend on how deep the coverage is but with long reads I guess that setting sounds reasonable.
Also I may have been wrong in my previous statement regarding trimming. It definitely takes mismatches into account but not 100% sure if it effectively trims after thinking about it. You'd need to check the code.
Hi @maplesond , apologies for not having replied earlier! I have been actively working on ideas to make the selection function with nanopore reads (see the "longreads" branch). The wall I am hitting at the moment is that the quality of data seems really bad .. on RNA reads the junction precision, off the aligner, is ~20%. This is still recoverable as I can get to ~80% precision (depending on how much I want to be biased towards junction recall). cDNA reads, on the other hand, seem like an almost desperate case. The junction precision out of the gate is about ~3% (yes, three). I can get to 30%, so a 10X increase, but I would not call it a victory yet.
In any case, now Portcullis will accept custom rules on the fly and outside of the installation folder. I am also working on getting some new metrics up to help with Nanopore reads. Please see the branch itself for details!
Hi @lucventurini, how's this going? Any changes to migrate over? I've been working on a bunch of improvements to the build system in portcullis, and fixing a few bugs at the same time. I was planning to get a new release out at some point, so would be nice to get this change in too.
Hi @maplesond, I had to put this on the backburner for a little bit while I had to work on other projects. I should be receiving some data soon, which should allow me to continue development on this.
I will let you know ASAP.
Hi Dan, I have been playing with portcullis for long reads (Nanopore) and I see some potential - however, it does require changing a bit the conditions for the self training. I can experiment on that side - it's pure Python code - but I wonder, would it be possible to have the filter command line select between different pre-sets? For example, at the moment portcullis ships with two JSON presets ("balanced" and "precise") but I cannot see any way of selecting either through the command line. So we could add a parameter to the command line like:
PS: just to reiterate: Portcullis functions out of the box with long read BAMs (I think it does not correctly call the strandedness, but it's not fundamental). It only needs some adjustment on the training rules, and the ability to choose at run time.