Open fgvieira opened 4 years ago
@fgvieira I like the idea of adding an N
to the 6bp UMIs in the file of "expected UMIs". Then you can use the new option --allow-ns=true
I implemented here: https://github.com/fulcrumgenomics/fgbio/pull/558
@fgvieira can you report back if that worked for you?
@nh13, thanks for the changes. It seems to work fine. The only thing would be that it saves the UMI with the N:
RX:Z:ACGAGCA-GAAGTGN
I don't think it will affect downstream analyses, but wouldn't it make more sense to remove the N
?
Actually, having N
s int he UMIs affect downstream analyses, namely the GroupReadsByUmi
step:
[2020/01/06 14:17:28 | GroupReadsByUmi | Info] Filtered out 65,529,916 reads that contained one or more Ns in their UMIs.
I guess we can either remove the N
s from the UMI tag and allow for different size UMIs on Paired
GroupReadsByUmi
, or also allow for UMIs with N
s on GroupReadsByUmi
.
@fgvieira I see three option:
CorrectUmis
converts all N
s to A
sGropuReadsByUmi
allows variable length tags (hard)GroupReadsByUmi
allows N
s in the UMI. We could differentiate lower case n
s versus upper case N
s, where the former is given as part of the UMI input to CorrectUmis
and the latter are masked/unknown basecalls from the sequencer. I prefer the latter. @tfenne I'd like you to weigh in here too.
If option 2 is hard then I'd also prefer option 3, but I find a bit confusing having both N
s and n
s with different meanings... can we use another character? For example:
N
: for masked/unknown bases that count as mismatchesX
: for ignored bases that do not count as mismatchesIf so, then it would also be nice if the same code was used throughout fgbio
(i.e. on CorrectUmis
).
I think it would be best to pursue @nh13's option 2 - allowing GroupReadsByUmi
to deal with variable length UMIs. I think we could accomplish this in a semi-trivial fashion just by partitioning on UMI length early in the grouping step. I.e. if you give mixed length input, it will assume that a 6bp UMI and a 7bp UMI can't possibly be the same underlying UMI, so it would split the pool by UMI length(s) and then proceed on as it does now. That seems to be, having not looked at the code recently, like it shouldn't be too hard to implement. And it would represent reality the best, which I think is always a good idea.
Looking at the code I think this could be accomplished via a refactoring of the assignUmiGroups
method in GroupReadsByUmi
. We'd probably want a more optimal implementation but I think it boils down to changing:
val rawToId = this.assigner.assign(umis)
to
val byLen = umis.groupBy(u => u.split("-").map(_.length)).values
val rawToId = byLen.flatMap(us => this.assigner.assign(us))
So we'll need to update CorrectUmis
as well. @tfenne I'm a bit busy over the next week with client work, so I'll leave this unassigned in case either of us want to pick this up.
Hi, I am facing the same problem. Any updates on this?
@nguyenjoshvo we started work on this here but have not had time to push it over the line.
@nh13 Did you have time to push the changes?
@bounlu I have not had time, as I am prioritizing client work. If you'd like to sponsor the team, I am sure I can find someone on the team to help push it over the finish line.
Dear all,
I've been trying to use
fgbio CorrectUmis
to process some data from Illumina's TSO500, but they use a mixture of 6bp and 7bp UMIs.Since
bcl2fastq
needs a fixed length UMI, we've set it to 7 but that means the 6 bps UMIs come with an extra base. I've tried setting the UMI asAACCGCN
, but that means the last base always counts as a mismatch and theN
is included in theRX
field.Would it be possible for
fgbio CorrectUmis
to deal with this kind of data? For example, one could specify the UMI as (AACCGCX
), where the last base is ignored (not counted as a mismatch) and removed from theRX
tag.thanks,