CorrectUmis: allow for different length UMIs

fgvieira commented 4 years ago

Dear all,

I've been trying to use fgbio CorrectUmis to process some data from Illumina's TSO500, but they use a mixture of 6bp and 7bp UMIs.

Since bcl2fastq needs a fixed length UMI, we've set it to 7 but that means the 6 bps UMIs come with an extra base. I've tried setting the UMI as AACCGCN, but that means the last base always counts as a mismatch and the N is included in the RX field.

Would it be possible for fgbio CorrectUmis to deal with this kind of data? For example, one could specify the UMI as (AACCGCX), where the last base is ignored (not counted as a mismatch) and removed from the RX tag.

thanks,

nh13 commented 4 years ago

@fgvieira I like the idea of adding an N to the 6bp UMIs in the file of "expected UMIs". Then you can use the new option --allow-ns=true I implemented here: https://github.com/fulcrumgenomics/fgbio/pull/558

nh13 commented 4 years ago

@fgvieira can you report back if that worked for you?

fgvieira commented 4 years ago

@nh13, thanks for the changes. It seems to work fine. The only thing would be that it saves the UMI with the N:

RX:Z:ACGAGCA-GAAGTGN

I don't think it will affect downstream analyses, but wouldn't it make more sense to remove the N?

fgvieira commented 4 years ago

Actually, having Ns int he UMIs affect downstream analyses, namely the GroupReadsByUmi step:

[2020/01/06 14:17:28 | GroupReadsByUmi | Info] Filtered out 65,529,916 reads that contained one or more Ns in their UMIs.

I guess we can either remove the Ns from the UMI tag and allow for different size UMIs on Paired GroupReadsByUmi, or also allow for UMIs with Ns on GroupReadsByUmi.

nh13 commented 4 years ago

@fgvieira I see three option:

CorrectUmis converts all Ns to As
GropuReadsByUmi allows variable length tags (hard)
GroupReadsByUmi allows Ns in the UMI. We could differentiate lower case ns versus upper case Ns, where the former is given as part of the UMI input to CorrectUmis and the latter are masked/unknown basecalls from the sequencer.

I prefer the latter. @tfenne I'd like you to weigh in here too.

fgvieira commented 4 years ago

If option 2 is hard then I'd also prefer option 3, but I find a bit confusing having both Ns and ns with different meanings... can we use another character? For example:

N: for masked/unknown bases that count as mismatches
X: for ignored bases that do not count as mismatches

If so, then it would also be nice if the same code was used throughout fgbio (i.e. on CorrectUmis).

tfenne commented 4 years ago

I think it would be best to pursue @nh13's option 2 - allowing GroupReadsByUmi to deal with variable length UMIs. I think we could accomplish this in a semi-trivial fashion just by partitioning on UMI length early in the grouping step. I.e. if you give mixed length input, it will assume that a 6bp UMI and a 7bp UMI can't possibly be the same underlying UMI, so it would split the pool by UMI length(s) and then proceed on as it does now. That seems to be, having not looked at the code recently, like it shouldn't be too hard to implement. And it would represent reality the best, which I think is always a good idea.

Looking at the code I think this could be accomplished via a refactoring of the assignUmiGroups method in GroupReadsByUmi. We'd probably want a more optimal implementation but I think it boils down to changing:

    val rawToId = this.assigner.assign(umis)

to

    val byLen = umis.groupBy(u => u.split("-").map(_.length)).values
    val rawToId = byLen.flatMap(us => this.assigner.assign(us))

nh13 commented 4 years ago

So we'll need to update CorrectUmis as well. @tfenne I'm a bit busy over the next week with client work, so I'll leave this unassigned in case either of us want to pick this up.

nguyenjoshvo commented 2 years ago

Hi, I am facing the same problem. Any updates on this?

nh13 commented 2 years ago

@nguyenjoshvo we started work on this here but have not had time to push it over the line.

bounlu commented 4 months ago

@nh13 Did you have time to push the changes?

nh13 commented 4 months ago

@bounlu I have not had time, as I am prioritizing client work. If you'd like to sponsor the team, I am sure I can find someone on the team to help push it over the finish line.

fulcrumgenomics / fgbio

CorrectUmis: allow for different length UMIs #557