A lot of what I see online suggests that a hard-coded (no-INDEL) regex is pretty common for DEL calling. Maybe we should add this as an option.
All the infrastructure is there with the CallModes enum; we would just need to add a function that assumes a perfect line-up.
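For illustration, a minimal Python sketch of what that call mode could look like (the CallModes enum is real; the pattern layout, tag lengths, and function name here are all assumptions, and the real barcode scheme would come from the library design):

```python
import re

# Hypothetical no-INDEL "exact layout" call mode: every constant region is
# matched verbatim and every tag is read from a fixed position. Any
# substitution in a constant region, or any INDEL anywhere, fails the read.
# Assumed layout: primer, then three 8-bp building-block tags with linkers.
PATTERN = re.compile(
    r"ACGTACGTAC"            # primer region, matched with zero error tolerance
    r"(?P<bb1>[ACGT]{8})"    # building block 1 tag
    r"GTT"                   # constant linker
    r"(?P<bb2>[ACGT]{8})"    # building block 2 tag
    r"GTT"
    r"(?P<bb3>[ACGT]{8})"    # building block 3 tag
)

def call_exact(read: str):
    """Return the three BB tags if the read lines up perfectly, else None."""
    m = PATTERN.search(read)
    if m is None:
        return None
    return m.group("bb1"), m.group("bb2"), m.group("bb3")
```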
I'm sure it would be way faster (we can match 10M seqs with no error tolerance in about 6 seconds ripping on 32 cores, so even a 1B-read run would only take ~10 minutes). Calling would be way faster too, since the expensive part is the sequence alignment, which we would forgo here.
This is drawn from these packages: https://github.com/Roco-scientist/NGS-Barcode-Count-C and https://github.com/sunghunbae/decode. They allow for some error correction by asking for the min Hamming distance between a failed barcode lookup and all barcodes in the set. We could do this by leveraging the Hamming codes directly (if they exist), but I worry this is just wildly expensive when the error rate is so high. It is also very possible to error your way to the wrong BB if the tags aren't Hamming encoded.
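For concreteness, a brute-force version of that fallback might look like the sketch below (function names and the tie-breaking policy are assumptions; real Hamming-encoded tags would let you decode directly instead of scanning the whole set, which is exactly why the scan gets expensive at high error rates):

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(tag: str, valid: set[str], max_dist: int = 1):
    """On a failed exact lookup, return the unique closest valid tag.

    Returns None if no valid tag is within max_dist, or if the minimum is
    ambiguous (two tags equally close) -- the case where you can error your
    way to the wrong BB when the set isn't Hamming encoded.
    """
    if tag in valid:
        return tag
    best, best_d, tied = None, max_dist + 1, False
    for cand in valid:  # O(len(valid)) per failed lookup: the expensive part
        d = hamming(tag, cand)
        if d < best_d:
            best, best_d, tied = cand, d, False
        elif d == best_d:
            tied = True
    return None if (best is None or tied) else best
```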
I'd need to run a few tests to see if this is at all faster and how much worse it is at making successful calls. Just turning off the error tolerance in the primer match (which is still more advanced than a raw match, since it still aligns) called only 20% of the reads, compared to over 65% when we used the error tolerance.
I don't think I will ever get to this, but if there were a student who wanted to pick up some software experience, I think it would be a good intro into both DEL and software dev stuff.