Related Problem
When performing large-scale sequencing, the input for certain samples, and for particular MIPs within them, can be extremely deep (many reads for a given MIP in a given sample). This occurs when controls are repeatedly sequenced and merged together. The best place to subsample to reduce depth is after UMI determination and correction. The following script does the subsampling:
https://github.com/bailey-lab/MIPTools/blob/master/src/wrangler_downsample_umi.py
However, the subsampling is random, which is not optimal; it would be preferable for it to be deterministic. Also, the UMIs with the most read support are the best sequences to keep when subsampling.
Solution Requested
Modify the downsampling script to sort UMI sequences deterministically by number of supporting reads, and then, if the number of UMI sequences exceeds the input threshold, trim off those with the lowest read support.
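To make the request concrete, here is a minimal sketch of the behavior I have in mind. It assumes the UMIs for one MIP in one sample are available as a mapping from UMI sequence to supporting read count. The function name `downsample_umis`, its arguments, and the tie-break on UMI sequence are illustrative only and are not taken from `wrangler_downsample_umi.py`.

```python
from typing import Dict, List, Tuple


def downsample_umis(
    umi_read_counts: Dict[str, int], threshold: int
) -> List[Tuple[str, int]]:
    """Keep at most `threshold` UMIs, preferring those with the most reads.

    Hypothetical helper: the input layout is an assumption, not the
    actual data structure used by the existing script.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    # Sort by read count (descending). Ties are broken on the UMI sequence
    # itself so the result never depends on input order or a random seed.
    ranked = sorted(
        umi_read_counts.items(), key=lambda item: (-item[1], item[0])
    )
    # Trim off the UMIs with the lowest read support.
    return ranked[:threshold]


example = {"ACGT": 5, "TTGA": 12, "GGCA": 5, "CATG": 1}
kept = downsample_umis(example, threshold=3)
print(kept)  # [('TTGA', 12), ('ACGT', 5), ('GGCA', 5)]
```

If the number of UMIs is at or below the threshold, everything is returned unchanged (apart from ordering), so shallow samples are unaffected.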
Describe alternatives you've considered
I am not sure there is really a justification for alternatives, unless one wants to explore the effect of suboptimally selecting UMI sequences.