bailey-lab / MIPTools

A suite of computational tools used for molecular inversion probe design, data processing, and analysis.
https://miptools.readthedocs.io
MIT License

Downsample UMI counts #40

Closed arisp99 closed 2 years ago

arisp99 commented 2 years ago

To be merged following #39.

During the wrangler app, we run a series of four MIPWrangler commands. These commands first correct barcode (UMI) counts and then cluster MIPs together. When MIPs are clustered, the UMI count for a cluster equals the sum of the UMI counts of the MIPs within that cluster. As a result, UMI counts can in some cases become very large and slow down the downstream variant calling steps. To speed up the pipeline, we can downsample the UMI counts before merging MIPs together.

This PR serves to reduce the number of UMI counts by downsampling counts above a user-defined threshold (default value: 1000). To do so, we manipulate the underlying FASTQ files that MIPWrangler relies on.

JeffAndBailey commented 2 years ago

I would default to 2k

In reply to the PR description emailed by Aris Paschalidis on Fri, May 6, 2022, 8:21 AM.


arisp99 commented 2 years ago

Following a conversation with @JeffAndBailey and @iek, this PR has been significantly updated.

Previously, our strategy for downsampling was to read each FASTQ and change the read count for each UMI. However, this approach did not solve the original problem because, in some cases, there were still MIPs with many thousands of UMIs. We now adjust the number of UMIs found for each MIP. To do so, we remove lines from the FASTQ files that MIPWrangler relies on. MIPWrangler combines the FASTQs for each UMI into one master file. We iterate over this file, and if the number of UMIs (i.e., the number of sets of four lines) is greater than our threshold, we reduce the number of UMIs (by removing sets of four lines).
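The record-level downsampling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `read_fastq_records` and `downsample_records` are hypothetical, the default of 1000 is taken from the PR description, and the sketch assumes the FASTQ lines for a single MIP have already been gathered (how MIPWrangler lays out its master file is not covered here).

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

# One FASTQ record: header, sequence, separator ('+'), quality string.
Record = Tuple[str, ...]


def read_fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group FASTQ lines into four-line records (hypothetical helper)."""
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) == 4:
            yield tuple(buffer)
            buffer = []
    if buffer:
        raise ValueError("FASTQ input is not a multiple of four lines")


def downsample_records(
    records: List[Record], threshold: int = 1000, seed: Optional[int] = None
) -> List[Record]:
    """Keep at most `threshold` records, chosen uniformly at random.

    Records at or below the threshold are returned unchanged. The
    original file order of the surviving records is preserved.
    """
    if len(records) <= threshold:
        return list(records)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), threshold))
    return [records[i] for i in keep]


# Usage: ten toy records (one per UMI), downsampled to four.
toy_lines = []
for i in range(10):
    toy_lines += [f"@umi{i}", "ACGT", "+", "IIII"]
toy_records = list(read_fastq_records(toy_lines))
kept = downsample_records(toy_records, threshold=4, seed=1)
```

Sampling indices and then sorting them keeps the surviving records in their original order, which avoids reshuffling the file when it is written back out.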

The other main change is that, when downsampling, the user may choose to select UMIs randomly weighted by their read counts. In other words, UMIs with a higher read count will have a lower probability of being removed when downsampling. This option is controlled by the -w flag within the wrangler app. By default, we randomly subset the UMIs and do not consider the read count of each UMI.
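To make the weighted option concrete, here is one way to sample UMIs without replacement so that higher read counts are more likely to be kept. This is only a sketch under assumptions: it uses the Efraimidis-Spirakis key method, which may differ from what this PR actually does, the function name `weighted_downsample` is hypothetical, and the read counts are passed in as a plain list rather than parsed from FASTQ headers.

```python
import heapq
import math
import random
from typing import List, Optional, Sequence


def weighted_downsample(
    read_counts: Sequence[int], threshold: int, seed: Optional[int] = None
) -> List[int]:
    """Return the sorted indices of the UMIs to keep.

    Each UMI gets a key log(u) / read_count with u drawn uniformly
    from (0, 1]; the `threshold` largest keys are kept. This is
    equivalent to ranking by u ** (1 / read_count), so UMIs with
    larger read counts are more likely to survive.
    """
    n = len(read_counts)
    if n <= threshold:
        return list(range(n))
    if any(count <= 0 for count in read_counts):
        raise ValueError("read counts must be positive")
    rng = random.Random(seed)
    # 1.0 - random() lies in (0, 1], which keeps log() defined.
    keys = [math.log(1.0 - rng.random()) / count for count in read_counts]
    keep = heapq.nlargest(threshold, range(n), key=keys.__getitem__)
    return sorted(keep)


# Usage: 50 UMIs with one read each and 5 UMIs with very many reads.
counts = [1] * 50 + [10**9] * 5
kept_indices = weighted_downsample(counts, threshold=5, seed=0)
```

In this toy input the five high-count UMIs are kept with overwhelming probability, whereas an unweighted subset would mostly pick single-read UMIs.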

arisp99 commented 2 years ago

Below is a table of benchmarking statistics on the run time and memory utilization for a single iteration of the wrangler app. We compare the old version of MIPTools with this branch, which includes the downsampling of UMIs.

| Description | Wall Clock Time | Memory Utilized |
| --- | --- | --- |
| MIPTools v0.4.0 | 1:27:10 | 3.72 GB |
| Default Downsampling | 1:41:56 | 3.72 GB |
| Weighted Downsampling | 1:41:10 | 3.63 GB |
| Downsampling, Threshold 500 UMIs | 1:39:36 | 3.64 GB |
| Weighted Downsampling, Threshold 500 UMIs | 1:42:13 | 3.62 GB |
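
For easier comparison, the wall clock times above can be expressed relative to the v0.4.0 baseline. The snippet below is only a convenience sketch: the values are copied from the benchmark, and `to_seconds` is a hypothetical helper rather than part of MIPTools.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Wall clock times copied from the benchmark above (single iteration).
wall_clock = {
    "MIPTools v0.4.0": "1:27:10",
    "Default Downsampling": "1:41:56",
    "Weighted Downsampling": "1:41:10",
    "Downsampling, Threshold 500 UMIs": "1:39:36",
    "Weighted Downsampling, Threshold 500 UMIs": "1:42:13",
}

baseline = to_seconds(wall_clock["MIPTools v0.4.0"])
relative = {
    name: to_seconds(time) / baseline for name, time in wall_clock.items()
}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x baseline")
```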