Downsample UMI counts

arisp99 commented 2 years ago

To be merged following #39.

During the wrangler app, we run a series of four MIPWrangler commands. These commands first correct barcode (UMI) counts and then cluster MIPs together. When we cluster the MIPs together, the UMI count for a specific cluster is equal to the sum of the UMI counts of the MIPs within that cluster. As a result, UMI counts can become very large in some cases and slow down our downstream analysis and variant calling steps. To speed up our pipeline, we can downsample the UMI counts before merging MIPs together.

This PR reduces UMI counts by downsampling counts above a user-defined threshold (default value: 1000). To do so, we manipulate the underlying FASTQ files that MIPWrangler relies on.

JeffAndBailey commented 2 years ago

I would default to 2k

On Fri, May 6, 2022, 8:21 AM Aris Paschalidis @.***> wrote:

You can view, comment on, or merge this pull request online at:

https://github.com/bailey-lab/MIPTools/pull/40

Commit Summary

22cc6b1 https://github.com/bailey-lab/MIPTools/pull/40/commits/22cc6b1206b790b0f56f05cb5c79d879e5a2cade Merge branch 'pop-cluster-frac' into downsample-umis

2f9f8e8 https://github.com/bailey-lab/MIPTools/pull/40/commits/2f9f8e86446318b503caefd40b63ce08c28cdc2d Merge branch 'pop-cluster-frac' into downsample-umis

9435c7d https://github.com/bailey-lab/MIPTools/pull/40/commits/9435c7d190dae49aadfdc721509d6dcf92e25b15 Downsample UMIs and set the threshold as an argument

41904b5 https://github.com/bailey-lab/MIPTools/pull/40/commits/41904b5cb824c7cf03f5998d0ad97668db6b42fe Fix the SWGA MIPWrangler scripts so that they work with four inputs

2375e9c https://github.com/bailey-lab/MIPTools/pull/40/commits/2375e9c2eee6db53b6c74dfcd7c6d7c6fe6faed6 Add documentation on downsampling

File Changes

(8 files https://github.com/bailey-lab/MIPTools/pull/40/files)

M MIPTools.def https://github.com/bailey-lab/MIPTools/pull/40/files#diff-277946d780e1e23fc35f59688aed200f924be8222332d40c3c73225a0bf203d6 (9)

M base_resources/MIPWrangler_scripts/runMIPWranglerCurrent.sh https://github.com/bailey-lab/MIPTools/pull/40/files#diff-cd03917e10822c2f50b0377712e4e92c81b7f55b6a760173e876d57a94ec5e52 (26)

M base_resources/MIPWrangler_scripts/runMIPWranglerSwga.sh https://github.com/bailey-lab/MIPTools/pull/40/files#diff-f197aa4b3890e2fbc4897515047c1d5bd158701bb4af3acfc1f7a400138fadb5 (5)

M base_resources/MIPWrangler_scripts/runMIPWranglerSwgaPop.sh https://github.com/bailey-lab/MIPTools/pull/40/files#diff-4b52e4ba76cc943b527a7a6905f4a3a846557da11ceb12fa17473141b408a115 (5)

M bin/runMIPWranglerCurrent.sh https://github.com/bailey-lab/MIPTools/pull/40/files#diff-4bd0476b7fba7d1b0d3f3e6ae36061b1015496b6e3298e8360faca00b6b93c79 (26)

M docs/CHANGELOG.rst https://github.com/bailey-lab/MIPTools/pull/40/files#diff-a3eb23aaf44c5fba1fabdb90f9d57bc4d0ca6ccfecc4374b45835880a3d56934 (5)

M docs/app-reference/wrangler-app.rst https://github.com/bailey-lab/MIPTools/pull/40/files#diff-9a2f615fcdd29a1719006c032669a0d34f14977a2ab631ea993c0d844bcdd931 (4)

M src/generate_wrangler_scripts.py https://github.com/bailey-lab/MIPTools/pull/40/files#diff-d33f11c492d94f50d3b1c7499c8d7c17f80cecaf062f2ce84aa8d3f781e03683 (8)

Patch Links:

https://github.com/bailey-lab/MIPTools/pull/40.patch

https://github.com/bailey-lab/MIPTools/pull/40.diff


arisp99 commented 2 years ago

Following a conversation with @JeffAndBailey and @iek, this PR has been significantly updated.

Previously, our downsampling strategy read each FASTQ and changed the read count for each UMI. However, this approach did not solve the original problem because, in some cases, there were still MIPs with many thousands of UMIs. We now adjust the number of UMIs found for each MIP instead. To do so, we remove lines from the FASTQ files that MIPWrangler relies on. MIPWrangler combines the FASTQs for each UMI into one master file. We iterate over this file, and if the number of UMIs (i.e., the number of sets of four lines) is greater than our threshold, we reduce the number of UMIs by removing sets of four lines.
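For readers who want to see the idea concretely, here is a minimal Python sketch of that step. It is not the code in this PR: the function names, the in-memory list of lines, the `seed` argument, and the choice to keep the surviving records in their original order are all assumptions made for illustration.

```python
import random
from typing import List, Optional

# One FASTQ record is a set of four lines:
# header, sequence, separator ("+"), and quality string.
FastqRecord = List[str]


def read_fastq_records(lines: List[str]) -> List[FastqRecord]:
    """Group FASTQ lines into records of four lines each."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of four")
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]


def downsample_records(
    records: List[FastqRecord],
    threshold: int = 1000,  # default taken from the PR description
    seed: Optional[int] = None,  # hypothetical, for reproducible runs
) -> List[FastqRecord]:
    """Keep at most `threshold` records, chosen at random.

    Files at or below the threshold are returned unchanged.
    """
    if len(records) <= threshold:
        return records
    rng = random.Random(seed)
    # Sample indices without replacement, then sort them so the
    # surviving records stay in their original file order.
    keep = sorted(rng.sample(range(len(records)), threshold))
    return [records[i] for i in keep]
```

Writing the result back out is then a matter of joining each kept record's four lines, so every removed UMI disappears as a whole set of four lines.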

The other main change is that, when downsampling, the user may choose to select UMIs at random weighted by their read counts. In other words, UMIs with a higher read count have a lower probability of being removed when downsampling. This option is controlled by the -w flag within the wrangler app. By default, we randomly subset the UMIs and do not consider the read count of each UMI.
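As an illustration of what weighting by read count can look like, the sketch below draws a weighted sample without replacement using Efraimidis-Spirakis keys (each item gets the key u^(1/weight) for a uniform random u, and the largest keys are kept). This is one standard technique swapped in for the example, not necessarily what the PR does. The function name, the `seed` argument, and passing read counts as a separate list are assumptions; how read counts are actually obtained from the FASTQ is not shown here.

```python
import heapq
import random
from typing import List, Optional, Sequence


def weighted_downsample(
    records: Sequence,
    read_counts: Sequence[int],  # assumed: one positive count per record
    threshold: int = 1000,
    seed: Optional[int] = None,  # hypothetical, for reproducible runs
) -> List:
    """Keep at most `threshold` records, favouring high read counts."""
    if len(records) != len(read_counts):
        raise ValueError("records and read_counts must have the same length")
    if any(count <= 0 for count in read_counts):
        raise ValueError("read counts must be positive")
    if len(records) <= threshold:
        return list(records)
    rng = random.Random(seed)
    # Efraimidis-Spirakis: key = u ** (1 / weight). A larger weight pushes
    # the key toward 1, so that record is more likely to be kept.
    keyed = [
        (rng.random() ** (1.0 / count), i)
        for i, count in enumerate(read_counts)
    ]
    keep = sorted(i for _, i in heapq.nlargest(threshold, keyed))
    return [records[i] for i in keep]
```

With read counts of 1, 1, 1000, and 1 and a threshold of 1, the third record is kept in the vast majority of draws, which matches the behaviour described for the -w option.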

arisp99 commented 2 years ago

The table below shows benchmarking statistics for the run time and memory utilization of a single iteration of the wrangler app. We compare the old version of MIPTools with this branch, which includes the downsampling of UMIs.

| Description | Wall Clock Time (h:mm:ss) | Memory Utilized |
| --- | --- | --- |
| MIPTools v0.4.0 | 1:27:10 | 3.72 GB |
| Default Downsampling | 1:41:56 | 3.72 GB |
| Weighted Downsampling | 1:41:10 | 3.63 GB |
| Downsampling, Threshold 500 UMIs | 1:39:36 | 3.64 GB |
| Weighted Downsampling, Threshold 500 UMIs | 1:42:13 | 3.62 GB |
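
To make the wall clock comparison easier to read, the short Python snippet below converts the times above to seconds and computes the difference from the v0.4.0 run. The helper names are made up for this example, and the figures are single runs, so small differences between the downsampling configurations may well be noise.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Wall clock times copied from the benchmark table.
BASELINE = "1:27:10"  # MIPTools v0.4.0
RUNS = {
    "Default Downsampling": "1:41:56",
    "Weighted Downsampling": "1:41:10",
    "Downsampling, Threshold 500 UMIs": "1:39:36",
    "Weighted Downsampling, Threshold 500 UMIs": "1:42:13",
}


def extra_seconds(baseline: str, runs: dict) -> dict:
    """Return how many seconds longer each run took than the baseline."""
    base = to_seconds(baseline)
    return {name: to_seconds(hms) - base for name, hms in runs.items()}


if __name__ == "__main__":
    base = to_seconds(BASELINE)
    for name, extra in extra_seconds(BASELINE, RUNS).items():
        print(f"{name}: +{extra} s ({100 * extra / base:.1f}% of baseline)")
```

In these runs, every downsampling configuration took roughly 14 to 17 percent longer than v0.4.0, while memory use stayed about the same.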
