CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

Get number of reads per UMI, irrespective of mapping position #487

Closed nickforino closed 2 years ago

nickforino commented 3 years ago

Hi there,

I've performed an amplicon sequencing experiment and simply want a diagnostic look into PCR duplication events in the library. Looking through the flatfile output of the group tool, I've noticed that some UMIs that are identical in sequence are assigned non-identical unique_ids (presumably because of differences in mapping?) which makes it difficult for me to get the true number of how many reads carry the same UMI. This is the important question I'm after; how many reads have the same UMI?

For example, here are some relevant lines from the flatfile:

read_id contig position gene umi umi_count final_umi final_umi_count unique_id NB551675:127:H5CLFAFX3:4:21612:24023:15475_GGGA XXX 0 NA GGGA 2935 GGGA 2935 1126 NB551675:127:H5CLFAFX3:2:11207:16465:18031_GGGA XXX 210 NA GGGA 1 GGGA 1 1737

The 'GGGA' UMI with ID 1126 was seen 2935 times, while the 'GGGA' with ID 1737 was seen once.

Should I be using a different tool to see the how many reads have identical reads? Count perhaps?

Thanks in advance for your help!

-Nick

IanSudbery commented 3 years ago

This info might be one of the things output if you ask umi_tools to --output-stats=FILE_NAME_PREFIX. The information you are after will be in the file names FILE_NAME_PREFIX_per_umi.tsv.

Couple of things to note, however -

  1. Running with stats takes significantly longer, and uses significantly more memory. We should probably separate out the different stats output, but this particular table is not the one that is demanding to produce.
  2. It is unlikely that your UMI with ID 1126 and the one with 1737 are PCR duplicates of each other, particularly with such a short UMI. With a 4 base UMI there are only 256 possible different sequences, so once your sample has more than 256 different molecules in it, molecules with necessarily receive the same UMI despite being biologically independent. So if you want a diagnostic on the total amount of PCR dupclication in a sample, just counting how many times each UMI was used, ignoring position, is not going to give the the correct answer. Excuse me if you are already aware of this, but I just thought I'd make sure.

On Thu, 9 Sept 2021 at 22:53, nickforino @.***> wrote:

Hi there,

I've performed an amplicon sequencing experiment and simply want a diagnostic look into PCR duplication events in the library. Looking through the flatfile output of the group tool, I've noticed that some UMIs that are identical in sequence are assigned non-identical unique_ids (presumably because of differences in mapping?) which makes it difficult for me to get the true number of how many reads carry the same UMI. This is the important question I'm after; how many reads have the same UMI?

For example, here are some relevant lines from the flatfile:

read_id contig position gene umi umi_count final_umi final_umi_count unique_id NB551675:127:H5CLFAFX3:4:21612:24023:15475_GGGA XXX 0 NA GGGA 2935 GGGA 2935 1126 NB551675:127:H5CLFAFX3:2:11207:16465:18031_GGGA XXX 210 NA GGGA 1 GGGA 1 1737

The 'GGGA' UMI with ID 1126 was seen 2935 times, while the 'GGGA' with ID 1737 was seen once.

Should I be using a different tool to see the how many reads have identical reads? Count perhaps?

Thanks in advance for your help!

-Nick

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/487, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJELDXYBB2ADBFGHNESBTLUBEUD7ANCNFSM5DYBXKKA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

IanSudbery commented 3 years ago

Did this work for you?

nickforino commented 3 years ago

Hi Ian, apologies for my late reply,

Yes, the solution you outlined worked perfectly, thank you!

Your warning is well received, I'll outline my thinking and hopefully it is sound: We are performing amplicon sequencing of just one target (hence why we aren't interested in position), and thus we expect that PCR bias will manifest as one (or more) of the 256 UMIs taking the lion's share of the reads. Once I looked at the UMI counts, I did not see any UMI represented more than one standard deviation from the mean, so my takeaway is that there does not exist any PCR product that dominates the reaction when generating the library.

Again, thank you for your help, Ian! Best, Nick