group --per-gene for 1.6 million nanopore reads of a PCR amplicon consumes too much memory

gordonrix commented 3 years ago

Hello! I've been developing a pipeline that involves consensus sequence generation of PCR amplicons using a custom script to extract barcodes and appending them to the read name, then umi-tools group to group the UMIs appropriately, then using another custom script to generate consensus sequences from the group output.

Because the nanopore platform produces lots of errors, reads map to slightly different positions despite being reads of the same amplicon, and I need to use a networking method from umi-tools to account for the inevitable many errors. I am using the --per-gene option so that all reads are considered to be the same position and only UMIs are used for grouping, but this should be fine because my UMI extraction step ensures that only true reads of the target amplicon make it into the umi-tools group step.

This has worked well for sequencing runs with on the order of 100,000 reads as input, but when I tried running it with a sequencing dataset that had 1.6 million reads of the 1.6 kb target gene as input (2.5 gb BAM file), it consumed all memory available, ~60 GB, only generating the following output before triggering the job to be killed.

# UMI-tools version: 1.1.1
# output generated by group -I sequences/UMI/ie4-P12_UMIextract.bam --group-out=sequences/UMI/ie4-P12_UMIgroup-log.tsv --output-bam --per-gene --gene-tag=GN --edit-distance-threshold 2 -S sequences/UMI/ie4-P12_UMIgroup.bam
# job started at Sun Mar 28 22:50:44 2021 on hpc3-14-14 -- 966bbcd2-e9ca-4442-bd13-c98837cd6ebe
# pid: 31255, system: Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64
# assigned_tag                            : None
# cell_tag                                : None
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chimeric_pairs                          : use
# chrom                                   : None
# compresslevel                           : 6
# detection_method                        : None
# filter_umi                              : None
# gene_tag                                : GN
# gene_transcript_map                     : None
# get_umi_method                          : read_id
# ignore_tlen                             : False
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 0
# method                                  : directional
# no_sort_output                          : False
# out_sam                                 : False
# output_bam                              : True
# output_unmapped                         : False
# paired                                  : False
# per_cell                                : False
# per_contig                              : False
# per_gene                                : True
# random_seed                             : None
# read_length                             : False
# short_help                              : None
# skip_regex                              : ^(__|Unassigned)
# soft_clip_threshold                     : 4
# spliced                                 : False
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='sequences/UMI/ie4-P12_UMIextract.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='sequences/UMI/ie4-P12_UMIgroup.bam' mode='w' encoding='UTF-8'>
# subset                                  : None
# threshold                               : 2
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# tsv                                     : sequences/UMI/ie4-P12_UMIgroup-log.tsv
# umi_group_tag                           : BX
# umi_sep                                 : _
# umi_tag                                 : RX
# umi_tag_delim                           : None
# umi_tag_split                           : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# unmapped_reads                          : discard
# unpaired_reads                          : use
# whole_contig                            : False
2021-03-28 22:50:44,926 INFO command: group -I sequences/UMI/ie4-P12_UMIextract.bam --group-out=sequences/UMI/ie4-P12_UMIgroup-log.tsv --output-bam --per-gene --gene-tag=GN --edit-distance-threshold 2 -S sequences/UMI/ie4-P12_UMIgroup.bam
2021-03-28 22:51:20,495 INFO Parsed 1000000 input reads

Based on some previous issues threads, I tried running it again with --method=unique, and that was successful despite being run on my local machine with only 10 gb of RAM available:

# UMI-tools version: 1.1.1
# output generated by group -I sequences/UMI/ie4-P12_UMIextract.bam --group-out=sequences/UMI/ie4-P12_UMIgroup-log.tsv --output-bam --per-gene --gene-tag=GN --method=unique -S sequences/UMI/ie4-P12_UMIgroup.bam
# job started at Mon Mar 29 10:19:22 2021 on popos2004.linuxvmimages.local -- 8b634c05-4e14-4e97-90c5-7dec6452db01
# pid: 2457, system: Linux 5.4.0-7629-generic #33~1589834512~20.04~ff6e79e-Ubuntu SMP Mon May 18 23:29:32 UTC  x86_64
# assigned_tag                            : None
# cell_tag                                : None
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chimeric_pairs                          : use
# chrom                                   : None
# compresslevel                           : 6
# detection_method                        : None
# filter_umi                              : None
# gene_tag                                : GN
# gene_transcript_map                     : None
# get_umi_method                          : read_id
# ignore_tlen                             : False
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 0
# method                                  : unique
# no_sort_output                          : False
# out_sam                                 : False
# output_bam                              : True
# output_unmapped                         : False
# paired                                  : False
# per_cell                                : False
# per_contig                              : False
# per_gene                                : True
# random_seed                             : None
# read_length                             : False
# short_help                              : None
# skip_regex                              : ^(__|Unassigned)
# soft_clip_threshold                     : 4
# spliced                                 : False
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='sequences/UMI/ie4-P12_UMIextract.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='sequences/UMI/ie4-P12_UMIgroup.bam' mode='w' encoding='UTF-8'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# tsv                                     : sequences/UMI/ie4-P12_UMIgroup-log.tsv
# umi_group_tag                           : BX
# umi_sep                                 : _
# umi_tag                                 : RX
# umi_tag_delim                           : None
# umi_tag_split                           : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# unmapped_reads                          : discard
# unpaired_reads                          : use
# whole_contig                            : False
2021-03-29 10:19:22,618 INFO command: group -I sequences/UMI/ie4-P12_UMIextract.bam --group-out=sequences/UMI/ie4-P12_UMIgroup-log.tsv --output-bam --per-gene --gene-tag=GN --method=unique -S sequences/UMI/ie4-P12_UMIgroup.bam
2021-03-29 10:20:03,452 INFO Parsed 1000000 input reads
2021-03-29 10:24:12,223 INFO Reads: Input Reads: 1606703
2021-03-29 10:24:12,224 INFO Number of reads out: 1606703, Number of groups: 904608
2021-03-29 10:24:12,224 INFO Total number of positions deduplicated: 1
2021-03-29 10:24:12,224 INFO Mean number of unique UMIs per position: 904608.00
2021-03-29 10:24:12,224 INFO Max. number of unique UMIs per position: 904608
# job finished in 289 seconds at Mon Mar 29 10:24:12 2021 -- 260.76 26.71  0.00  0.00 -- 8b634c05-4e14-4e97-90c5-7dec6452db01

My UMI is 20 nucleotides long, so even if all the UMIs were unique, this is < 0.0002% of the possible UMIs so I should be in good shape there. Seems like this is a fairly unique use case though so perhaps there is some option you know of that might be able to help reduce memory usage? I'm currently trying to run it again using a higher memory node, but long term I'd like to be able to run this step locally, so being able to keep it under ~60 gb of memory usage would be very helpful.

IanSudbery commented 3 years ago

The memory usage depends not just on the total % of UMI space used, but how densely the subpart of the space that is used is.

You currently have 904608 UMIs.

With a 20nt UMI and an edit threshold of 2, each UMI has a 6060=3600 UMI edit ball around it. If each of your 904608 UMIs were full connected to their edit ball that would mean the network had 9046083600 connections in it. Thats around 3.3 billion connections. (of course that can't actaully be true, because only when every UMI in the space is used can every UMI be fully connected, but lets assume the majority, in the centre of the network, are fully connected). If each connection takes ~40 bytyes, that gives total memory of around 121GB.

The easiest way to reduce this would be to set te edit distance threshold to 1, rather than 2. One mght argue that if nanopore is sufficiently error prone, then you will get reads with 2 errors in them. This is undoubtly true. But my guess is that you would also get the 1 edit intermediates in higher number, and so the networks would still build.

The edit ball around an UMI with an edit distance of 1 is 60 times smaller than the edit distance around an UMI with an edit of 2.

gordonrix commented 3 years ago

Thanks for the fast and thorough explanation!

It's currently running with an edit distance of 2 and available memory of ~2x your estimated usage. Hopefully that will complete and I will then be able to compare to an edit distance of 1 and decide if it's worth doing 2 in the future.

One thing I'm a bit unclear on though is the assumption that the majority of UMIs in the center of the network are connected. With how sparsely I should be sampling the UMI space (there should only be ~100,000 true UMIs, but we'll see how that estimation holds up), I would expect that only about 1/10 of the UMIs are fully connected. I suppose with some error in my library prep or assumptions of memory usage, that could still reasonably use up 60GB?

Haven't looked into this much, but do you know of an easy way to track memory usage on a cluster?

gordonrix commented 3 years ago

Once my job(s) finish, I'll post results from running with an edit distance of 1, as well as 2 if that works, and close the issue.

IanSudbery commented 3 years ago

Sorry, that was a worst case calculation.

TomSmithCGAT commented 8 months ago

Closing due to inactivity.

CGATOxford / UMI-tools

group --per-gene for 1.6 million nanopore reads of a PCR amplicon consumes too much memory #466