Use pre-compiled regex to accelerate matching

erikyao commented 2 years ago

src/utils/hgvs.py uses re module a lot. If we compile the regex patterns in advance, it will save a lot of time when parsing the documents where this hgvs.py is used.

I have made a minimal working example (in jupyter notebook) here:

# Generate a sample of random sequences

import random
random.seed(1002)

def random_seq(length):
    return ''.join(random.choice('CGTA') for _ in range(length))

seq_list = []  # 100 sequences in total 
for length in range(1, 11):
    seq_sublist = [random_seq(length) for _ in range(10)]
    seq_list += seq_sublist

# Target functions: un-compiled vs compiled

def match_1(seq_list):
    for seq in seq_list:
        re.match('^[ACGTN]+$', seq)

SEQ_PATTERN = re.compile('^[ACGTN]+$')
def match_2(seq_list):
    for seq in seq_list:
        # re.match(SEQ_PATTERN, seq)  # actually slower than re.match('^[ACGTN]+$', seq)
        SEQ_PATTERN.match(seq)

# Performance Comparison

%timeit -n 100 -r 100 match_1(seq_list)
86.9 µs ± 2.63 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)

%timeit -n 100 -r 100 match_2(seq_list)
35.4 µs ± 1.4 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)

The compiled version is 60% faster than the un-compiled one.

This ^[ACGTN]+$ pattern is matched in the get_hgvs_from_vcf() function, which is further used by a few parsers. E.g. consider the gnomad parser. We have 1,055,643,939 gnomad documents. If each document matches the above pattern twice, we can reduce the matching time from 30 minutes to 12.

# Time Elapsed

1055643939 * 2 / 100 * 86.9 / (10**6) / 60  # us to minute
30.5784860997

1055643939 * 2 / 100 * 35.4 / (10**6) / 60  # us to minute
12.4565984802

erikyao commented 2 years ago

zcqian commented 2 years ago

I wonder if it would be of significant help. Python's re will compile and cache things automatically.

https://github.com/python/cpython/blob/2cd268a3a9340346dd86b66db2e9b428b3f878fc/Lib/re.py#L187-L191

https://github.com/python/cpython/blob/2cd268a3a9340346dd86b66db2e9b428b3f878fc/Lib/re.py#L288-L295

erikyao commented 2 years ago

@zcqian thanks for the information! I found I made a mistake building up the tests. Additionally, I found that re.match(compiled_pattern, string) is actually slower. We should switch to compiled_pattern.match(string) which reduces the cache lookup overhead. Please find the updated results above.

zcqian commented 2 years ago

still a pretty significant speedup, nice.

biothings / myvariant.info

Use pre-compiled regex to accelerate matching #139