Closed erikyao closed 2 years ago
Related Issue: https://github.com/biothings/myvariant.info/issues/116
I wonder if it would be of significant help. Python's `re` module will compile and cache things automatically.
https://github.com/python/cpython/blob/2cd268a3a9340346dd86b66db2e9b428b3f878fc/Lib/re.py#L187-L191
https://github.com/python/cpython/blob/2cd268a3a9340346dd86b66db2e9b428b3f878fc/Lib/re.py#L288-L295
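A minimal sketch of the caching behaviour described in the linked source. It relies on a CPython implementation detail (the internal pattern cache in `re`), so treat the identity check as an illustration rather than a documented guarantee. The sample string is made up.

```python
import re

PATTERN = r"^[ACGTN]+$"

# In CPython, re.compile() and the module-level helpers such as re.match()
# share an internal cache keyed on the pattern string and flags.
first = re.compile(PATTERN)
second = re.compile(PATTERN)

# Both calls hand back the same cached pattern object
# (an implementation detail, not a documented guarantee).
same_object = first is second

# re.match() with a string pattern reuses the cached object as well, so the
# pattern is not re-parsed on every call; only the cache lookup is repeated.
matched = re.match(PATTERN, "ACGTN") is not None
```

So the un-compiled calls do not pay the full compilation cost each time; what remains is the per-call cache lookup.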
@zcqian thanks for the information! I found I made a mistake building up the tests. Additionally, I found that `re.match(compiled_pattern, string)` is actually slower. We should switch to `compiled_pattern.match(string)`, which reduces the cache lookup overhead. Please find the updated results above.
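For anyone who wants to reproduce the comparison, here is a rough sketch of the three call styles timed with `timeit`. It is not the notebook from this issue; the sample string, the function names, and the iteration count are assumptions, and absolute numbers will vary by machine and Python version.

```python
import re
import timeit

PATTERN = r"^[ACGTN]+$"
COMPILED = re.compile(PATTERN)
SAMPLE = "ACGTNACGTN"  # hypothetical allele string, for illustration only


def match_uncompiled(s):
    # String pattern: re.match() looks the pattern up in its cache on every call.
    return re.match(PATTERN, s) is not None


def match_via_module(s):
    # Compiled pattern passed to the module-level function: still goes
    # through the module's lookup path before matching.
    return re.match(COMPILED, s) is not None


def match_via_method(s):
    # Method on the compiled pattern: no module-level lookup at all.
    return COMPILED.match(s) is not None


def time_variants(number=100_000):
    """Return the total seconds taken by each variant for `number` calls."""
    variants = {
        "re.match(str, s)": match_uncompiled,
        "re.match(compiled, s)": match_via_module,
        "compiled.match(s)": match_via_method,
    }
    return {
        name: timeit.timeit(lambda f=func: f(SAMPLE), number=number)
        for name, func in variants.items()
    }


if __name__ == "__main__":
    for name, seconds in time_variants().items():
        print(f"{name}: {seconds:.4f} s")
```

All three variants return the same result for a given input; only the per-call overhead differs.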
still a pretty significant speedup, nice.
`src/utils/hgvs.py` uses the `re` module a lot. If we compile the regex patterns in advance, it will save a lot of time when parsing the documents where this `hgvs.py` is used.
I have made a minimal working example (in a Jupyter notebook) here:
The compiled version is 60% faster than the un-compiled one.
This `^[ACGTN]+$` pattern is matched in the `get_hgvs_from_vcf()` function, which is further used by a few parsers. E.g. consider the gnomad parser. We have 1,055,643,939 gnomad documents. If each document matches the above pattern twice, we can reduce the matching time from 30 minutes to 12 minutes.
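A minimal sketch of what compiling the pattern in advance could look like. The helper names `is_plain_allele` and `both_alleles_plain` are hypothetical and this is not the actual code in `hgvs.py`; it also assumes the two matches per document are one for the ref allele and one for the alt allele.

```python
import re

# Compiled once at import time instead of being looked up on every call.
_ALLELE_PATTERN = re.compile(r"^[ACGTN]+$")


def is_plain_allele(allele):
    """Hypothetical helper: True if the allele consists only of A, C, G, T, N."""
    return _ALLELE_PATTERN.match(allele) is not None


def both_alleles_plain(ref, alt):
    """Two matches per document (assumed here to be ref and alt)."""
    return is_plain_allele(ref) and is_plain_allele(alt)
```

For example, `both_alleles_plain("A", "ACGTN")` is true, while a symbolic allele such as `"<DEL>"` is rejected.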