Open dgruano opened 1 month ago
Good idea. Is there a solution out there that we can use?
I found this online: https://www.biostars.org/p/298791/
and this: https://benchling.engineering/building-a-regex-search-engine-for-dna-e81f967883d3
Since we currently use regex to find possible enzyme cut sites (and the sequences are relatively short compared to a large database), I wrote a small function that parses a PAM containing ambiguous nucleotides and builds a regex from it. That regex is then passed to the one used for the search.
And Biopython has the dictionaries with the ambiguous nucleotides, so I'm taking advantage of that!
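For anyone following along, here is a minimal sketch of what such a PAM-to-regex helper could look like. It is not the code from this repository: the names `IUPAC_DNA`, `pam_to_regex`, and `find_pam_sites` are hypothetical, and the IUPAC table is hard-coded so the snippet runs without dependencies. In practice the same mapping could probably be taken from Biopython (`Bio.Data.IUPACData.ambiguous_dna_values`).

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
# Hard-coded for this sketch (assumption: the real code reads it from Biopython).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def pam_to_regex(pam: str) -> str:
    """Translate an IUPAC PAM such as 'NNGRRT' into a regex pattern string."""
    parts = []
    for symbol in pam.upper():
        if symbol not in IUPAC_DNA:
            raise ValueError(f"Unknown IUPAC nucleotide code: {symbol!r}")
        bases = IUPAC_DNA[symbol]
        # A single base is emitted literally; several bases become a character class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_pam_sites(sequence: str, pam: str) -> list:
    """Return 0-based start positions of every PAM match, overlapping ones included."""
    # The lookahead consumes no characters, so overlapping sites are all reported.
    pattern = re.compile(f"(?=({pam_to_regex(pam)}))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


print(pam_to_regex("NGG"))     # [ACGT]GG
print(pam_to_regex("NNGRRT"))  # [ACGT][ACGT]G[AG][AG]T
print(find_pam_sites("ACGGGT", "NGG"))  # [1, 2]
```

Expanding `N` to `[ACGT]` rather than `.` keeps the pattern from matching non-nucleotide characters, and the lookahead is one way to catch overlapping PAM sites that a plain `finditer` would skip.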
OK, that seems like a good solution.
Streptococcus pyogenes Cas9 has an NGG PAM, but Staphylococcus aureus Cas9 has an NNGRRT PAM. N means any nucleotide, while R means A or G. We should support this notation so users can enter a human-readable PAM sequence in IUPAC notation without having to write the corresponding regex themselves.
Additional note: I would also change the current notation `.GG` to `NGG` for the same purpose.