[CRISPR module] Support IUPAC notation in PAMs

BjornFJohansson / pydna

Clone with Python! Data structures for double stranded DNA & simulation of homologous recombination, Gibson assembly, cut & paste cloning.


[CRISPR module] Support IUPAC notation in PAMs #280

Open dgruano opened 1 month ago

dgruano commented 1 month ago

Streptococcus pyogenes Cas9 has an NGG PAM, but Staphylococcus aureus Cas9 has an NNGRRT PAM. While N means any nucleotide, R means A or G. We should support this notation so users can input a human-readable PAM sequence in IUPAC notation without having to write the corresponding regex themselves.

Additional note: I would also change the current notation .GG to NGG for the same purpose.

BjornFJohansson commented 1 month ago

Good idea. Is there a solution out there that we can use?

I found this online: https://www.biostars.org/p/298791/

and this: https://benchling.engineering/building-a-regex-search-engine-for-dna-e81f967883d3

dgruano commented 1 month ago

Given that right now we use regex to find possible enzyme cuts (and that sequences are relatively short compared to a large database), I created a small function to parse a PAM with ambiguous nucleotides and build a regex. This regex is then fed to the one used for search.

And Biopython has the dictionaries with the ambiguous nucleotides, so I'm taking advantage of that!
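
To illustrate the idea, here is a minimal sketch and not the actual pydna code. The name `pam_to_regex` is hypothetical, and the fallback table is an assumption added only so the snippet runs without Biopython installed. It looks up each PAM letter in Biopython's `ambiguous_dna_values` dictionary and turns ambiguous codes into regex character classes.

```python
import re

try:
    # Biopython ships the IUPAC ambiguity table mentioned above.
    from Bio.Data.IUPACData import ambiguous_dna_values
except ImportError:
    # Assumed fallback: the standard IUPAC DNA codes, so the sketch runs without Biopython.
    ambiguous_dna_values = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
        "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    }


def pam_to_regex(pam):
    """Hypothetical helper: translate an IUPAC PAM such as 'NNGRRT' into a regex string."""
    parts = []
    for code in pam.upper():
        # Sort the bases so the output does not depend on the table's internal ordering.
        bases = "".join(sorted(ambiguous_dna_values[code]))
        # Unambiguous codes stay literal; ambiguous codes become a character class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


print(pam_to_regex("NGG"))     # [ACGT]GG
print(pam_to_regex("NNGRRT"))  # [ACGT][ACGT]G[AG][AG]T

# The resulting pattern can be used like any other regex.
match = re.search(pam_to_regex("NNGRRT"), "ACGTTTGAGTCC")
print(match.start(), match.group())  # 4 TTGAGT
```

With this approach, the existing `.GG` pattern and a user-supplied `NGG` would match the same sites on sequences that contain only A, C, G and T.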

BjornFJohansson commented 1 month ago

OK, that seems like a good solution.