Improve typing of functions in 'crispr' module

dgruano commented 8 months ago

I was playing around with the crispr module and came across a weird error where the cut coordinates of a cas9 object were way larger than the target sequence.

from pydna.dseqrecord import Dseqrecord
from pydna.crispr import cas9

guide = Dseqrecord("GTTACTTTACCCGACGTCCC")
target = Dseqrecord("GTTACTTTACCCGACGTCCCaGG")

# Create an enzyme object with the guide RNA
enzyme = cas9(str(guide.seq))

# Search for a cutsite in the target sequence
print(enzyme.search(target))  # prints [148] (should be 18)
print(len(target))  # prints 23

The problem was that I was passing a Dseqrecord object and not a string. I am not very familiar yet with the rest of pydna so do most functions require a string or a Dseq / Dseqrecord object? Should we check the input type within the functions or add type hinting?
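A minimal sketch of the input check being asked about, under stated assumptions: the helper name `as_plain_sequence` is hypothetical and not part of pydna, and it only assumes that record-like objects expose their sequence through a `.seq` attribute that can be passed to `str()`, as done with `str(guide.seq)` above.

```python
def as_plain_sequence(target) -> str:
    """Return the sequence as a plain string (hypothetical helper, not pydna API).

    Accepts either a str or any object with a `.seq` attribute.
    """
    if isinstance(target, str):
        return target
    # Assumption: record-like objects keep their sequence in `.seq`.
    seq = getattr(target, "seq", None)
    if seq is None:
        raise TypeError(
            f"expected str or an object with a .seq attribute, got {type(target).__name__}"
        )
    return str(seq)
```

Calling something like this at the top of a search function would make a wrong input type fail loudly, or be converted, instead of silently returning an unexpected coordinate.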

Let me know if I can help.

BjornFJohansson commented 8 months ago

Hi and thanks for your interest in pydna. I have been busy with this year's round of grant proposals; normally I try to respond quicker.

The crispr module right now is a minimally working example. I think the way to go here is to specify something that intuitively describes a linear ssDNA molecule. In pydna, Dseq and Dseqrecords are used for dsDNA. I think better type hinting at the least and perhaps accepting pydna.seqrecord.SeqRecord would make sense?

manulera commented 2 months ago

Hi @dgruano maybe you want to give a go at this one in the Hackathon?

manulera commented 2 months ago

Related to #257

dgruano commented 2 months ago

Yes, I was counting on doing that!

(actually I would swear I had tagged this issue on #257 yesterday...)

manulera commented 2 months ago

A nice followup to this is the documentation: https://github.com/BjornFJohansson/pydna/issues/259

hiyama341 commented 2 months ago

I also have some ideas that would be cool to implement if you wanna team up for the hackathon @dgruano :)

dgruano commented 2 months ago

I'm all ears!

hiyama341 commented 1 month ago

Hi @dgruano, so some of the things I was thinking of incorporating are:

Off-target counter as a method. I have a script that does this, which people usually ask for first thing if they do CRISPR experiments. Here we could add seed length as an argument. Also incorporating something like this: https://github.com/secondarymetabolites/nearmiss would be nice in terms of finding substitutions to have even fewer off-target effects.
Other Cas systems would be nice to have, e.g. Cas12a, Cas3, Cas13. There are common themes in how they work, but they still differ in where the PAM is, etc. (I also have some scripts for this.)
CRISPR-BEST integration (I have some scripts for this, but check out this cool method here https://pubs.acs.org/doi/full/10.1021/acssynbio.3c00188 ). Sequence context is quite important, i.e. what comes before a cytosine etc., if you want to have successful experiments every time, and hardcoding this into pydna would be amazing (check it out here: https://www.nature.com/articles/nbt.4199)
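As a rough illustration of the off-target counter with a seed-length argument, here is a minimal sketch. The function names are hypothetical, it assumes an NGG PAM directly 3' of the protospacer, and it only counts exact seed matches (no mismatch tolerance), so it is a starting point rather than a real off-target scorer.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_seed_matches(guide: str, genome: str, seed_length: int = 12) -> int:
    """Count sites on both strands where the seed is followed by an NGG PAM.

    The seed is taken as the last `seed_length` bases of the guide (the
    PAM-proximal end). The intended on-target site is included in the count.
    """
    if not 0 < seed_length <= len(guide):
        raise ValueError("seed_length must be between 1 and len(guide)")
    seed = guide.upper()[-seed_length:]
    # Lookahead so that overlapping sites are all counted.
    pattern = re.compile(f"(?={re.escape(seed)}[ACGT]GG)")
    genome = genome.upper()
    strands = (genome, reverse_complement(genome))
    return sum(len(pattern.findall(strand)) for strand in strands)
```

With the guide and target from the top of this issue, `count_seed_matches(guide, target)` counts the single on-target site; any value above one would point to additional seed matches.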

These were just some preliminary thoughts. Looking forward to hearing what you think. :)

dgruano commented 1 month ago

Those are really good suggestions! Maybe we could compile a list of enzymes and methods with appropriate references and then detail the needed steps (e.g. Cas12 is just creating a new enzyme class, but CRISPR-BEST may need new functions). Something like:

Feature	Type	Reference
Cas12 / Cpf1	New enzyme	https://www.cell.com/cell/fulltext/S0092-8674(15)01200-3
Alternative Cas9	New enzyme	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4393360
Analyze sequence context	New feature	https://www.nature.com/articles/nbt.4199
Genome editing	New feature	https://pubs.acs.org/doi/full/10.1021/acssynbio.3c00188 and here
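To make the "new enzyme" rows more concrete, here is a standalone sketch of how an enzyme could be described by its PAM and the side of the protospacer the PAM sits on. `CasSpec` and its fields are hypothetical names for illustration, not the existing pydna classes; the PAMs used (NGG on the 3' side for SpCas9, TTTV on the 5' side for Cas12a) are the commonly cited ones.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CasSpec:
    """Hypothetical enzyme description: PAM pattern and PAM position."""
    name: str
    pam: str       # regular expression matching the PAM
    pam_side: str  # "3prime": PAM after the protospacer, "5prime": before it

    def __post_init__(self):
        if self.pam_side not in ("3prime", "5prime"):
            raise ValueError("pam_side must be '3prime' or '5prime'")

    def find_sites(self, protospacer: str, target: str) -> list:
        """Start index of each protospacer match on the forward strand."""
        spacer = re.escape(protospacer.upper())
        if self.pam_side == "3prime":
            pattern = f"(?=({spacer})(?:{self.pam}))"
        else:
            pattern = f"(?=(?:{self.pam})({spacer}))"
        return [m.start(1) for m in re.finditer(pattern, target.upper())]


SPCAS9 = CasSpec("SpCas9", "[ACGT]GG", "3prime")
CAS12A = CasSpec("Cas12a", "TTT[ACG]", "5prime")
```

In this sketch, adding an enzyme is one more `CasSpec` instance; cut positions are deliberately left out because they differ per enzyme and would need the references in the table.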

I am unsure how you would use nearmiss to limit off-targets. Can you elaborate on what you were thinking? I will certainly give it a look for my other suggestion in #267!

dgruano commented 1 month ago

Other possible features:

Near PAM-less / PAM-flexible enzymes

The CRISPR module should also support those Cas enzymes that have more than one PAM. For this, we have to:

Support ambiguous nucleotide notation (IUPAC notation) in the PAM sequence.
Convert this into all the compatible PAMs.
Change the way the search regexp is compiled to support multiple PAMs, or allow the cas object to return several objects.
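The steps above can be sketched as follows. This is a standalone illustration using the standard IUPAC nucleotide codes; the function names `expand_pam` and `pam_regex` are hypothetical and do not reflect how the crispr module currently compiles its regexp.

```python
import re
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_pam(pam: str) -> list:
    """List every concrete PAM compatible with an IUPAC-coded PAM."""
    choices = (IUPAC[base] for base in pam.upper())
    return ["".join(combo) for combo in product(*choices)]


def pam_regex(pams) -> "re.Pattern":
    """Compile one regexp that matches any of several IUPAC-coded PAMs."""
    alternatives = []
    for pam in pams:
        parts = (
            base if len(IUPAC[base]) == 1 else f"[{IUPAC[base]}]"
            for base in pam.upper()
        )
        alternatives.append("(?:" + "".join(parts) + ")")
    return re.compile("|".join(alternatives))
```

For example, `expand_pam("NGG")` gives the four PAMs AGG, CGG, GGG and TGG, while `pam_regex(["NGG", "NAG"])` keeps a single compiled pattern, which avoids enumerating every PAM for near PAM-less enzymes.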

PAM site search

Taking advantage of Dseq.get_cutsites(), we could check all possible PAMs with the currently implemented Cas enzymes (or those enzymes in the collection of the user). We could add a constant crispr.CAS_ENZYMES in the module.
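A rough sketch of what such a PAM scan could look like on a plain string. The `CAS_ENZYMES` mapping below is a hypothetical stand-in for the proposed constant (enzyme name to PAM regexp), it does not use `Dseq.get_cutsites()`, and it only scans the forward strand.

```python
import re

# Hypothetical stand-in for the proposed constant: enzyme name -> PAM regexp.
CAS_ENZYMES = {
    "SpCas9": "[ACGT]GG",
    "Cas12a": "TTT[ACG]",
}


def find_pam_sites(sequence: str, enzymes=None) -> dict:
    """Map each enzyme name to the start indices of its PAMs (forward strand)."""
    enzymes = CAS_ENZYMES if enzymes is None else enzymes
    sequence = sequence.upper()
    return {
        name: [m.start() for m in re.finditer(f"(?={pam})", sequence)]
        for name, pam in enzymes.items()
    }
```

Passing a custom `enzymes` mapping would cover the case of a user-defined collection.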

On-target and off-target scores

I'm not very knowledgeable in this respect, but this could be a nice addition for the designed guides. Some references are:

On-Target

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4744125/

Off-Target

dgruano commented 1 month ago

I totally missed this one:

Support for base editors

This is related to something we want to do in ShareYourCloning. We could achieve this like:

Create a subclass of the cas enzyme that cannot cut (only target). We could add the base editing functionality inside it or attach a BaseEditor object. I don't know how modular the base editors are (i.e. if we can combine different cas enzymes with distinct PAMs and scaffolds together with different editing enzymes).
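A minimal sketch of the "attach a BaseEditor object" option, under loud assumptions: `BaseEditorSketch` is a hypothetical class, the default C-to-T edit and the window of protospacer positions 4 to 8 (counting from the PAM-distal end) are illustrative defaults that would need to be set per editor, and targeting is left to whatever cas object it is attached to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseEditorSketch:
    """Hypothetical non-cutting editor that only edits inside a window."""
    target_base: str = "C"
    edited_base: str = "T"
    # Assumed default: 1-based protospacer positions, PAM-distal end = 1.
    window: tuple = (4, 8)

    def editable_positions(self, protospacer: str) -> list:
        """1-based positions of the target base inside the editing window."""
        start, end = self.window
        return [
            pos
            for pos, base in enumerate(protospacer.upper(), start=1)
            if start <= pos <= end and base == self.target_base
        ]

    def edit(self, protospacer: str) -> str:
        """Protospacer with every editable base converted."""
        editable = set(self.editable_positions(protospacer))
        return "".join(
            self.edited_base if pos in editable else base
            for pos, base in enumerate(protospacer.upper(), start=1)
        )
```

Keeping the editor separate from the targeting enzyme is one way to test how modular the combinations are: the same editor object could be paired with enzymes that have different PAMs.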

hiyama341 commented 1 month ago

Cool suggestions @dgruano!

Regarding base editing, this is something I worked with in StreptoCAD and that CRISPYweb also does. We could make a subclass like you suggest since they work almost exactly like Cas9 just with an editing window.
For the On-target, I think it is something we can add. I found this tool that could be used for inspiration: https://academic.oup.com/bioinformatics/article/38/24/5437/6769890?login=false . In terms of CRISPR efficacy I think it is not needed - most tools are simply not accurate enough - and all the wet lab scientists I know don't really believe in them and follow the approach of trying a few guides instead, which works super well.

For the nearmiss, I think it is a bit of an overkill since the computational load is pretty heavy.
