WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

[Not issue] Thank you - fantastic for counting CRISPR guides #145

Closed: mjafin closed this issue 2 years ago

mjafin commented 3 years ago

Hi there, many thanks for this package. I wasn't sure where to leave a thank-you note, but this package is absolutely fantastic in our application, where we have a library of 100k+ CRISPR guides that we have to count in a stream of millions of DNA sequencing reads. It does this faster than the C program we previously used for the purpose, and it lets us stick to just Python code in our pipeline.

Best wishes, Miika (AstraZeneca Functional Genomics Centre)

WojciechMula commented 3 years ago

Thank you, what a great testimonial! I'm happy that our work proved useful for your job, and I'm happy that Python survived; working with this language is pure joy.

If you don't mind, I'll quote your post in our documentation. It's so rewarding. :)

BTW, I'm not familiar with genomics. Could you please briefly describe the problem you solved, or point me to the algorithm?

WojciechMula commented 3 years ago

@mjafin I have a completely separate question: do you use https://github.com/samtools/samtools?

mjafin commented 3 years ago

> Thank you, what a great testimonial! I'm happy that our work appeared useful for your job. I'm happy Python survived, working with this language is pure joy.
>
> If you don't mind, I'll quote your post in our documentation, it's so rewarding. :)
>
> BTW I'm not familiar with genomics, could you please briefly describe the problem you solved? Or provide some pointer to the algorithm.

The problem we have is perhaps best illustrated in this figure: https://www.sciencedirect.com/science/article/pii/S1044579X17302742#fig0015. In Fig. 3A you can see three "guides" drawn as the purple, green and yellow wiggles. In a CRISPR experiment, each of these guides cuts a gene, causing that gene to be knocked out (or knocked in).

Functional Genomics is about knocking out "all" (20k) genes, not just three, in one experiment and seeing whether each knockout kills the cells, does nothing, or causes growth. This helps us, for example, identify new gene targets that could preferentially kill cancer cells and leave normal cells alone. After the experiment is finished, some of the guides are no longer in the population (green in Fig. 3A) because the cells carrying them died, while others (purple, yellow) can still be observed because cutting the gene had no effect. The guides can be sequenced using a DNA sequencer, and the Aho-Corasick algorithm is then applied to count the presence or absence of the 20k+ different guides in the sequencing data. Does this help?
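For readers who want to see the counting step concretely, here is a minimal pure-Python sketch of the idea. It is not the actual pipeline, and it is not how pyahocorasick is implemented (the package does this work in a C extension); the function names `build_automaton` and `count_guides` and the toy sequences are made up for illustration. It assumes the guides are unique strings and counts every occurrence, including overlapping ones.

```python
from collections import deque


def build_automaton(guides):
    """Build an Aho-Corasick automaton (trie plus failure links)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> guides ending at this state
    for guide in guides:
        state = 0
        for ch in guide:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(guide)

    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit guides that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_guides(guides, reads):
    """Count occurrences of every guide across all reads in one pass per read."""
    goto, fail, out = build_automaton(guides)
    counts = {guide: 0 for guide in guides}
    for read in reads:
        state = 0
        for ch in read:
            # Follow failure links until a transition on ch exists (or we hit the root).
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            for guide in out[state]:
                counts[guide] += 1
    return counts


# Toy data: real guides and reads are much longer.
guides = ["ACGT", "GTTA", "TTTT"]
reads = ["CCACGTTAGG", "ACGTACGT", "GGGGGG"]
print(count_guides(guides, reads))  # {'ACGT': 3, 'GTTA': 1, 'TTTT': 0}
```

The point of the automaton is that each read is scanned once regardless of how many guides are in the library, which is why the approach scales to a large guide library and millions of reads. A guide with a count of zero corresponds to the "no longer in the population" case described above.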

mjafin commented 3 years ago

> @mjafin I have a completely separate question: do you use https://github.com/samtools/samtools?

Definitely, it's one of the standard tools of sequence analysis, though in a slightly different context. In our specific CRISPR application we're not terribly interested in the sequencing data itself, as we're merely counting tags. However, in most sequencing applications, where the sequence itself needs to be analysed for modifications and the like, samtools is part of the standard toolbox.

WojciechMula commented 3 years ago

@mjafin Thanks a lot for the explanation, that's a really cool application of Aho-Corasick.

I have a little research project related to speeding up one function of samtools. Could you please contact me via email (wojciech_mula@poczta.onet.pl) or LinkedIn?

mjafin commented 3 years ago

Sounds good, I've pinged you on LinkedIn.

pombredanne commented 2 years ago

@mjafin I added your quote to the readme ... See https://github.com/WojciechMula/pyahocorasick/pull/160/commits/faf82874192c519f237dd2e4b89049610a4c67eb for the change. I hope this is OK; otherwise, please advise what would work best! Thank you again.