Closed Koeng101 closed 10 months ago
Would be happy to take this on! Here are two ways we could approach it:
`blastn`, `diamond`, `infernal`) to query several databases that are distributed with plannotate found here and are then aggregated with pandas. This option would result in fast annotation, and adding local alignment search natively to poly would unlock future functionality such as CRISPR gRNA design.

In short, both require calling out to external CLI tooling, but option 2 avoids a Python dependency at the cost of additional work.
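To make the "call out to external CLI tooling" part concrete, here is a rough Go sketch of how option 2 could drive `blastn` and read back its hits. It assumes `blastn` is on the PATH and uses the default 12-column tabular output (`-outfmt 6`). The `Hit` type and the `blastnCommand`, `parseTabular`, and `filterByIdentity` helpers are hypothetical names made up for illustration; none of this is existing poly or pLannotate code, and the sample hits are invented.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// Hit keeps the columns of one BLAST tabular (-outfmt 6) line that matter
// for annotation. Hypothetical type, not part of poly.
type Hit struct {
	Query           string
	Subject         string
	PercentIdentity float64
	Length          int
	QStart, QEnd    int
	EValue          float64
}

// blastnCommand builds (but does not run) a blastn call with tabular output.
func blastnCommand(queryFasta, db string) *exec.Cmd {
	return exec.Command("blastn", "-query", queryFasta, "-db", db, "-outfmt", "6")
}

// parseTabular reads default outfmt 6 columns:
// qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
func parseTabular(out string) ([]Hit, error) {
	var hits []Hit
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		f := strings.Split(line, "\t")
		if len(f) < 12 {
			return nil, fmt.Errorf("expected 12 columns, got %d: %q", len(f), line)
		}
		pident, err := strconv.ParseFloat(f[2], 64)
		if err != nil {
			return nil, err
		}
		length, err := strconv.Atoi(f[3])
		if err != nil {
			return nil, err
		}
		qstart, err := strconv.Atoi(f[6])
		if err != nil {
			return nil, err
		}
		qend, err := strconv.Atoi(f[7])
		if err != nil {
			return nil, err
		}
		evalue, err := strconv.ParseFloat(f[10], 64)
		if err != nil {
			return nil, err
		}
		hits = append(hits, Hit{f[0], f[1], pident, length, qstart, qend, evalue})
	}
	return hits, nil
}

// filterByIdentity keeps hits at or above a percent-identity cutoff.
func filterByIdentity(hits []Hit, minIdentity float64) []Hit {
	var kept []Hit
	for _, h := range hits {
		if h.PercentIdentity >= minIdentity {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	// File and database names are placeholders.
	cmd := blastnCommand("plasmid.fasta", "features_db")
	fmt.Println("would run:", cmd.Args)

	// Made-up output standing in for what cmd.Output() would return.
	sample := "plasmid1\tAmpR\t99.50\t861\t4\t0\t100\t960\t1\t861\t0.0\t1550\n" +
		"plasmid1\tori\t91.00\t589\t53\t0\t2000\t2588\t1\t589\t1e-120\t800\n"
	hits, err := parseTabular(sample)
	if err != nil {
		panic(err)
	}
	for _, h := range filterByIdentity(hits, 98.0) {
		fmt.Printf("%s: %s at %d-%d (%.1f%% identity)\n", h.Query, h.Subject, h.QStart, h.QEnd, h.PercentIdentity)
	}
}
```

In a real integration the string passed to `parseTabular` would come from `cmd.Output()`, and the same parser should also work for `diamond` if it is asked for the same tabular format.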
One callout: plannotate is distributed under the GNU GPL v3 license. This shouldn't impact option 1, as we do not plan to distribute poly with pLannotate, but it will impact users that may want to use poly + plannotate. Option 2 may be impacted, as poly may become a derivative work, which we don't want if we want to keep using the MIT license.
One bit here: DNA cannot be copyrighted. The most important thing that they've made, in my opinion, is the sweet, sweet database of part features. We should be able to use the raw sequences without infringing on any copyright. Translation to a whole new language means it probably isn't a derivative work at the code level.
I suppose I should be more specific with the desire here: I would like the abilities of plannotate, regardless of implementation. So option 2, though I don't think we have to care much about faithfully reproducing the core logic! We just need 98% matching to the full sequence, i.e., Table 1.
If the goal is to write no new code, you can probably just select down the possible matches using `mash`, then do a Needleman-Wunsch alignment using `align`. I've found it's really really really slow, though, but could work. There is a reason blast is a thing.
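As a rough illustration of that two-step idea, here is a stdlib-only Go sketch. It does not use poly's `mash` or `align` packages: the prefilter is a plain k-mer containment score standing in for a MinHash sketch, and the Needleman-Wunsch scoring (+1 match, -1 mismatch, -1 gap) is an arbitrary assumption. All function names and sequences are made up. Note that a global alignment only makes sense against a candidate region of similar length, so finding that region in the plasmid is left out here.

```go
package main

import "fmt"

// kmerSet returns the set of distinct k-mers in seq.
func kmerSet(seq string, k int) map[string]struct{} {
	set := make(map[string]struct{})
	for i := 0; i+k <= len(seq); i++ {
		set[seq[i:i+k]] = struct{}{}
	}
	return set
}

// containment is the fraction of the feature's k-mers found in the target.
// It is a simple stand-in for a mash-style prefilter.
func containment(feature, target map[string]struct{}) float64 {
	if len(feature) == 0 {
		return 0
	}
	shared := 0
	for kmer := range feature {
		if _, ok := target[kmer]; ok {
			shared++
		}
	}
	return float64(shared) / float64(len(feature))
}

// substitution scores one aligned pair of bases.
func substitution(x, y byte) int {
	if x == y {
		return 1
	}
	return -1
}

// needlemanWunsch globally aligns a and b and returns the alignment score
// plus identity (matching columns / alignment columns).
func needlemanWunsch(a, b string) (int, float64) {
	const gap = -1
	n, m := len(a), len(b)
	dp := make([][]int, n+1)
	for i := range dp {
		dp[i] = make([]int, m+1)
		dp[i][0] = i * gap
	}
	for j := 0; j <= m; j++ {
		dp[0][j] = j * gap
	}
	for i := 1; i <= n; i++ {
		for j := 1; j <= m; j++ {
			best := dp[i-1][j-1] + substitution(a[i-1], b[j-1])
			if v := dp[i-1][j] + gap; v > best {
				best = v
			}
			if v := dp[i][j-1] + gap; v > best {
				best = v
			}
			dp[i][j] = best
		}
	}
	// Trace back from the corner to count matches and alignment columns.
	matches, columns := 0, 0
	i, j := n, m
	for i > 0 || j > 0 {
		columns++
		switch {
		case i > 0 && j > 0 && dp[i][j] == dp[i-1][j-1]+substitution(a[i-1], b[j-1]):
			if a[i-1] == b[j-1] {
				matches++
			}
			i--
			j--
		case i > 0 && dp[i][j] == dp[i-1][j]+gap:
			i--
		default:
			j--
		}
	}
	if columns == 0 {
		return 0, 0
	}
	return dp[n][m], float64(matches) / float64(columns)
}

func main() {
	plasmid := "TTACGTACGTTTGGCCAAGG" // toy sequences
	feature := "ACGTACGT"
	k := 4

	// Step 1: cheap prefilter against the whole plasmid.
	c := containment(kmerSet(feature, k), kmerSet(plasmid, k))
	fmt.Printf("containment: %.2f\n", c)

	// Step 2: expensive global alignment, only for candidates that passed,
	// against a region of similar length (here picked by hand).
	if c >= 0.9 {
		score, identity := needlemanWunsch(feature, plasmid[2:10])
		fmt.Printf("score=%d identity=%.2f keep=%v\n", score, identity, identity >= 0.98)
	}
}
```

The dynamic-programming table is O(n·m) in both time and memory, which is a big part of why this gets slow on full plasmids and why a seeded heuristic like blast exists.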
The other option would be getting blast or minimap2 or the like integrated into Poly. I've been looking at doing this with biowasm, but there are some annoying points around getting that to work (I've been posting my work on Discord). It could also be done with cgo, but again, that is also annoying.
I would love this to be done and am very willing to help!
I've never really used DIAMOND, but I'd also be fine with just looking for perfect amino acid matches right now. The nucleotide matching is the important part for a version 1 IMO
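For the "perfect amino acid matches" idea, a minimal sketch could translate all six reading frames and look for exact substring hits, with no DIAMOND involved. The code below is stdlib-only Go and assumes the standard genetic code (NCBI translation table 1). The `ProteinMatch` type and the `translate`, `reverseComplement`, and `findProtein` helpers are hypothetical names, not poly's API, and the example sequence is invented.

```go
package main

import (
	"fmt"
	"strings"
)

// Standard genetic code in the compact NCBI layout: codons ordered by
// first, second, then third base, each base in the order T, C, A, G.
const bases = "TCAG"
const aminoAcids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

// translate converts DNA to protein, codon by codon; unknown bases give 'X'.
func translate(dna string) string {
	dna = strings.ToUpper(dna)
	var sb strings.Builder
	for i := 0; i+3 <= len(dna); i += 3 {
		a := strings.IndexByte(bases, dna[i])
		b := strings.IndexByte(bases, dna[i+1])
		c := strings.IndexByte(bases, dna[i+2])
		if a < 0 || b < 0 || c < 0 {
			sb.WriteByte('X')
			continue
		}
		sb.WriteByte(aminoAcids[a*16+b*4+c])
	}
	return sb.String()
}

// reverseComplement returns the reverse complement; non-ACGT bases become 'N'.
func reverseComplement(dna string) string {
	dna = strings.ToUpper(dna)
	out := make([]byte, len(dna))
	for i := 0; i < len(dna); i++ {
		var c byte = 'N'
		switch dna[i] {
		case 'A':
			c = 'T'
		case 'T':
			c = 'A'
		case 'C':
			c = 'G'
		case 'G':
			c = 'C'
		}
		out[len(dna)-1-i] = c
	}
	return string(out)
}

// ProteinMatch is an exact protein hit; Start is the 0-based position of the
// leftmost matched base on the forward strand.
type ProteinMatch struct {
	Strand byte // '+' or '-'
	Frame  int  // 0, 1, or 2
	Start  int
}

// findProtein reports every exact occurrence of protein in the six frames.
func findProtein(dna, protein string) []ProteinMatch {
	var matches []ProteinMatch
	if protein == "" {
		return matches
	}
	n := len(dna)
	strands := []struct {
		sign byte
		seq  string
	}{{'+', strings.ToUpper(dna)}, {'-', reverseComplement(dna)}}
	for _, s := range strands {
		for frame := 0; frame < 3 && frame <= n; frame++ {
			translated := translate(s.seq[frame:])
			for from := 0; ; {
				idx := strings.Index(translated[from:], protein)
				if idx < 0 {
					break
				}
				p := from + idx
				start := frame + 3*p
				if s.sign == '-' {
					// Convert reverse-strand coordinates back to forward ones.
					start = n - (start + 3*len(protein))
				}
				matches = append(matches, ProteinMatch{s.sign, frame, start})
				from = p + 1
			}
		}
	}
	return matches
}

func main() {
	dna := "CCATGAAATGGTGAG" // toy sequence encoding M K W stop in frame 2
	for _, m := range findProtein(dna, "MKW") {
		fmt.Printf("strand %c frame %d start %d\n", m.Strand, m.Frame, m.Start)
	}
}
```

This only catches features whose translation matches exactly, so a single substitution causes a miss. That fits the "version 1" scope described above but is no replacement for a real protein aligner.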
I'm not sure about the approach, but this may be a good external resource to start with?
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8262757/
I'd like to get the plannotate auto annotation suite working with poly.
Here's a link to the code: https://github.com/mmcguffi/pLannotate/tree/master
Basically, this would let us auto-annotate plasmids. A very useful task!