bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License

Poly n Plannotate #396

Closed Koeng101 closed 10 months ago

Koeng101 commented 11 months ago

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8262757/

I'd like to get the plannotate auto annotation suite working with poly.

Here's a link to the code: https://github.com/mmcguffi/pLannotate/tree/master

Basically, this would let us auto-annotate plasmids. A very useful task!

abondrn commented 11 months ago

Would be happy to take this on! Here are 2 ways we could approach it:

  1. Closely integrate with the plannotate batch CLI, which takes FASTA files and produces output files (GenBank or CSV). The obvious benefit is that this is quicker to implement, test, and review; and because it calls out to Python via the CLI, future updates to plannotate would not have to be ported to Go before we could use them. This is what I am leaning towards.
  2. Faithfully port the core logic, which uses several CLI tools (blastn, diamond, infernal) to query several databases that are distributed with plannotate (found here); the hits are then aggregated with pandas. This option would result in fast annotation, and adding local alignment search natively to poly would unlock future functionality such as CRISPR gRNA design.

In short, both require calling out to external CLI tooling; option 2 avoids the Python dependency but requires additional work as a result.
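A minimal sketch of what option 1 could look like from the Go side, using only `os/exec`. The function names are hypothetical (not part of poly), and the `--input` / `--output` flag names are assumptions that should be checked against `plannotate batch --help` before relying on them.

```go
package main

import (
	"fmt"
	"os/exec"
)

// batchArgs builds the argument list for the pLannotate batch CLI.
// ASSUMPTION: the --input and --output flag names are not confirmed here;
// verify them with `plannotate batch --help`.
func batchArgs(fastaPath, outDir string) []string {
	return []string{"batch", "--input", fastaPath, "--output", outDir}
}

// Annotate (hypothetical helper) shells out to a `plannotate` binary on PATH
// and includes the combined stdout/stderr in the error if the call fails.
func Annotate(fastaPath, outDir string) error {
	out, err := exec.Command("plannotate", batchArgs(fastaPath, outDir)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("plannotate batch failed: %w\n%s", err, out)
	}
	return nil
}

func main() {
	// Only print the command here; running it requires a local pLannotate install.
	fmt.Println("plannotate", batchArgs("plasmid.fa", "out"))
}
```

Keeping argument construction separate from execution makes the wrapper testable without Python or pLannotate installed.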

abondrn commented 11 months ago

One callout: plannotate is distributed under the GNU GPL v3 license. This shouldn't impact option 1, as we do not plan to distribute poly with pLannotate, but it will impact users that may want to use poly + plannotate. Option 2 may be impacted, as poly may become a derivative work, which we don't want if we want to keep using the MIT license.

Koeng101 commented 11 months ago

> One callout: plannotate is distributed under the GNU GPL v3 license. This shouldn't impact option 1, as we do not plan to distribute poly with pLannotate, but it will impact users that may want to use poly + plannotate. Option 2 may be impacted, as poly may become a derivative work, which we don't want if we want to keep using the MIT license.

One bit here: DNA cannot be copyrighted. The most important thing that they've made, in my opinion, is the sweet, sweet database of part features. We should be able to use the raw sequences without infringing on any copyright. Translation to a whole new language means it probably isn't a derivative work at the code level.

I suppose I should be more specific about the desire here: I would like the abilities of plannotate, regardless of implementation. So option 2, though I don't think we have to care much about faithfully reproducing the core logic! We just need 98% matching to the full sequence, i.e., Table 1.

If the goal is to write no new code, you can probably just select down the possible matches using mash, then do a Needleman-Wunsch alignment using align. I've found it's really, really, really slow, though it could work. There is a reason blast is a thing.
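The prefilter-then-align idea above can be sketched with the standard library alone. This is not poly's mash or align API: exact k-mer containment stands in for mash's MinHash estimate, and a semi-global edit-distance recurrence (Needleman-Wunsch-style, with free end gaps on the plasmid side) stands in for the alignment step. All names and the 0.5 prefilter cutoff are assumptions; the 0.98 identity cutoff is the 98% figure mentioned earlier. It ignores the reverse strand and plasmid circularity.

```go
package main

import (
	"fmt"
	"sort"
)

// kmerSet returns the set of all k-mers in seq.
func kmerSet(seq string, k int) map[string]struct{} {
	set := make(map[string]struct{})
	for i := 0; i+k <= len(seq); i++ {
		set[seq[i:i+k]] = struct{}{}
	}
	return set
}

// containment is the fraction of the feature's k-mers present in the target.
// It is a cheap stand-in for a mash-style sketch comparison.
func containment(feature, target string, k int) float64 {
	fk := kmerSet(feature, k)
	if len(fk) == 0 {
		return 0
	}
	tk := kmerSet(target, k)
	shared := 0
	for kmer := range fk {
		if _, ok := tk[kmer]; ok {
			shared++
		}
	}
	return float64(shared) / float64(len(fk))
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// fitDistance returns the minimum edit distance between the whole feature and
// any substring of target. Row 0 is all zeros, so the match may start anywhere.
func fitDistance(feature, target string) int {
	prev := make([]int, len(target)+1)
	cur := make([]int, len(target)+1)
	for i := 1; i <= len(feature); i++ {
		cur[0] = i
		for j := 1; j <= len(target); j++ {
			cost := 1
			if feature[i-1] == target[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j-1]+cost, prev[j]+1, cur[j-1]+1)
		}
		prev, cur = cur, prev
	}
	best := prev[0]
	for _, d := range prev { // the match may also end anywhere
		if d < best {
			best = d
		}
	}
	return best
}

// annotate returns the names of features that pass the k-mer prefilter
// (assumed cutoff 0.5) and then match the plasmid at >= 98% identity.
func annotate(plasmid string, features map[string]string, k int) []string {
	var hits []string
	for name, seq := range features {
		if len(seq) == 0 || containment(seq, plasmid, k) < 0.5 {
			continue
		}
		identity := 1 - float64(fitDistance(seq, plasmid))/float64(len(seq))
		if identity >= 0.98 {
			hits = append(hits, name)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	features := map[string]string{"demo_feature": "ACGTACGTAC", "unrelated": "GGGCCCGGGCCC"}
	fmt.Println(annotate("TTTTACGTACGTACGGGG", features, 4))
}
```

The alignment step is quadratic in sequence length per feature, which is consistent with the slowness noted above; the prefilter only helps by cutting the number of features that reach it.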

The other option would be getting blast or minimap2 or the like integrated into Poly. I've been looking at doing this with biowasm, but there are some annoying points around getting that to work (been posting my work on discord). It could also be done with cgo, but again, that is also annoying.

I would love this to be done and am very willing to help!

Koeng101 commented 11 months ago

I've never really used DIAMOND, but I'd also be fine with just looking for perfect amino acid matches right now. The nucleotide matching is the important part for a version 1, IMO.
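A rough sketch of what "perfect amino acid matches" could mean in practice: translate the plasmid in all six reading frames with the standard genetic code and look for the protein as an exact substring. The function names are hypothetical and not poly's translation API; the sketch assumes uppercase ACGT input and does not handle matches that wrap around the origin of a circular plasmid.

```go
package main

import (
	"fmt"
	"strings"
)

// Standard genetic code (NCBI translation table 1), with codons enumerated
// in T, C, A, G order at each of the three positions.
const aminoAcids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

var baseIndex = map[byte]int{'T': 0, 'C': 1, 'A': 2, 'G': 3}

// translate converts uppercase DNA to protein, reading from position 0.
// Codons containing anything other than A, C, G, T become 'X'.
func translate(dna string) string {
	var sb strings.Builder
	for i := 0; i+3 <= len(dna); i += 3 {
		a, okA := baseIndex[dna[i]]
		b, okB := baseIndex[dna[i+1]]
		c, okC := baseIndex[dna[i+2]]
		if !(okA && okB && okC) {
			sb.WriteByte('X')
			continue
		}
		sb.WriteByte(aminoAcids[16*a+4*b+c])
	}
	return sb.String()
}

// reverseComplement returns the reverse complement; unknown bases become 'N'.
func reverseComplement(dna string) string {
	comp := map[byte]byte{'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
	out := make([]byte, len(dna))
	for i := 0; i < len(dna); i++ {
		c, ok := comp[dna[i]]
		if !ok {
			c = 'N'
		}
		out[len(dna)-1-i] = c
	}
	return string(out)
}

// hasExactProtein reports whether protein occurs verbatim in any of the
// six reading frames of dna (three per strand).
func hasExactProtein(dna, protein string) bool {
	for _, strand := range []string{dna, reverseComplement(dna)} {
		for frame := 0; frame < 3 && frame <= len(strand); frame++ {
			if strings.Contains(translate(strand[frame:]), protein) {
				return true
			}
		}
	}
	return false
}

func main() {
	plasmid := "CCATGAAATTTGGGTAACC"
	fmt.Println(hasExactProtein(plasmid, "MKFG"))
}
```

Exact substring search needs no aligner at all, which is why it is a reasonable placeholder until something DIAMOND-like is available.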

TimothyStiles commented 10 months ago

I'm not sure about the approach, but this may be a good external thing to start?