bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License
663 stars 70 forks source link

Create top-level `types.go` to define types shared across packages #347

Open carreter opened 11 months ago

carreter commented 11 months ago

Is your feature request related to a problem? Please describe. As poly grows, more and more types will be shared between nominally unrelated packages.

For example, many packages will deal with sequences. Currently, we use the string type to represent sequences in most of the packages. However, sequences often also need to be annotated with whether they are circular or linear, as is already done in the clone package and as we are considering doing to resolve #225 :

https://github.com/TimothyStiles/poly/blob/10b8f398c4272700dd27a177af244d6b531d2582/clone/clone.go#L56-L63

Describe the solution you'd like A new file (types.go) in the root of the repository that defines types commonly used across packages. Whenever it feels like you're reusing a similar struct or interface across packages, it should be placed in said file.

Describe alternatives you've considered

Koeng101 commented 11 months ago

This is gonna be entirely unhelpful, but lemme just put some comments below -

  1. When I physically get oligos from IDT, I often add in modifications to the sequence. For example GTAAAACGACGGCCAGTACTTGGTTCAGGTGGAGTAGGGATAA/iMe-dC/AGGGTAATGGTGTCAATCGTCGGAGCATGGTCATAGCTGTTTCCTG, and this is a physically different molecule than GTAAAACGACGGCCAGTACTTGGTTCAGGTGGAGTAGGGATAACAGGGTAATGGTGTCAATCGTCGGAGCATGGTCATAGCTGTTTCCTG, as represented by the lil methylation - although the sequence representation WILL be the same, and any subsequent PCRs or anything are gonna remove that methylation. This actually have a meaningful impact on my cloning reactions that I just have to kinda work around right now.
  2. The representation of overhanged fragments isn't great. For example, lets say I have two oligos AAAAGGGG and CCCCAAAA. Well, these anneal to each other at the GC sites, giving us a new fragment with 2 AAAA overhangs, which then we could use for cloning or the like (this is another real thing I do). However, we have no way of representing these kinds of fragments: are the forward and reverse on 5' or 3' end? Are they phosphorylated (this MATTERS and often it is assumed if they come from fragments, but cannot be assumed if they come from synthesized products).
  3. Seqhash doesn't work with either of the above
  4. AFAIK, there literally aren't any standard open-source formats for things like internal DNA modifications or representation of annealed oligos. So I don't have a way of storing this kind of information. Right now I do some REAL hacky duct-tape shit to make this work with my design software.

Now, opinions on types in general:

  1. Parser types should live with their parsers, and not be shared. Import the package if you needa use something with a parser type.
  2. I cannot think of a type other than "Part" that fits the general category.
  3. IMO part information, when it comes to PCR and cloning, have to be WAY more detailed than anything else. Perhaps this struct should specifically be for synthetic biology-related functions. Maybe we have a synbio package that the pcr, synthesis and clone subpackages get thrown into
cachemoi commented 10 months ago

Just addding my 2 cents on this, I think it'd be nice to have a package with a good name that has a clear scope (so not named types.go but the idea of pooling very common data structures e.g DNA is good)

The reason why is articulated well here

A common cause of poor package names is what call utility packages. These are packages where common helpers and utility code congeals over time. As these packages contain an assortment of unrelated functions, their utility is hard to describe in terms of what the package provides. This often leads to the package’s name being derived from what the package contains--utilities.

Package names like utils or helpers are commonly found in larger projects which have developed deep package hierarchies and want to share helper functions without encountering import loops. By extracting utility functions to new package the import loop is broken, but because the package stems from a design problem in the project, its name doesn’t reflect its purpose, only its function of breaking the import cycle.

It would be good to at least Alias the different sequences to make function signatures clearer and make it impossible to use the wrong sequence type, aka instead of

Translate(string)
Translate(dna.Sequence)

You can put the parsers in those packages to get nice semantics too (e.g dna.FromGenbank() dna.FromFasta() ect...)

A type alias allows you to keep both GTAAAACGACGGCCAGTACTTGGTTCAGGTGGAGTAGGGATAACAGGGTAATGGTGTCAATCGTCGGAGCATGGTCATAGCTGTTTCCTG and GTAAAACGACGGCCAGTACTTGGTTCAGGTGGAGTAGGGATAA/iMe-dC/AGGGTAATGGTGTCAATCGTCGGAGCATGGTCATAGCTGTTTCCTG in your code, just making it clearer that it's a DNA seq and typesafe.

Ultimately though I suspect it would be nicer (but a lot harder) to have a DNA struct that all poly package use, and that way the parsers can all do their best to fill that Poly struct. e.g for the methylation example we'd add a methylationSites map[index]struct{}{} to the DNA struct.

Koeng101 commented 10 months ago

I also think I agree that methylationSites should be separate - perhaps like map[int]MethylationSite where MethylationSite is just a byte that representing that specific internal mutation.

For the sequence itself - I'm not sure if a dna package is really needed. For the majority of stuff, only a sequence is necessary. For a smaller amount, sequence + circular. For an even smaller amount, methylation and overhangs. Right now, we would be putting Part into types, even though the end game the two uses WOULD probably be different - clone requires the complicated methylation data, pcr does not.

TimothyStiles commented 10 months ago

I agree with @cachemoi regarding package names like type or util. The subpackages poly has been split into have been very carefully made to avoid kitchen sink-like packages that make it hard for devs to find what they need.

We do have a sub package called alphabet that @ragnorc made which defines dna and protein sequences but I believe it's being exclusively for the align package as the scoring matrices we borrowed from biogo needed them to function properly.

As far as sharing types between packages are there any pressing corner cases where we need to do this? It's an interesting thought but right now I don't think there's anything that's particularly broken.

carreter commented 10 months ago

Also agreed re: avoiding type/util package names. To be clear, what I was proposing was an additional file types.go in the top-level poly package.

For the sequence itself - I'm not sure if a dna package is really needed. For the majority of stuff, only a sequence is necessary. For a smaller amount, sequence + circular. For an even smaller amount, methylation and overhangs. Right now, we would be putting Part into types, even though the end game the two uses WOULD probably be different - clone requires the complicated methylation data, pcr does not.

This could be solved using Go's interfacing system I think. Something like this would go in the root of the package in types.go:

type DNA struct {} // Carries sequence, circularity, methylation, and overhang info

type Sequence interface {
  Sequence() string
}

type Part interface {
  IsCircular() bool
}

type Methylated interface {
  MethylationSites() map[int]struct{}{}
}

type Overhanged interface {
  LeftOverhang() string
  RightOverhang() string
}

Then, client packages can compose the interfaces to specify what functionality it is they need from that DNA struct.

github-actions[bot] commented 8 months ago

This issue has had no activity in the past 2 months. Marking as stale.