bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License

Add stats to codon tables #350

Closed cachemoi closed 9 months ago

cachemoi commented 11 months ago

This PR takes care of this issue

We want to add a count of the observed start codon occurrences to a generated codon table; this might be of interest for checking occurrences of alternative start codons.

I've added a table test for generateCodonTable that checks we get the expected full object. I also made it so the stats are added together in AddCodonTable, and CompromiseCodonTable uses the stats from the first table (I think that's the intended behavior, but let me know if not!)

I realise that the original issue wanted to add the codon stats to the table struct, but after doing this I think it might be better to create a ComputeTableStats(t Table) Stats function, or add a ComputeStats() method to the Table interface, rather than saving the count on the codonTable struct.

This is because otherwise we might forget to update the stats when mutating the table (e.g. when calling OptimizeTable, which I think requires updating the start codon stats). Let me know what you think.
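To make the idea concrete, here is a minimal sketch of the kind of start codon counting being discussed. The function name and shape are hypothetical, not the PR's actual implementation: it just tallies the first triplet of each coding sequence.

```go
package main

import "fmt"

// countStartCodons is an illustrative sketch (not poly's API): it counts
// how often each triplet appears as the first codon of a coding sequence,
// which surfaces alternative start codons like GTG alongside ATG.
func countStartCodons(codingSequences []string) map[string]int {
	counts := make(map[string]int)
	for _, seq := range codingSequences {
		if len(seq) >= 3 {
			counts[seq[:3]]++
		}
	}
	return counts
}

func main() {
	cds := []string{"ATGAAATAA", "GTGCCCTAA", "ATGGGGTAA"}
	fmt.Println(countStartCodons(cds)) // map[ATG:2 GTG:1]
}
```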

Koeng101 commented 11 months ago

This is an interesting example of how I think we could improve the readability of this package. I'd like to know your take, @cachemoi.

So the problem here is that there are actually two things we're talking about: a codon table, and a codon usage table. A codon table is just the mapping of codons to amino acids (defaultCodonTablesByNumber holds the NCBI definitions of each codon table). For translating genes, this is just fine: since the code is degenerate, we can match a set of codons to 1 amino acid (64->20). For optimizing genes, this is not enough information: we need to know the weights of each codon-to-amino-acid mapping, because if we don't, we'll use crappy codons that will significantly impact expression of our proteins (not all codons are made equal!)

What you've done with stats.StartCodonCount[triplet]++ shows this package lacks clear documentation on the difference between these 2 concepts. It specifically affects the codon table, but not the codon usage table, which is the interesting bit biologically speaking (since we already know the start codons in the codon table).

Hence:

how many GTG start codons there are vs ATG start codons

So actually, all the real magic for the use of this package can be found in the tests:

func ExampleOptimize() {
    gfpTranslation := "MASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTISFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYITADKQKNGIKANFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*"

    sequence, _ := genbank.Read("../../data/puc19.gbk")
    codonTable := codon.GetCodonTable(11)

    // a string builder to build a single concatenated string of all coding regions
    var codingRegionsBuilder strings.Builder

    // initiate genes
    genes := 0
    // iterate through the features of the genbank file and if the feature is a coding region, append the sequence to the string builder
    for _, feature := range sequence.Features {
        if feature.Type == "CDS" {
            sequence, _ := feature.GetSequence()
            // Note: sometimes, genbank files will have annotated CDSs that are pseudo genes (not having triplet codons).
            // This will shift the entire codon table, messing up the end results. To fix this, make sure to do a modulo
            // check.
            if len(sequence)%3 == 0 {
                codingRegionsBuilder.WriteString(sequence)

                // Another good double check is to count genes, then count stop codons.
                genes++
            }
        }
    }

    // get the concatenated sequence string of the coding regions
    codingRegions := codingRegionsBuilder.String()

    // weight our codon optimization table using the regions we collected from the genbank file above
    optimizationTable := codonTable.OptimizeTable(codingRegions)

    // Here, we double check if the number of genes is equal to the number of stop codons
    stopCodonCount := 0
    for _, aa := range optimizationTable.GetAminoAcids() {
        if aa.Letter == "*" {
            for _, codon := range aa.Codons {
                stopCodonCount = stopCodonCount + codon.Weight
            }
        }
    }
    if stopCodonCount != genes {
        fmt.Println("Stop codons don't equal number of genes!")
    }

    optimizedSequence, _ := codon.Optimize(gfpTranslation, optimizationTable)
    optimizedSequenceTranslation, _ := codon.Translate(optimizedSequence, optimizationTable)

    fmt.Println(optimizedSequenceTranslation == gfpTranslation)
    // output: true
}

Ideally, within the loop finding feature sequences, we would also add the counting of start codons. So what might be really useful here is integrating that example function as a method of the codon table, so that we can simply call a method and get a table ready to optimize a protein that happens to have start codon stats.

cachemoi commented 11 months ago

Thanks for the thorough explanation @Koeng101! Sorry about the misunderstanding, my biology is (very) rusty 😅

So if I understand correctly, the meat of this package is to change a coding sequence so that we use some codons more frequently than others, e.g. to make a gene from one species easier to translate in another.

Looking through the code, my instinct is that the Table interface is not really necessary, since as far as I can tell there is only a single implementation (codonTable) and it looks like it's only really used as a data container. The 2 main functions that consumers of the package will use are:

func Translate(sequence string, codonTable Table) (string, error) 
func Optimize(aminoAcids string, codonTable Table, randomState ...int) (string, error)

the Table interface looks like this:

// Table is an interface that specifies the functions that all table types must implement
type Table interface {
    Chooser() (map[string]weightedRand.Chooser, error)
    GenerateTranslationTable() map[string]string
    GenerateStartCodonTable() map[string]string
    GetAminoAcids() []AminoAcid
    GetStartCodons() []string
    GetStopCodons() []string
    IsEmpty() bool
    OptimizeTable(string) Table
}

Apart from the data accessor funcs, the functions with extra logic in them are OptimizeTable and Chooser().

My understanding of OptimizeTable is that it counts the codons observed in a sequence, mutates the table to fit those observed codon weights, and then returns the table. Chooser() is only used in Optimize() to give a weighted amino acid "picker".

I think it's nice to have structs named like <Thing>er which contain the functions the package users care about, only exposing the bare minimum they need to do their tasks. The consumer can write the interfaces themselves (accept interfaces, return structs). So, for example, the function Chooser() can be private, since it should never really be used outside of Optimize().

What do you think about separating the key functions in small structs? For example:

type Optimizer struct {
    table   codonTable
    chooser weightedRand.Chooser
}

// (previously codonTable.OptimizeTable)
func NewOptimizerFromSequence(sequence string) *Optimizer {
    // code
}

func NewOptimizerFromTable(table codonTable) *Optimizer {
    // code
}

func (o *Optimizer) Optimize(sequence string) string {
    // code

    return optimizedSequence
}

// (Translator could probably live in another package)
type Translator struct {
    codonTable Table
}

func (t *Translator) Translate(sequence string) string {
    // code

    return translatedSequence
}

This would (to me) make the purpose of the package a bit clearer and reduce the number of exposed funcs, making things a bit simpler.

I've pushed an example of what that would look like for the stats computation: codon.Analyser, which computes stats given a genbank sequence. I think it's not exactly what you want (you want some stats on the start codon frequencies observed every time we do a Translate or Optimize), but it's just to give an idea of the structures I usually work with. I'll work more on it to get the stats we want tomorrow.

Let me know if you think that would make the package clearer (not suggesting I'd do this in this PR, just some thoughts).

carreter commented 11 months ago

I know the <action>er interface pattern is idiomatic Go, but I'm personally wary of overusing single-method interfaces.

Separating functions across all these interfaces makes usage more complicated as clients have to instantiate and keep track of an additional object. If all Tables should be able to be optimized + take a sequence to be translated (which IMO they should), these methods should be part of the Table interface.

Reviewing now and leaving more detailed thoughts.

cachemoi commented 11 months ago

I know the er interface pattern is idiomatic Go, but I'm personally wary of overusing single-method interfaces.

Separating functions across all these interfaces makes usage more complicated as clients have to instantiate and keep track of an additional object. If all Tables should be able to be optimized + take a sequence to be translated (which IMO they should), these methods should be part of the Table interface.

If the object instantiation is kept clear and we make it impossible (or at least hard) to end up with an invalid object, that's not really an issue (which is why I'm suggesting the NewOptimizerFromTable and NewOptimizerFromSequence constructors)

I'm not overly fond of single method interfaces either (and my suggested names stutter), I was just struggling to come up with a good name for a struct which does both Translation and Optimisation 😅

Just to reiterate, the point of those structs is to make interfacing/mocking/testing easier for consumers; if you keep them as top-level package functions, they become harder to swap out

carreter commented 11 months ago

If the object instantiation is kept clear and we make it impossible (or at least hard) to end up with an invalid object, that's not really an issue (which is why I'm suggesting the NewOptimizerFromTable and NewOptimizerFromSequence constructors)

It's an issue for readability, as it adds a new interface + associated struct to our API where we could just have a single function instead.

I'm not overly fond of single method interfaces either (and my suggested names stutter), I was just struggling to come up with a good name for a struct which does both Translation and Optimisation 😅

Translation is going to be something we do with pretty much every codon table, so I think it belongs in the Table interface. I could see an argument for writing it the following way though:

type Translator interface {
  Translate(string) string
}

type Table interface {
  Translator
  // Other methods
}

Just to reiterate, the point of those structs is to make interfacing/mocking/testing easier for consumers; if you keep them as top-level package functions, they become harder to swap out

How so? Take this example:

// Our library code.
func Do(input string) string {
  // code here
  return input
}

// Client's code.
type DoFunc func(string) string

type Something struct {
  doFunc DoFunc
  // other members
}

If the client wants to mock Do, they can just write their own function that satisfies the DoFunc type. No need for additional structs/interfaces.

If we could reasonably expect Do to need to store state, then I would understand making a Doer struct. However, all of the functions we're talking about here (namely: Translate, Optimize, and ComputeStartCodonStatistics) should be pure.

Koeng101 commented 11 months ago

So if I understand correctly, the meat of this package is to change a coding sequence so that we use some codons more frequently than others, e.g. to make a gene from one species easier to translate in another.

Correct. It is also used in our [gene fixer](https://github.com/TimothyStiles/poly/blob/main/synthesis/fix/synthesis.go) for fixing up coding sequences so they can be synthesized more easily. So it's basically Translate, Optimize, and whatever fix.Cds uses - which I think is the struct.

the Table interface looks like this:

I think the table interface could definitely be simplified. I'm not tied to it, and would encourage any simplification.

What do you think about separating the key functions in small structs? For example:

I like it, I'm sold.

it's just to give an idea of the structures I usually work with. I'll work more on it to get the stats we want tomorrow.

Honestly, I think the stats are useful, but separately from the start codon issue. For start codons it's pretty simple: just gimme the occurrence count of the first 3 base pairs for every protein used. When I think of codon statistics, I'm thinking of codon distribution and rare codon counts, stuff like that. Beyond all that, though, I think simplification of this package is more important than stats or start codons - I think you've laid out a way it could be made easier and more readable.

Separating functions across all these interfaces makes usage more complicated as clients have to instantiate and keep track of an additional object. If all Tables should be able to be optimized + take a sequence to be translated (which IMO they should), these methods should be part of the Table interface.

Translation tables are very easy to get (we actually have literally all the translation tables in a map), optimization tables not so much. It might be nice to more easily say "this is a translation table" and "this is an optimization table", since the first we can make very easily accessible, and the second can have methods for generating.

Translation is going to be something we do with pretty much every codon table

Here's some more context: you must know a translation table in order to generate an optimization table. An optimization table is then just a translation table extended with counts for each codon occurrence. I stuck 'em both in the same struct 'cause I sucked more at Golang back then. Nowadays I'd probably just have a translation table and then a separate map[string]int to count each codon, and that is where the weights go. An optimization table is just a translation table + that map. That more clearly differentiates what actually makes them different.

Then, we kinda just added functionality on top of that foundation. But fundamentally:

  1. We have a map of amino acids to triplets (so, map[rune][]string - stop codons can just be encoded in this, and ATG assumed to be the start codon for most synthetic biology applications)
  2. We have a map of weights (map[string]int)
  3. The two functions we have are Translate() on top of the 1st map[rune][]string, and Optimize on top of the 2nd map[string]int
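The three points above can be sketched in a few dozen lines. This is a toy illustration of the two maps and two functions described, with made-up names and a tiny three-amino-acid alphabet, and a deterministic "heaviest codon wins" stand-in for the weighted choice poly actually does:

```go
package main

import (
	"fmt"
	"strings"
)

// 1. The translation structure: amino acid -> codons (map[rune][]string),
// with stop codons encoded under '*'. Toy subset for illustration.
var aminoAcidToCodons = map[rune][]string{
	'M': {"ATG"},
	'K': {"AAA", "AAG"},
	'*': {"TAA", "TAG", "TGA"},
}

// translate inverts the first map (codon -> amino acid) and walks the
// sequence in triplets.
func translate(sequence string) string {
	codonToAminoAcid := map[string]rune{}
	for aa, codons := range aminoAcidToCodons {
		for _, c := range codons {
			codonToAminoAcid[c] = aa
		}
	}
	var out strings.Builder
	for i := 0; i+3 <= len(sequence); i += 3 {
		out.WriteRune(codonToAminoAcid[sequence[i:i+3]])
	}
	return out.String()
}

// optimize uses the second map (codon -> weight, map[string]int) to pick,
// for each amino acid, its heaviest observed codon.
func optimize(aminoAcids string, weights map[string]int) string {
	var out strings.Builder
	for _, aa := range aminoAcids {
		best := aminoAcidToCodons[aa][0]
		for _, c := range aminoAcidToCodons[aa] {
			if weights[c] > weights[best] {
				best = c
			}
		}
		out.WriteString(best)
	}
	return out.String()
}

func main() {
	fmt.Println(translate("ATGAAATAA"))                              // MK*
	fmt.Println(optimize("MK*", map[string]int{"AAG": 5, "TGA": 2})) // ATGAAGTGA
}
```

The point of the sketch is the separation: translation needs only the first map, optimization needs the first map plus the weights, which mirrors "an optimization table is just a translation table + that map".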

Then, we have a couple things we can do with those codon tables:

  1. AddCodonTable to build for multi-chromosomal organisms
  2. CompromiseCodonTable to build for cross-organism synthesis

Of course, there are nice things about the struct right now (it generates clearer JSON, which I consider a win, because I mainly want this to replace the codon usage database at some point), so don't take the map literally - but that said, we can remove literally any code that doesn't help us accomplish those ~4ish tasks, and should only add code that helps us generate or use those 2 distinct tables more effectively.

Koeng101 commented 11 months ago

I guess I'll reiterate that simplification and communication of what we want in this package is more important than any stats about codon tables, because pretty much the only thing stats of codon tables will be used for is my own edification when I browse through the genbank databank converted to codon tables...

cachemoi commented 11 months ago

I tried to have a go at re-jigging the structures a bit @Koeng101, but this means the PR has grown 😅. Let me know if it makes sense to you. Basically, I removed most simple data accessor functions from the Table interface and moved them to a constructor function for a codonTable. I also moved the loop that looked for a CDS in most examples to a package function.

This means it's harder to end up with an invalid table, we don't recompute translation maps on every run, and the exposed interface is cleaner (most of those functions were only used in internal codon package logic).

I did not create a separate OptimizationTable struct because to me it makes sense that you could optimise with any table, since optimizing with a table with no special weights would just be a no-op.

You could also repeatedly update weights on the same table, hence the codonTable.UpdateWeights function.

Let me know what you think too @carreter, I removed the separate Analyser struct since we now have what we need in the Table struct functions to compute stats.

Koeng101 commented 11 months ago

I am running some GoldenGate + transformations today, so I'll have to get to this PR a little later - will in next few days!

Thank you for being patient with us :)

Koeng101 commented 9 months ago

@cachemoi

Sorry, one more change: Could you add the feature to the changelog? Then I think ready for merge.

cachemoi commented 9 months ago

Sorry, one more change: Could you add the feature to the changelog? Then I think ready for merge.

I had to rebase to get the changelog on my branch, done in 58a511c