Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
213 stars 38 forks source link

Request: use updated codon usage tables #62

Open eggrandio opened 2 years ago

eggrandio commented 2 years ago

Hello,

I have a request that should enhance the utility of DNA Chisel for codon optimization.

When selecting Codon Usage Tables for species not included in DNA Chisel, Kazusa Codon tables are used. These codon tables are extremely outdated, generated from Genbank data only up to 2007. For some species, there is only one CDS used, and some codons have a 0 usage (see here for examples).

I suggest that more up-to-date codon tables should be included in DNA chisel. For example, CoCoPUTs is an up-to-date database with codon usage tables generated with each new Genbank or Refseq release every three months. I guess it should not be hard to retrieve the usage tables for each species (there is a huge file with the data, but I assume they could be retrieved "on the fly" too).

I am in no way associated with the people that run CoCoPUTs, but I think it is the most up-to-date resource for codon usage tables.

Zulko commented 2 years ago

Good points thanks for the report. When you use a TaxID, Chisel will use the codon-usage-tables library to get that dictionnary, so I guess the fix should happen at the level of this other library. From what I remember I chose Kazusa because it was easy to query and extensive. Do you know if there is any way to quickly get a codon table from CoCoPUTs without downloading the 19Gb files they have on their website?

In case this helps unblock you, Chisel's codon optimizers also accept arbitrary codon table as a dictionnary (see here), so you could download CoCoputs, extract the table you want, and feed it to the constraint.

eggrandio commented 2 years ago

Thanks for the quick reply!

I have only used the CoCoPUTs database briefly (actually, last time I used it, I downloaded a codon usage table to use as a "custom dictionary" with DNA chisel, as you suggested). It really looks like there is no direct way of downloading the Codon Usage Tables (and much less a way to automatically query their database). I guess that would require talking directly to the person that manages it...

An alternative would be to generate .csv from the huge 19 GB file but that would require storing them somewhere and also updating them frequently.

I do not see a straightforward implementation of this, but keep it mind for future updates, as I think having updated codon usage tables would really enhance the use of DNA chisel!

Best,

gjerman commented 2 years ago

Here is a direct way to extract codon usage data from a specific taxid from the CoCoPUTs database:

Changing the taxid parameter in the below url, will return codon usage data in a simple easily parsable text format, along with statistics on how mange individual codon and CDS was used to calculate the table.

https://hive.biochemistry.gwu.edu/dna23.cgi?cmd=ionTaxidCollapseExt&svcType=svc-codon-usage&fileSource=Refseq_species.tsv&taxid=1423&filterInColName=[%22Organelle%22]&filterIn=[%22genomic%22]&searchDeep=true&raw=1

id,value
taxid,1423
collapse,986
"#codon",1183497316
"#CDS",4075841
"GC%",44.42
"GC1%",52.37
"GC2%",35.94
"GC3%",44.96
TTT,36168823
TTC,16952176
TTA,22452085
TTG,18098366
CTT,27357279
CTC,13017039
CTA,5611466
CTG,27894851
ATT,43533789
ATC,32210689
ATA,10977071
ATG,31999676
GTT,22371988
GTC,20859738
GTA,15636787
GTG,21085945
TAT,26654623
TAC,14288712
TAA,2630156
TAG,610566
CAT,17949902
CAC,8839816
CAA,23027307
CAG,22235923
AAT,25940596
AAC,20299079
AAA,58451506
AAG,24769745
GAT,38706860
GAC,22214128
GAA,58090962
GAG,27236588
TCT,14970800
TCC,9552554
TCA,17384426
TCG,7555205
CCT,12218495
CCC,3886203
CCA,8000446
CCG,19206435
ACT,9990445
ACC,10335004
ACA,26082875
ACG,17628114
GCT,22178501
GCC,19117553
GCA,25193343
GCG,24502320
TGT,4232670
TGC,5147835
TGA,1071397
TGG,12100431
CGT,8873562
CGC,10190764
CGA,4738179
CGG,7797321
AGT,7690511
AGC,16722825
AGA,12657475
AGG,4551562
GGT,14885244
GGC,27995359
GGA,25656310
GGG,13208915
veghp commented 2 years ago

Thank you @gjerman , that is useful. Then to obtain a species table, it just needs adding the translation column and calculating the % to get a format as in here: https://github.com/Edinburgh-Genome-Foundry/codon-usage-tables/blob/master/codon_usage_data/tables/h_sapiens_9606.csv

Is it possible to retrieve the % instead of counts?

As a related note, @Zulko do the frequencies need to add up exactly to 1? For example the stop codon in the link adds up to 0.30 + 0.24 + 0.47 = 1.01.

gjerman commented 2 years ago

This is not a documented api endpoint, but the endpoint they use in their own webapp.

There are several url parameters that can be set in the url, but i have not found one the will return the percentage directly.

Here is the original url that i simplified by removing parameters that did not obviously change the results.

https://hive.biochemistry.gwu.edu/dna23.cgi?cmd=ionTaxidCollapseExt&svcType=svc-codon-usage&objId=586358&fileSource=Refseq_species.tsv&plen=3&taxid=1423&filterInColName=[%22Organelle%22]&filterIn=[%22genomic%22]&searchDeep=true&raw=1&raw=1&bust=1631102057812

veghp commented 2 years ago

Thanks for the details; if I search for a species with the web interface, there is an option to return a table, and that has percentages as well. But maybe those are calculated from the count numbers.

Zulko commented 2 years ago

@Zulko do the frequencies need to add up exactly to 1? For example the stop codon in the link adds up to 0.30 + 0.24 + 0.47 = 1.01

@veghp No need for the frequencies to add up exactly to one in DNA Chisel, it only cares about which codon is the most frequent (for best-codon optimization) or the difference between a given codon variant's frequency in the sequence and in the table.

Great that there is a way to query CoCoPUTs, would be nice to have a choice between kasuza and cocoputs in the codons_usage_table repo (not sure if they cover the exact same species).

eggrandio commented 2 years ago

This is not a documented api endpoint, but the endpoint they use in their own webapp.

There are several url parameters that can be set in the url, but i have not found one the will return the percentage directly.

Here is the original url that i simplified by removing parameters that did not obviously change the results.

https://hive.biochemistry.gwu.edu/dna23.cgi?cmd=ionTaxidCollapseExt&svcType=svc-codon-usage&objId=586358&fileSource=Refseq_species.tsv&plen=3&taxid=1423&filterInColName=[%22Organelle%22]&filterIn=[%22genomic%22]&searchDeep=true&raw=1&raw=1&bust=1631102057812

  • objId=586358 and bust=1631102057812 seems to be for tracking purposes and changes with each session in the webapp
  • raw=1 sets output to text (raw=1&raw=1 in the original url must be a small bug in their app)
  • searchDeep=true returns a higher number of CDS and codon, than setting searchDeep=false
  • plen=3 i am not sure of the significance of but the results are the same if i change it.
  • filterInColName=[%22Organelle%22]&filterIn=[%22genomic%22] allow for filtering of specific organels like mitocondria, but i think genomic is the right option.
  • cmd=ionTaxidCollapseExt&svcType=svc-codon-usage are needed to produce any results

I am trying to obtain from python the codon tables but I am getting this error:

url = "https://hive.biochemistry.gwu.edu/dna23.cgi?cmd=ionTaxidCollapseExt&svcType=svc-codon-usage&fileSource=Refseq_species.tsv&taxid=71240&filterInColName=%5B%22Organelle%22%5D&filterIn=%5B%22genomic%22%5D&searchDeep=true&raw=1"
file = urllib.request.urlopen(url)

test = file.read()

print(test)

b'error:taxonomy database is missing\n'

I think it has something to do with cookies as I was getting the same message if I tried to access that url from a computer for the first time without having visited the CoCoPuTs webpage first. Once I visited I could access any codon table from my browser by changing the Taxonomy ID.

Is there a way of doing the same form within python? I have seen that mechanize can mimic browsing a page, but I am quite new to python so I wouldn't know how to do it...

veghp commented 2 years ago

I can confirm your issue. This is the best answer I could find: https://stackoverflow.com/questions/19098518/how-to-download-a-file-perl-cgi-backend-using-python-requests?noredirect=1&lq=1

A better solution seems to be to download the whole dataset instead, and subset that for the query parameters.

gjerman commented 2 years ago

This is a issues with the session cookie. Here is an improved example where i fetch a session cookie and subsequently use that when fetching the codon data.

import requests

def fetch_codon_table(taxid: int) -> str:
    url_for_cookie = "https://hive.biochemistry.gwu.edu/dna.cgi?cmd=login&login=hive.reviewer&pswd=Review123"
    url_for_codon_table = f'https://hive.biochemistry.gwu.edu/dna.cgi?cmd=ionTaxidCollapseExt&svcType=svc-codon-usage&objId=586358&fileSource=Refseq_species.tsv&filterInColName=["Organelle"]&filterIn=["genomic"]&searchDeep=true&raw=1&taxid={taxid}'
    with requests.Session() as session:
        try:
            session.get(url_for_cookie)
            res = session.get(url_for_codon_table)
        except Exception as e:
            raise e
    return res.text

if __name__ == "__main__":

    print(fetch_codon_table(1423))
veghp commented 2 years ago

That works for me, thanks

pvgb commented 2 years ago

A quick question on codon_usage_table specification within DNAChisel. In the documentation you mention that we should be giving the RSCU table (relative usage of each codon) rather than straight frequencies (that add up to 1 for each aa).

How does this affect the AvoidRareCodons specification? Does that directly use the frequencies specified? If yes, then we probably shouldnt be providing the RCSU table but a straight codon frequency table right?

Thanks

veghp commented 2 years ago

The documentation you refer to is: https://edinburgh-genome-foundry.github.io/DnaChisel/ref/builtin_specifications.html?highlight=codon_usage_table#avoidrarecodons

DNA Chisel uses this repository for codon tables: https://github.com/Edinburgh-Genome-Foundry/codon-usage-tables

An example table: https://github.com/Edinburgh-Genome-Foundry/codon-usage-tables/blob/master/codon_usage_data/tables/e_coli_316407.csv

As you can see, frequencies add up to 1, separately for each codon.

You are right, we need codon frequencies instead of RSCU. The relative synonymous codon usage (RSCU) is not what its name suggests. Formula and an actual table.

Thanks, this will be removed from the documentation in the next release.