B-UMMI / chewBBACA

BSR-Based Allele Calling Algorithm
GNU General Public License v3.0

Can I use or add a different translation table for the genetic code (working with parasites)? #200

Open azmigueldario opened 5 months ago

azmigueldario commented 5 months ago

I want to use the most appropriate translation table for the parasite I am working on, but it is not supported. I could not find where the functions receive the table, so I could not modify the code myself.

Would it be possible and simple to add custom translation tables?

Great tool by the way.

rfm-targa commented 5 months ago

Greetings @azmigueldario,

Thank you for your interest. Most modules in the latest version, v3.3.6, allow users to select or autodetect the genetic code. The CreateSchema and AlleleCall modules allow you to specify the genetic code through the --t, --translation-table parameter. chewBBACA should support all genetic codes listed here. If it is not listed there, I suggest going with The Standard Code (1) or the one that might lead to the closest results.

Kind regards,

Rafael

azmigueldario commented 5 months ago

Thank you for the quick reply.

I will use the standard code then (#1). I saw that the accepted tables are restricted to the few most commonly used ones, which are related to bacterial pathogens; see the error output below.

I believe the table is passed to a function imported from Bio.Seq to translate into protein space. It is not vital for me, but it may be worth removing the restriction just in case, or adding a warning if people end up using an unusual translation table.

Thanks again, Miguel

Authors: Rafael Mamede, Pedro Cerqueira, Mickael Silva, João Carriço, Mário Ramirez
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==================================
  chewBBACA - PrepExternalSchema
==================================
Started at: 2024-06-27T17:06:30

Invalid genetic code value.
Value must correspond to one of the accepted genetic codes

Accepted genetic codes:

        1: Standard
        4: The mold, protozoan, and coelenterate mitochondrial code and the mycoplasma/spiroplasma code
        11: The Bacterial, Archaeal and Plant Plastid code
        25: Candidate division SR1 and gracilibacteria code

rfm-targa commented 5 months ago

Hello @azmigueldario,

I must be playing Jedi mind tricks on myself since I forgot about that step to validate the genetic code. It should accept more than those four genetic codes. I will add more genetic codes to the dictionary with the accepted values so that it still validates the value passed. This does not guarantee that it will work for any organism; it still depends on Pyrodigal/Prodigal, which was designed for Bacteria and Archaea. What is the genetic code that you would like to use?
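The validation step described above could look roughly like the sketch below. This is not chewBBACA's actual code: the names `ACCEPTED_GENETIC_CODES` and `validate_genetic_code` are hypothetical, and the set of accepted codes is only an illustration (the four codes from the error output plus code 6).

```python
# Hypothetical dictionary of accepted genetic codes (illustrative subset only).
# Keys are NCBI translation table numbers; values are short descriptions.
ACCEPTED_GENETIC_CODES = {
    1: "Standard",
    4: "Mold, protozoan, and coelenterate mitochondrial; mycoplasma/spiroplasma",
    6: "Ciliate, dasycladacean and hexamita nuclear",
    11: "Bacterial, archaeal and plant plastid",
    25: "Candidate division SR1 and gracilibacteria",
}


def validate_genetic_code(value):
    """Return the genetic code as an int, or raise ValueError if not accepted."""
    try:
        code = int(value)
    except (TypeError, ValueError):
        raise ValueError(f"Invalid genetic code value: {value!r}") from None
    if code not in ACCEPTED_GENETIC_CODES:
        accepted = "\n".join(
            f"\t{number}: {name}"
            for number, name in sorted(ACCEPTED_GENETIC_CODES.items())
        )
        raise ValueError(
            f"Invalid genetic code value: {value!r}\n"
            f"Accepted genetic codes:\n{accepted}"
        )
    return code


print(validate_genetic_code("6"))
```

Keeping the dictionary as the single source of truth means that supporting another code is a one-line change, and the error message always lists exactly what is accepted.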

Best regards,

Rafael

azmigueldario commented 5 months ago

Hi Rafael, thank you for your reply.

I am interested in code number 6, and you are right that it may not work properly. I am working with Giardia, which seems to have a genome structure somewhat similar to that of bacteria, so I hope it works.
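For context, genetic code 6 differs from the standard code in that TAA and TAG encode glutamine (Q) instead of acting as stop codons. The sketch below illustrates this with a small pure-Python translator; it is not how chewBBACA translates sequences, the example sequence is made up, and the helper handles only unambiguous A/C/G/T bases.

```python
from itertools import product

# Codons in NCBI order (bases T, C, A, G; third position varies fastest).
BASES = "TCAG"
CODONS = ["".join(codon) for codon in product(BASES, repeat=3)]

# Amino acids of the standard code (table 1), aligned with CODONS.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE_1 = dict(zip(CODONS, STANDARD_AAS))

# Table 6 reassigns TAA and TAG from stop to glutamine; TGA remains a stop.
TABLE_6 = {**TABLE_1, "TAA": "Q", "TAG": "Q"}


def translate(dna, table):
    """Translate a DNA string codon by codon, ignoring trailing partial codons."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


seq = "ATGGCTTAAGGATAGTGA"  # made-up example sequence
print(translate(seq, TABLE_1))  # MA*G**
print(translate(seq, TABLE_6))  # MAQGQ*
```

Under the standard code the reading frame above ends at the first TAA, whereas under code 6 it continues until TGA, so the choice of table changes where predicted coding sequences end.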

-- Miguel D. Prieto-Gaez, MD, MSc (He, Him)

rfm-targa commented 5 months ago

Hello @azmigueldario,

We released chewBBACA v3.3.8, which adds support for the remaining genetic codes supported by Prodigal (complete list here), including genetic code 6. I tested the new options with Giardia genomes available on the NCBI. I downloaded the reference genome for Giardia intestinalis (GCF_000002435.2) and created Prodigal training files based on that genome and genetic codes 1 and 6. I used the following commands:

Genetic code 1:

prodigal -i GCF_000002435.2_UU_WB_2.1_genomic.fna -t giardia_gc1.trn -p single -g 1

Genetic code 6:

prodigal -i GCF_000002435.2_UU_WB_2.1_genomic.fna -t giardia_gc6.trn -p single -g 6
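If more genetic codes need to be compared, the two commands above can be generated in a loop. The sketch below only builds and prints the command lines, using the same Prodigal flags as above; the helper name `prodigal_training_command` is hypothetical, and actually running the commands assumes `prodigal` is installed and on the PATH.

```python
import shlex


def prodigal_training_command(genome_fasta, training_file, genetic_code):
    """Build the Prodigal command that writes a training file for one genetic code."""
    return [
        "prodigal",
        "-i", genome_fasta,        # input genome in FASTA format
        "-t", training_file,       # training file to write
        "-p", "single",            # single-genome mode
        "-g", str(genetic_code),   # translation table
    ]


genome = "GCF_000002435.2_UU_WB_2.1_genomic.fna"
for code in (1, 6):
    cmd = prodigal_training_command(genome, f"giardia_gc{code}.trn", code)
    print(shlex.join(cmd))
    # To execute instead of printing (requires prodigal on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```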

I then used the reference genome and the training files to create a schema for each genetic code with the CreateSchema module. After that, I downloaded all the Giardia genomes (n=38, 36 Giardia intestinalis, 1 Giardia muris, 1 Giardia lamblia) from the NCBI and performed allele calling with the AlleleCall module to identify new alleles to add to the schemas. Here are the total numbers of loci and alleles after allele calling:

| Schema         | #Loci | #Alleles |
|----------------|-------|----------|
| Genetic code 1 | 4,881 | 76,096   |
| Genetic code 6 | 4,722 | 51,512   |

To get an idea of how many loci Prodigal might be predicting well, I used the UniprotFinder module to compare the schema loci against the Giardia reference proteomes available on UniProt (n=3, UP000001548, UP000315496, UP000000350). It found annotations for loci in both schemas. Still, the loci in the schema created with genetic code 1 were more similar to what's in the reference proteome for Giardia intestinalis (it found proteome annotations for 4,594 loci in the schema created with genetic code 1 and for 3,540 loci in the schema created with genetic code 6).

The Giardia muris genome seems to differ considerably from the reference genome for Giardia intestinalis, so the schemas could not classify most CDSs predicted for Giardia muris. Several Giardia intestinalis genomes available on the NCBI seem to be of low quality (e.g. highly fragmented or scaffolded), which can lead to high numbers of missing/non-identified loci for those genomes and a small core genome if you determine the core loci from results including those genomes.

This was only a test of whether it ran without errors and whether it might work, which it might, at least to some extent. Let us know how it goes. I hope it works!
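To put the reported counts side by side, the short calculation below derives the share of loci with a proteome annotation and the mean number of alleles per locus for each schema. It uses only the numbers reported in this comment; the variable names are illustrative.

```python
# Counts reported above for each schema.
schemas = {
    "Genetic code 1": {"loci": 4881, "alleles": 76096, "annotated": 4594},
    "Genetic code 6": {"loci": 4722, "alleles": 51512, "annotated": 3540},
}

summary = {}
for name, counts in schemas.items():
    annotated_pct = 100 * counts["annotated"] / counts["loci"]
    alleles_per_locus = counts["alleles"] / counts["loci"]
    summary[name] = (round(annotated_pct, 1), round(alleles_per_locus, 2))
    print(f"{name}: {annotated_pct:.1f}% loci annotated, "
          f"{alleles_per_locus:.2f} alleles per locus")
```

This gives roughly 94.1% annotated loci for genetic code 1 against 75.0% for genetic code 6, which matches the observation that the code 1 schema is closer to the reference proteome.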

Best regards,

Rafael

azmigueldario commented 5 months ago

Thank you very much @rfm-targa for adding the table and taking the time to look into the functionality for Giardia.

I will likely stay with the standard table or run both to compare. Thank you very much for all the help.