Extracting core gene alignment

ramadatta commented 3 years ago

Hi @rfm-targa and authors,

Thank you very much for writing an amazing tool.

I have an external schema for Stenotrophomonas that I want to use for cgMLST analysis, and I would then like to generate a core gene alignment. Could the alignment be created from cgMLST_completegenomes/Presence_Absence.tsv? Is there an easy way to do this with chewBBACA? I would appreciate your advice. Thanks very much in advance!

Edit1: Hi. After a bit of reading, I think what I need is to generate a core gene alignment from cgMLST_completegenomes/cgMLST.tsv. Could you advise whether this approach is correct and help me generate the alignment, so that I do not need to reinvent the wheel?

ramirma commented 3 years ago

Dear @ramadatta,

Thank you for your inquiry and for your kind words. The current version of chewBBACA has no tool that does specifically what you ask, although this is definitely in our future plans. To obtain a core gene alignment, you would need to align the alleles at every locus, which you can find in the chewBBACA directory, and then concatenate the aligned alleles for each isolate.

Hope this clarified your question.

ramadatta commented 3 years ago

Thanks @ramirma for the swift reply. I see your point.

A) Just to clarify: the approach you describe seems to be a standalone alignment approach, that is, using an alignment tool such as BLAST to align the alleles in each locus FASTA file separately against the genomes, extracting the matched genes, and building an alignment from them. Is that right?

B) However, I would rather generate the core gene alignment from an allele call table like the one below (which, as I understand it, is what cgMLST_completegenomes/cgMLST.tsv contains). This is because the allele assignments from my standalone BLAST run and from chewBBACA may differ for a sample due to differences in parameters.

FILE    G1      G2      G3      G4      G5      G6
S1      1       INF-2   3       2       1       5
S2      1       1       1       1       NIPH    5
S3      1       2       3       4       1       3
S4      1       LNF     2       4       1       3
S5      1       2       ASM     2       1       3
S6      2       INF-8   3       PLOT3   PLOT5   3

Therefore, in this case, I would prefer to generate the core gene alignment from chewBBACA's allele assignment for each sample rather than from the results of a standalone tool. Please let me know whether my interpretation is correct. Thank you very much!

rfm-targa commented 3 years ago

Hello @ramadatta,

In your first comment you said that the schema you are using is an external schema. To use an external schema, you should start by running the PrepExternalSchema process to create a version of that schema that is compatible with chewBBACA. This step is necessary to ensure that the schema does not include invalid alleles. chewBBACA's allele calling algorithm requires complete coding sequences: sequences whose length is not a multiple of 3, that have invalid start or stop codons, that contain ambiguous bases, or that have internal stop codons are considered invalid alleles.
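To make the validity criteria above concrete, here is a minimal Python sketch of the same checks. It is not chewBBACA's own code: the function name `invalid_reason` is made up for illustration, and the start/stop codon sets assume the bacterial genetic code (translation table 11), which may not match the settings used for your schema.

```python
# Sketch of the validity criteria listed above; NOT chewBBACA's implementation.
# Assumption: start/stop codons follow bacterial translation table 11.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def invalid_reason(seq):
    """Return why `seq` is not a complete CDS, or None if it passes every check."""
    seq = seq.upper()
    if not seq or len(seq) % 3 != 0:
        return "length is not a multiple of 3"
    if set(seq) - set("ACGT"):
        return "contains ambiguous bases"
    # Split the sequence into consecutive codons.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS:
        return "invalid start codon"
    if codons[-1] not in STOP_CODONS:
        return "invalid stop codon"
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "internal stop codon"
    return None
```

Running this over an external schema's FASTA files only gives a rough idea of which alleles would be rejected; PrepExternalSchema remains the step that actually adapts the schema.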

After the schema adaptation process you can perform allele calling with the AlleleCall process to determine the allelic profiles of the samples of interest. If the external schema that you have adapted is a wgMLST schema you will need to run the ExtractCgMLST process to determine the set of loci/genes that constitute the core genome based on the set of samples that you have classified. If the schema you have adapted was already a cgMLST schema you can skip the cgMLST determination step.

The approach you suggest in point B) is the correct approach to generate the core-genome alignment. chewBBACA does not provide functions to generate MSAs for the core genes, but it includes MAFFT and Clustal as dependencies (used for MSA in the SchemaEvaluator process). You can use one of those dependencies to compute MSAs. You should start by creating new FASTA files with the alleles identified in your samples. For each column/gene in the cgMLST.tsv file, you will have to get the alleles' DNA sequences from the schema's files and write those sequences to a new FASTA file. The FASTA files that contain the alleles can be found in the schema's directory. Each FASTA file in the schema's directory corresponds to a gene and has all alleles for that gene.

If your cgMLST.tsv file has the following column:

FILE    G1
S1      1
S2      2
S3      1

And the G1.fasta file in the schema's directory has the following structure:

>G1_1
ATGAAA...
>G1_2
ATGTTT...
>G1_3
ATGGGG...

You should get the DNA sequence identified for each sample and generate a FASTA file with the following structure:

>S1_G1_1
ATGAAA...
>S2_G1_2
ATGTTT...
>S3_G1_1
ATGAAA...
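The step above can be sketched in Python as follows. This is only an illustration under several assumptions: the helper names (`read_fasta`, `allele_id`, `write_locus_fasta`) are hypothetical, the column name in cgMLST.tsv is assumed to match the FASTA file name (`G1` and `G1.fasta`), allele headers are assumed to follow the `>G1_1` pattern shown above, and the first column of the table is assumed to be named `FILE`. Calls without a usable allele number (for example LNF, ASM, NIPH or PLOT, as in the table earlier in this thread) get a header followed by an empty line, so every sample keeps a record.

```python
import csv
import os


def read_fasta(path):
    """Return {header: sequence} for a FASTA file (headers without '>')."""
    records, header = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:].split()[0]
                records[header] = ""
            elif header is not None:
                records[header] += line
    return records


def allele_id(call):
    """Map a call such as '3' or 'INF-2' to its allele number, else None."""
    call = call.strip()
    if call.startswith("INF-"):
        call = call[len("INF-"):]
    return call if call.isdigit() else None


def write_locus_fasta(profiles_tsv, schema_dir, locus, out_path):
    """Write one record per sample for `locus`; return samples with no sequence."""
    alleles = read_fasta(os.path.join(schema_dir, locus + ".fasta"))
    missing = []
    with open(profiles_tsv, newline="") as tsv, open(out_path, "w") as out:
        for row in csv.DictReader(tsv, delimiter="\t"):
            sample, call = row["FILE"], row[locus].strip()
            key = f"{locus}_{allele_id(call)}"
            if key in alleles:
                out.write(f">{sample}_{key}\n{alleles[key]}\n")
            else:
                # No allele sequence available: header plus an empty line.
                out.write(f">{sample}_{locus}_{call}\n\n")
                missing.append(sample)
    return missing
```

Calling `write_locus_fasta("cgMLST.tsv", "schema_dir", "G1", "G1_samples.fasta")` once per column would give one FASTA file per gene (the paths here are placeholders).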

After performing this step for all columns/genes in the cgMLST.tsv file, you can use MAFFT to compute an MSA for the sequences in each FASTA file. You can then concatenate all MSAs to get the core-genome alignment. When you create the FASTA files with the alleles in all samples, you can include a header followed by an empty line for the LNF, ASM, PLOT and NIPH classifications. For classifications like INF-2, remove the INF- prefix and get the allele with identifier 2.
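The concatenation step could look roughly like the sketch below. It assumes each gene has already been aligned separately (for example with something like `mafft --auto G1_samples.fasta > G1_aligned.fasta`) and that the aligned records have been loaded into a dictionary per gene. The function names are hypothetical, and padding samples that have no sequence for a gene with gap characters is a design choice of this sketch, not something chewBBACA does. Whether the aligner accepts records with empty sequences has not been checked here; they may need to be dropped before alignment and padded afterwards as shown.

```python
def sample_from_header(header, locus):
    """Recover the sample name from a header such as 'S1_G1_1' for locus 'G1'."""
    return header.rsplit(f"_{locus}_", 1)[0]


def concatenate_msas(msas, samples):
    """Concatenate per-locus alignments into one sequence per sample.

    msas: {locus: {sample: aligned_sequence}}
    samples: sample names, in the order wanted in the output
    Samples with no (or an empty) sequence for a locus are padded with '-'.
    """
    core = {sample: [] for sample in samples}
    for locus in sorted(msas):  # fixed locus order, identical for every sample
        aligned = msas[locus]
        width = max((len(seq) for seq in aligned.values()), default=0)
        for sample in samples:
            seq = aligned.get(sample, "")
            if seq and len(seq) != width:
                raise ValueError(f"{locus}: sequences are not aligned")
            core[sample].append(seq or "-" * width)
    return {sample: "".join(parts) for sample, parts in core.items()}
```

Because every sample receives the same number of columns for every gene, the concatenated sequences all have the same length and can be written out as a single FASTA alignment.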

Let us know if this is enough to clarify any doubts and help you start performing the analysis to get a core-genome alignment.

Best regards,

rfm

ramadatta commented 3 years ago

Hi @rfm-targa. Thank you so much for the comprehensive answer. Your post clarifies pretty much all my questions. Let me read a bit more about the LNF, ASM, PLOT and NIPH classifications and start working towards generating a core-genome alignment. If there is nothing else, you can close this issue. Thanks so much again.

B-UMMI / chewBBACA_tutorial
