Error at alignment - Githubissues

ryobon-dev commented 2 years ago

Hello,

While running PanACoTA, I got the following error at the alignment stage. Could you please let me know how I can resolve this issue?

[2022-05-14 16:13:19] : INFO Starting alignment of all families: protein alignment, back-translation to nucleotides, and add missing genomes in the family
Alignment: █ 182/4003 ( 4%) - Elapsed Time: 0:00:01 -
[2022-05-14 16:13:20] : ERROR fam 894: no file with genes extracted ('align/Align-MAH188/MAH188-current.894.gen'). Cannot align.
Alignment: ████████ 1452/4003 ( 36%) - Elapsed Time: 0:00:08 -
[2022-05-14 16:13:28] : ERROR fam 9205: no file with genes extracted ('align/Align-MAH188/MAH188-current.9205.gen'). Cannot align.
Alignment: ███████████████ 2583/4003 ( 64%) - Elapsed Time: 0:00:15 -
[2022-05-14 16:13:34] : ERROR fam 15377: no file with genes extracted ('align/Align-MAH188/MAH188-current.15377.gen'). Cannot align.
Alignment: █████████████████ 2919/4003 ( 72%) - Elapsed Time: 0:00:17 -
[2022-05-14 16:13:36] : ERROR fam 16627: no file with genes extracted ('align/Align-MAH188/MAH188-current.16627.gen'). Cannot align.

asetGem commented 2 years ago

Hello! It seems that no gene was extracted for these families. Could you tell me:

- how many members you have in each of these core families (894, 9205, 15377 and 16627)?
- whether the .log.err file contains any errors or warnings before the "ERROR fam 894: no file..." line?

JasmineGamblin commented 1 year ago

Hello, I have the same problem while aligning gene families of a Salmonella enterica dataset (I had no problem with an E. coli dataset just before):

[2023-09-11 14:40:13] :: INFO :: PanACoTA version 1.4.0
[2023-09-11 14:40:13] :: INFO :: Command used PanACoTA align -c PersGenome_PanGenome-SAEN900.All.prt-clust-0.8-mode1-th5.lst-all_0.99-multi.lst -l LSTINFO-LSTINFO-NA-filtered-0.0001_0.06.lst -n SAEN900 -d . -o . --threads 5
[2023-09-11 14:40:13] :: INFO :: Found 900 genomes.
[2023-09-11 14:40:13] :: INFO :: Reading PersGenome and constructing lists of missing genomes in each family.
[2023-09-11 14:40:13] :: INFO :: Getting all persistent proteins and classify by strain.
[2023-09-11 14:40:23] :: INFO :: Extracting proteins and genes from all genomes
[2023-09-11 14:42:01] :: INFO :: Starting alignment of all families: protein alignment, back-translation to nucleotides, and add missing genomes in the family
[2023-09-11 15:04:59] :: ERROR :: fam 3884: no file with genes extracted ('./Align-SAEN900/SAEN900-current.3884.gen'). Cannot align.
[2023-09-11 16:26:55] :: ERROR :: fam 37492: no file with genes extracted ('./Align-SAEN900/SAEN900-current.37492.gen'). Cannot align.

Family 3884 has 2693 genes (so almost 3 copies in each genome) and family 37492 has 1791 genes. Family 3884 is the largest family in the dataset, but family 37492 is not the second largest. The .log.err file only contains these two errors. I've looked at some of the sequences and they are quite short (~200b). I'm not sure what else to look for and would gladly welcome some help.

aperrin commented 1 year ago

Hi,

That is because when a family has several genes in the same genome, this genome is not taken into account for the alignment (we want to avoid taking the paralog instead of the ortholog and hence drawing wrong conclusions). So, if family 3884 has more than 1 gene in every genome, then it does not contain any sequence to align, which is why you get 'no file with genes extracted'. It is not a question of which family is the largest, but of redundancy within genomes.
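To see in advance which families would end up with nothing to align, something like the following Python sketch could help. It rests on two assumptions that are not confirmed in this thread: that each line of the persistent genome file is a family number followed by whitespace-separated member names, and that the genome name is the first three dot-separated fields of a member name. The `genome_of` helper and the example member names are hypothetical, so adapt them to your own naming scheme.

```python
from collections import Counter


def genome_of(member: str) -> str:
    # Assumption: member names look like "GENO.0123.00001.0001i_00042",
    # where the first three dot-separated fields identify the genome.
    return ".".join(member.split(".")[:3])


def families_without_single_copy(lines):
    """Return the ids of families in which no genome has exactly one member."""
    flagged = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fam_id, members = fields[0], fields[1:]
        copies = Counter(genome_of(m) for m in members)
        # A genome contributes a sequence only if it has exactly one member.
        if not any(n == 1 for n in copies.values()):
            flagged.append(fam_id)
    return flagged


# Hypothetical example: family 1 is single-copy in both genomes,
# family 2 has two copies in each genome.
example = [
    "1 GENO.0123.00001.0001i_00010 GENO.0123.00002.0001i_00020",
    "2 GENO.0123.00001.0001i_00011 GENO.0123.00001.0002i_00300 "
    "GENO.0123.00002.0001i_00021 GENO.0123.00002.0001i_00022",
]
print(families_without_single_copy(example))  # prints ['2']
```

On a real file, you would pass the open file object (or its lines) instead of `example`.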

Families like these exist because you used the 'multi' persistent genome, which allows multiple copies in all genomes. If your goal is to infer a phylogenetic tree, I would advise you to use the mixed persistent genome instead, which ensures a unique member in at least 99% (in your case) of your genomes.

If you still want to use this multi persistent genome, remove these 2 families from the persistent genome file so that they are ignored for the tree. However, your phylogenetic tree might not be very precise, as you will have a lot of '-' in the alignment: for a family, if a genome has no member or several members, its sequence is replaced by a run of '-' of the same length as the alignment of the existing sequences (see https://aperrin.pages.pasteur.fr/pipeline_annotation/html-doc/usage.html#align-genome-folder for more details).
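If you go the route of removing families, a small script is probably safer than editing the file by hand. The sketch below is only an illustration, assuming that each line of the persistent genome file starts with the family number followed by whitespace; the function names and the file names in the usage comment are hypothetical, not part of PanACoTA.

```python
from pathlib import Path


def drop_families(lines, to_drop):
    """Return the lines whose leading family number is not in to_drop."""
    to_drop = {str(f) for f in to_drop}
    kept = []
    for line in lines:
        fields = line.split()
        # Assumption: the first whitespace-separated field is the family number.
        if fields and fields[0] in to_drop:
            continue
        kept.append(line)
    return kept


def filter_persistent_file(src, dst, to_drop):
    """Write a copy of src without the given families; src is left untouched."""
    lines = Path(src).read_text().splitlines(keepends=True)
    Path(dst).write_text("".join(drop_families(lines, to_drop)))


# Hypothetical usage, with the two families reported in the log above:
# filter_persistent_file("PersGenome-multi.lst", "PersGenome-multi-filtered.lst",
#                        [3884, 37492])
```

Writing to a new file rather than overwriting keeps the original persistent genome available if you later want to compare results.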

Hope this helps. Don't hesitate to ask if anything is unclear.

JasmineGamblin commented 1 year ago

Thank you very much for your quick and clear answer! I think I will switch to the mixed persistent option then :)

gem-pasteur / PanACoTA

Error at alignment #34