gem-pasteur / PanACoTA

PANgenome with Annotations, COre identification, Tree and corresponding Alignments
GNU Affero General Public License v3.0

Error at alignment #34

Closed ryobon-dev closed 12 months ago

ryobon-dev commented 2 years ago

Hello,

While running PanACoTA, I got the following error at the alignment stage. Could you please let me know how I can resolve this issue?

asetGem commented 2 years ago

Hello! It seems that for these families, no gene was extracted. Could you tell me:

JasmineGamblin commented 1 year ago

Hello, I have the same problem while aligning gene families of a Salmonella enterica dataset (I had no problem with an E. coli dataset just before):

Family 3884 has 2693 genes (so almost 3 copies in each genome) and family 37492 has 1791 genes. Family 3884 is the largest family in the dataset, but family 37492 is not the second largest. The .log.err file only contains these two errors. I've looked at some of the sequences and they are quite short (~200b). I'm not sure what else to look for and would gladly welcome some help.

aperrin commented 1 year ago

Hi,

That is because when a family has several genes in the same genome, that genome is not taken into account for the alignment (we want to avoid picking a paralog instead of the ortholog, and hence drawing wrong conclusions). So, if family 3884 has more than 1 gene in every genome, it does not contain any sequence to align, which is why you get 'no file with genes extracted'. It is not a question of which family is the largest, but of redundancy within genomes.
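To make the rule above concrete, here is a minimal sketch (not PanACoTA's actual code) that flags families with nothing left to align. The function names, family ids, and gene names are all hypothetical, and the input is assumed to be already parsed into (genome, gene) pairs per family.

```python
from collections import Counter

def single_copy_genomes(members):
    """Return the genomes that have exactly one gene in this family.

    members: list of (genome, gene) pairs for one family (assumed layout).
    """
    counts = Counter(genome for genome, _gene in members)
    return sorted(genome for genome, n in counts.items() if n == 1)

def families_with_nothing_to_align(families):
    """Return ids of families where every genome has several members."""
    return sorted(fam for fam, members in families.items()
                  if not single_copy_genomes(members))

# Hypothetical toy data: famC has two copies in every genome it touches.
families = {
    "famA": [("g1", "g1_0001"), ("g2", "g2_0001"), ("g3", "g3_0001")],
    "famB": [("g1", "g1_0002"), ("g1", "g1_0003"), ("g2", "g2_0002")],
    "famC": [("g1", "g1_0004"), ("g1", "g1_0005"),
             ("g2", "g2_0003"), ("g2", "g2_0004")],
}

print(families_with_nothing_to_align(families))
```

In this toy example only famC is reported: famB still has one usable sequence (from g2), whereas famC has none, which is the situation described for family 3884.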

Families of this kind exist because you used the 'multi' persistent genome, which allows multiple copies in all genomes. If your goal is to infer a phylogenetic tree, I would advise you to use the mixed persistent genome instead, which ensures a unique member in at least 99% (in your case) of your genomes.
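As an illustration of the "unique member in at least X% of genomes" criterion, the check could be sketched as below. This is an assumption-laden sketch, not the tool's implementation: `is_mixed_persistent` and its `tol` parameter are hypothetical names, and the exact rounding PanACoTA applies to the threshold is not reproduced here.

```python
from collections import Counter

def is_mixed_persistent(member_genomes, all_genomes, tol=0.99):
    """Check whether a family has exactly one member in at least
    tol * N of the N genomes (hypothetical helper).

    member_genomes: one entry per gene of the family, naming its genome.
    all_genomes: every genome in the dataset.
    """
    counts = Counter(member_genomes)
    unique = sum(1 for genome in all_genomes if counts[genome] == 1)
    return unique >= tol * len(all_genomes)

genomes = ["g1", "g2", "g3", "g4"]

# One copy everywhere except g4 (two copies): 3 of 4 genomes are unique.
print(is_mixed_persistent(["g1", "g2", "g3", "g4", "g4"], genomes, tol=0.75))

# Two copies in g1 and g2: only 2 of 4 genomes are unique.
print(is_mixed_persistent(["g1", "g1", "g2", "g2", "g3", "g4"], genomes, tol=0.75))
```

A family that is multi-copy in every genome, like family 3884, has zero unique genomes and can never pass such a check, so it would not end up in a mixed persistent genome.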

If you still want to use this multi persistent genome, remove these 2 families from the persistent genome file so that they are ignored for the tree. However, your phylogenetic tree might not be very precise, as you will have a lot of '-' in the alignment: for a given family, if a genome has no member or several members, its sequence is replaced by a run of '-' as long as the alignment of the existing sequences (see https://aperrin.pages.pasteur.fr/pipeline_annotation/html-doc/usage.html#align-genome-folder for more details).
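The padding described above can be pictured with the following sketch. It is only an illustration under stated assumptions (a dict of already-aligned sequences for the single-copy genomes; `pad_alignment` and `gap_fraction` are hypothetical helpers), not the code PanACoTA runs.

```python
def pad_alignment(aligned, all_genomes):
    """Add an all-gap row for every genome missing from the alignment.

    aligned: dict genome -> aligned sequence, for genomes with exactly
             one member in the family (all sequences of equal length).
    all_genomes: every genome in the dataset.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        # Empty input (nothing to align) or ragged rows.
        raise ValueError("expected a non-empty alignment of equal-length rows")
    width = lengths.pop()
    return {genome: aligned.get(genome, "-" * width) for genome in all_genomes}

def gap_fraction(rows):
    """Fraction of '-' characters over all rows of the alignment."""
    total = sum(len(seq) for seq in rows.values())
    gaps = sum(seq.count("-") for seq in rows.values())
    return gaps / total

# Toy data: g3 has zero or several members, so it is absent from `aligned`.
aligned = {"g1": "ATG-CA", "g2": "ATGTCA"}
padded = pad_alignment(aligned, ["g1", "g2", "g3"])

print(padded["g3"])
print(gap_fraction(padded))
```

With many multi-copy genomes per family, most rows become pure gaps and the gap fraction climbs quickly, which is why a tree built from the multi persistent genome may be imprecise.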

Hope this helps. Don't hesitate to ask if something is not clear.

JasmineGamblin commented 1 year ago

Thank you very much for your quick and clear answer! I think I will switch to the mixed persistent option then :)