SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

gaps in core_alignment.fasta #65

Closed haruosuz closed 3 years ago

haruosuz commented 3 years ago

5931919.zip

Attached is the result of comparing four plasmid sequences (CP062120, U67194, AB231906, AM261282) using PIRATE.

CP062120 has a very large number of gaps in core_alignment.fasta: 94.09% of its 27471 characters (alphabet "-ACGNT") are gaps ("-"). Is this a bug?

Looking at the whole nucleotide sequences, CP062120 (197271 bp) is longer than the other three (AM261282, U67194, and AB231906, ranging from 46557 to 54383 bp), while GC content does not differ much between the four sequences (60.84 to 65.33%).
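For reference, a per-sequence gap percentage like the one quoted above can be computed with a few lines of Python. This is only a minimal sketch: the `read_fasta` and `gap_fraction` helpers are hypothetical names, and the two records below are toy data, not the real contents of core_alignment.fasta.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {record_id: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def gap_fraction(seq, gap_chars="-"):
    """Return the fraction of characters in seq that are gap characters."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    return sum(counts[c] for c in gap_chars) / len(seq)


# Toy alignment for illustration only (not the real data from the issue).
toy = """>CP062120
AC--------
>U67194
ACGTACGTAC
""".splitlines()

alignment = read_fasta(toy)
fractions = {name: gap_fraction(seq) for name, seq in alignment.items()}
for name, frac in fractions.items():
    print(f"{name}\t{frac:.2%}")
```

To run it on a real file, pass an open file handle to `read_fasta` instead of the toy lines, and use `gap_chars="-N"` if Ns should be counted as missing data too.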

SionBayliss commented 3 years ago

It looks like your plasmid sequences are quite divergent. Of your core genes (~25% of total genes), almost all show ~30-50% sequence divergence. Additionally, many of those genes also vary in length (min/max).

Ns/- are added to represent sequence/genes missing in individual isolates, e.g. if genomeA does not have a copy of gene1 then genomeA's portion of the gene1 alignment will be filled with Ns over its full length. Similarly, gap characters (-) will be added by MAFFT where there are alignment gaps between divergent or different-length sequences.
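The padding behaviour described above can be illustrated with a small Python sketch. This is not PIRATE's actual code: the function name, the data layout, and the choice of "-" as the fill character are assumptions made for illustration only.

```python
def concatenate_gene_alignments(gene_alignments, genomes, fill="-"):
    """Concatenate per-gene alignments into one alignment row per genome.

    gene_alignments: list of dicts {genome: aligned_sequence}; within one
    dict all sequences must share the same aligned length.
    A genome missing from a gene's dict receives a run of `fill` characters
    as long as that gene's alignment (assumed fill character).
    """
    concatenated = {genome: [] for genome in genomes}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one gene alignment must share a length")
        width = lengths.pop()
        for genome in genomes:
            # Pad genomes that lack this gene with filler of the same width.
            concatenated[genome].append(aln.get(genome, fill * width))
    return {genome: "".join(parts) for genome, parts in concatenated.items()}


# gene1: present in both genomes, with an alignment gap in genomeB.
gene1 = {"genomeA": "ATGAAA", "genomeB": "ATG---"}
# gene2: absent from genomeA, so genomeA is padded across its full width.
gene2 = {"genomeB": "ATGCCCTAA"}

core = concatenate_gene_alignments([gene1, gene2], ["genomeA", "genomeB"])
print(core["genomeA"])
print(core["genomeB"])
```

In this toy case genomeA ends up with nine filler characters out of fifteen columns purely because it lacks gene2, which is the same mechanism that can leave one divergent sequence mostly gaps in a concatenated alignment.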

I would suggest you curate the genes you think are useful to align, e.g. those with a similar copy number and length, before aligning these genes individually. Your dataset is sufficiently small that this sort of manual inspection/analysis would be warranted and efficient.
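One possible way to do that curation is sketched below in Python, assuming the per-genome gene lengths for each family have already been collected into a simple structure. The `families` layout, the `select_genes` helper, and the 1.2 length-ratio cut-off are all hypothetical choices, not PIRATE output formats or recommended values.

```python
def select_genes(families, n_genomes, max_length_ratio=1.2):
    """Return names of families that are single-copy in every genome and
    whose longest/shortest gene length ratio is within max_length_ratio."""
    keep = []
    for fam in families:
        lengths = fam["lengths"]  # {genome: [length of each copy]}
        single_copy = len(lengths) == n_genomes and all(
            len(copies) == 1 for copies in lengths.values()
        )
        if not single_copy:
            continue
        flat = [copies[0] for copies in lengths.values()]
        if min(flat) > 0 and max(flat) / min(flat) <= max_length_ratio:
            keep.append(fam["name"])
    return keep


# Toy gene families for four genomes A-D (illustrative values only).
families = [
    {"name": "g001", "lengths": {"A": [900], "B": [903], "C": [897], "D": [900]}},
    {"name": "g002", "lengths": {"A": [300], "B": [1200], "C": [310], "D": [305]}},
    {"name": "g003", "lengths": {"A": [600, 600], "B": [600], "C": [600], "D": [600]}},
    {"name": "g004", "lengths": {"A": [450], "B": [450], "C": [450]}},
]

selected = select_genes(families, n_genomes=4)
print(selected)
```

Here g002 is dropped for length variation, g003 for having two copies in one genome, and g004 for being absent from one genome; only the remaining families would then be aligned individually.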

I hope that helps.

All the best, Sion