Closed: haruosuz closed this issue 3 years ago
It looks like your plasmid sequences are quite divergent: of your core genes (~25% of total genes), almost all show ~30-50% sequence divergence. Additionally, many of those genes also show min/max length variation.
Ns/- are added to represent sequences/genes missing in individual isolates, e.g. if genomeA does not have a copy of gene1, then gene1 will be represented by Ns across the full length of its alignment for genomeA. Similarly, Ns will be added by MAFFT where there are alignment gaps between divergent or different-length sequences.
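To make the padding idea concrete, here is a minimal Python sketch of how a concatenated alignment could end up mostly filler for one genome. This is only an illustration and not PIRATE's actual code: the function name `concatenate_alignments`, the input layout (gene name mapped to a dict of genome name to aligned sequence), and the choice of "N" as the filler character are all assumptions made for the example.

```python
def concatenate_alignments(gene_alignments, genomes, missing_char="N"):
    """Concatenate per-gene alignments, padding genomes that lack a gene.

    gene_alignments: {gene: {genome: aligned_sequence}} (hypothetical layout).
    genomes: every genome that should appear in the output.
    """
    concatenated = {genome: [] for genome in genomes}
    for gene, aligned in gene_alignments.items():
        # All rows of one gene alignment must share the same length.
        lengths = {len(seq) for seq in aligned.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: rows are missing or differ in length")
        length = lengths.pop()
        for genome in genomes:
            # A genome without this gene gets a full-length block of filler.
            concatenated[genome].append(aligned.get(genome, missing_char * length))
    return {genome: "".join(parts) for genome, parts in concatenated.items()}


# genomeA has no copy of gene1, so its gene1 block is all filler.
example = {
    "gene1": {"genomeB": "ACGT", "genomeC": "AC-T"},
    "gene2": {"genomeA": "GG", "genomeB": "GG", "genomeC": "GA"},
}
core = concatenate_alignments(example, ["genomeA", "genomeB", "genomeC"])
print(core["genomeA"])
```

A genome that is missing many genes, or whose copies align poorly, accumulates filler and gap columns in the same way.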
I would suggest you curate the genes you think are useful to align, e.g. those that have a similar copy number and length, before aligning these genes individually. Your dataset is sufficiently small that this sort of manual inspection/analysis would be warranted and efficient.
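If it helps, the curation step could be sketched roughly like this in Python. Everything here is hypothetical: the function name `select_genes`, the input layout (gene name mapped to a dict of genome name to a list of copy lengths in bp), and the 0.9 length-ratio threshold are assumptions for illustration, not anything PIRATE produces or recommends. You would have to build the input from your own tables.

```python
def select_genes(gene_lengths, genomes, min_length_ratio=0.9):
    """Keep genes that are single-copy in every genome and similar in length.

    gene_lengths: {gene: {genome: [length_of_each_copy, ...]}} (hypothetical).
    min_length_ratio: lowest accepted shortest/longest length ratio (assumed).
    """
    selected = []
    for gene, per_genome in gene_lengths.items():
        copies = [per_genome.get(genome, []) for genome in genomes]
        # Require exactly one copy in every genome.
        if any(len(c) != 1 for c in copies):
            continue
        lengths = [c[0] for c in copies]
        if max(lengths) == 0:
            continue
        # Require the shortest copy to be close in length to the longest.
        if min(lengths) / max(lengths) >= min_length_ratio:
            selected.append(gene)
    return selected


genomes = ["CP062120", "U67194", "AB231906", "AM261282"]
toy = {
    # Made-up lengths, for illustration only.
    "geneX": {"CP062120": [900], "U67194": [903], "AB231906": [900], "AM261282": [897]},
    "geneY": {"CP062120": [300], "U67194": [900], "AB231906": [900], "AM261282": [900]},
    "geneZ": {"CP062120": [600, 600], "U67194": [600], "AB231906": [600], "AM261282": [600]},
}
keep = select_genes(toy, genomes)
print(keep)
```

The genes that pass could then be aligned one at a time and inspected by eye.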
I hope that helps.
All the best, Sion
5931919.zip
Attached is a result of comparing four plasmid sequences (CP062120 U67194 AB231906 AM261282) using PIRATE.
There were so many gaps for CP062120 in core_alignment.fasta (94.09% of the 27471 characters, drawn from the alphabet "-ACGNT", are gaps "-"). I wonder if this is a bug or something?
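For reference, a gap percentage like the one above can be computed per sequence with a few lines of Python. This is a minimal sketch using only the standard library. The helper names `read_fasta` and `gap_fraction` are made up for the example, and the demo input is a tiny inline alignment rather than the real core_alignment.fasta.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gap_fraction(seq, gap_chars="-"):
    """Return the fraction of characters in seq that are gap characters."""
    if not seq:
        return 0.0
    return sum(seq.count(c) for c in gap_chars) / len(seq)


# Tiny inline alignment standing in for a real file.
demo = StringIO(">a\nAC--\n>b\n----\n")
fractions = {name: gap_fraction(seq) for name, seq in read_fasta(demo)}
for name, frac in fractions.items():
    print(f"{name}\t{100 * frac:.2f}% gaps")
```

To run it on a real file, replace the `StringIO` demo with `open("core_alignment.fasta")`. Passing `gap_chars="-N"` would count N padding as well.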
As for the whole nucleotide sequences, CP062120 (197271 bp) is much longer than the other three (AM261282, U67194, and AB231906, ranging from 46557 to 54383 bp), while GC content is not so different between the four sequences (ranging from 60.84 to 65.33%).
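The length and GC figures can be reproduced with something like the following Python sketch. The function name `gc_content` is made up for the example, and ignoring non-ACGT characters (such as N) in the denominator is an assumption. Other tools may define GC content slightly differently.

```python
def gc_content(seq):
    """Return GC content as a percentage of the A/C/G/T bases in seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Toy sequence, not one of the plasmids discussed here.
toy = "GGCCAATT"
print(len(toy), "bp,", f"{gc_content(toy):.2f}% GC")
```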