SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
88 stars 29 forks source link

Question about core genome concatenation #56

Closed rx32940 closed 3 years ago

rx32940 commented 3 years ago

Dear developers of PIRATE, I have run PIRATE for my dataset of bacteria isolates. However, for my follow-up analysis, I want to reduce the size of the core-genome alignment as small as possible. Thus, instead of using a nucleotide concatenation of the core-genome, I was hoping to create an amino acid core genome alignment. I had read through the PIRATE manual and realized the nucleotide core-genome alignment was created from the concatenation of amino acid gene alignments and tried to work with the create_pangenome_alignment.pl script to create an amino acid core genome alignment by concatenating amino acid gene sequences in the feature_sequences folder. However, I couldn't figure out how to deal with the samples with paralogs of a gene during concatenation. Please correct me if I am wrong. when I was exploring the PIRATE.gene_families.ordered.tsv file, I found that the core_genome_alignment.fasta file produced with the --align option used genes with the threshold for the avg_dosage as 1.25 instead of 1, although the dosage threshold set within the script seems to be 1.

Thank you so much in advance for helping me with my question.

Best, Rachel

SionBayliss commented 3 years ago

Hi Rachel,

I did something similar a while back but I hadn't added it to the PIRATE repo yet. I have just pushed a version of this script to master. You can find it in the tools/subsetting/create_pangenome_alignment_aa.pl. It works in the same way as the pangenome alignment script but also accepts the -a/--amino-acid flag to process the amino acid rather than nucleotide files. Please let me know if it works well as I have not tested it extensively.

You are correct about the dosage thresholds. I used 1 as default in the create_pangenome_alignment script but 1.25 when PIRATE is run. This is just to ensure that one or two poor quality/contaminated genomes do not have an oversized impact on the outputs.

All the best, Sion

rx32940 commented 3 years ago

Dear Sion,

Thank you so much for your answer. This has helped a lot with my analysis.

Best, Rachel

SionBayliss commented 3 years ago

No problem!