Panaroo for Phylogenetics

Steps for running Panaroo on WSL2, combining results from epi2me-labs/wf-bacterial-genomes with publicly available reference genomes downloaded from NCBI

Download the V. cholerae reference genomes from NCBI (select only FASTA and GFF) as described here (link to the material). (On Windows)
Unzip the file (On Windows)
Copy the unzipped files from Windows to the new location on WSL2. The command used was: cp -r /mnt/c/Users/bajun/Downloads/ncbi_dataset/ncbi_dataset/data resources/public_genomes/vcholerae/
The GFF file has the same name in every reference genome folder, so rename each one using the script scripts/03-rename_refseq_gff_for_panaroo.sh
Create a new folder, for example data/panaroo_test, and copy both the renamed RefSeq GFF files and the Prokka GFF files produced by the epi2me workflow into it
Run Panaroo on these GFF files to generate the core genes for phylogenetic analysis. Panaroo will likely fail with RuntimeError: Error reading prokka input! because the NCBI RefSeq GFF files are not in the Prokka GFF format that Panaroo supports (Prokka is the annotator used by the epi2me workflow). Convert every NCBI RefSeq GFF file to the Prokka format with the script convert_refseq_to_prokka_gff.py from "https://github.com/gtonkinhill/panaroo"; the wrapper scripts/convert_refseq_to_prokka.sh runs it on all of them. If you already completed step 5, first delete the unconverted NCBI RefSeq GFF files from that folder before running the script
Then run Panaroo again.
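
The renaming and collection steps above can be sketched as follows. This is only an illustration, not the contents of scripts/03-rename_refseq_gff_for_panaroo.sh: the function name collect_refseq_gffs, the accession folder names, and the assumption that every genome folder holds a file called genomic.gff are all hypothetical and should be checked against your own download.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: copy <src>/<accession>/genomic.gff to <dest>/<accession>.gff
# so that files with identical names no longer clash in one folder.
collect_refseq_gffs() {
    local src="$1" dest="$2" gff accession
    mkdir -p "$dest"
    for gff in "$src"/*/genomic.gff; do
        [ -e "$gff" ] || continue                  # glob matched nothing
        accession="$(basename "$(dirname "$gff")")"
        cp "$gff" "$dest/$accession.gff"
    done
}

# Self-contained demo with made-up accessions in a temporary directory.
demo="$(mktemp -d)"
mkdir -p "$demo/data/GCF_000000001.1" "$demo/data/GCF_000000002.1"
printf '##gff-version 3\n' > "$demo/data/GCF_000000001.1/genomic.gff"
printf '##gff-version 3\n' > "$demo/data/GCF_000000002.1/genomic.gff"
collect_refseq_gffs "$demo/data" "$demo/panaroo_test"
ls "$demo/panaroo_test"
```

For the real data the call would presumably be collect_refseq_gffs resources/public_genomes/vcholerae/data data/panaroo_test, with the paths adjusted to wherever the files were copied.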

After the GFF files were reformatted, Panaroo ran successfully.
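
The reformatting of the RefSeq files could be looped over all genomes roughly like this. It is a sketch, not the contents of scripts/convert_refseq_to_prokka.sh. It assumes that each genome folder holds genomic.gff next to one .fna FASTA file, that convert_refseq_to_prokka_gff.py sits in the current directory, and that the script takes -g, -f and -o for the input GFF, input FASTA and output GFF; confirm the flags with the script's --help before relying on them. By default the commands are only printed.

```shell
#!/usr/bin/env bash
set -eu

# Print the command when DRY_RUN=1 (the default here); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: one conversion per <src>/<accession>/ folder.
# Assumed flags: -g (GFF in), -f (FASTA in), -o (GFF out).
convert_all() {
    local src="$1" dest="$2" dir accession fasta f
    mkdir -p "$dest"
    for dir in "$src"/*/; do
        dir="${dir%/}"
        [ -f "$dir/genomic.gff" ] || continue
        accession="$(basename "$dir")"
        fasta=""
        for f in "$dir"/*.fna; do
            if [ -e "$f" ]; then fasta="$f"; break; fi
        done
        if [ -z "$fasta" ]; then
            echo "skipping $accession: no FASTA found" >&2
            continue
        fi
        run python3 convert_refseq_to_prokka_gff.py \
            -g "$dir/genomic.gff" -f "$fasta" -o "$dest/$accession.gff"
    done
}

# Self-contained demo with a made-up accession; nothing is converted in dry-run mode.
demo="$(mktemp -d)"
mkdir -p "$demo/data/GCF_000000001.1"
touch "$demo/data/GCF_000000001.1/genomic.gff" \
      "$demo/data/GCF_000000001.1/GCF_000000001.1_genomic.fna"
planned="$(convert_all "$demo/data" "$demo/panaroo_test")"
echo "$planned"
```

Setting DRY_RUN=0 in the environment makes the same function execute the conversions instead of printing them.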

You can skip step 6 by adding the parameter --remove-invalid-genes, so the command becomes: panaroo -i *.gff -o results --clean-mode strict --remove-invalid-genes
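
A small wrapper around that command might guard against an empty input folder before Panaroo starts. The names run and run_panaroo are hypothetical, the sample file names are made up, and only the Panaroo flags quoted above are used. As a sketch it prints the command by default rather than running it.

```shell
#!/usr/bin/env bash
set -eu

# Print the command when DRY_RUN=1 (the default here); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: refuse to start when the folder holds no GFF files.
run_panaroo() {
    local gff_dir="$1" out_dir="$2"
    set -- "$gff_dir"/*.gff                  # positional parameters now hold the GFF files
    if [ ! -e "$1" ]; then
        echo "no .gff files found in $gff_dir" >&2
        return 1
    fi
    run panaroo -i "$@" -o "$out_dir" --clean-mode strict --remove-invalid-genes
}

# Self-contained demo with empty placeholder files.
demo="$(mktemp -d)"
touch "$demo/sample_a.gff" "$demo/sample_b.gff"
planned="$(run_panaroo "$demo" "$demo/results")"
echo "$planned"
```

With the real data this would presumably be called as DRY_RUN=0 run_panaroo data/panaroo_test results once Panaroo is installed and on the PATH.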

cambiotraining / awd-pathogen-bioinformatics