cambiotraining / awd-pathogen-bioinformatics

Materials for "Introduction to Cholera Genomics" course
https://cambiotraining.github.io/awd-pathogen-bioinformatics

Panaroo for Phylogenetics #8

Closed: bsalehe closed this issue 9 months ago

bsalehe commented 11 months ago

Steps used to run Panaroo on WSL2, combining the results from epi2me-lab/wf-bacterial-genomics with reference genomes downloaded from NCBI:

  1. Download the V. cholerae reference genomes from NCBI (select only FASTA and GFF), as described here (link to the material). (On Windows)
  2. Unzip the downloaded file. (On Windows)
  3. Copy the unzipped files from Windows to the new location on WSL2. The command was cp -r /mnt/c/Users/bajun/Downloads/ncbi_dataset/ncbi_dataset/data resources/public_genomes/vcholerae/
  4. The GFF file has the same name in every reference genome folder, so rename the GFF files using the script scripts/03-rename_refseq_gff_for_panaroo.sh
  5. Create a new folder, for example data/panaroo_test, and copy both the RefSeq GFF files and the Prokka GFF files from epi2me-lab into it so that they sit together in one location.
  6. Run Panaroo on these GFF files to build a core-gene alignment for phylogenetic analysis. Panaroo will likely stop with the error RuntimeError: Error reading prokka input! because the NCBI RefSeq GFF files are not formatted the way Panaroo expects. They need to be converted to the same format as the Prokka GFF files, which is the format epi2me used and Panaroo supports. The conversion script is convert_refseq_to_prokka_gff.py from "https://github.com/gtonkinhill/panaroo". Run it on all the NCBI RefSeq GFF files; the script scripts/convert_refseq_to_prokka.sh does this. If you already did step 5, first delete all NCBI RefSeq GFF files from that folder before running the script.
  7. Run Panaroo again.
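
The renaming and conversion in steps 4 to 6 could be sketched roughly as below. This is an illustration only and not the repository's scripts/03-rename_refseq_gff_for_panaroo.sh or scripts/convert_refseq_to_prokka.sh. It assumes the NCBI download layout of one folder per accession holding genomic.gff and a single .fna file, and it assumes the converter accepts -g, -f and -o for the GFF, FASTA and output paths; please confirm those flags with the script's --help before relying on them. The function name and the CONVERTER variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only, NOT the scripts from the course repository.
# ASSUMPTION: converter flags are -g <gff> -f <fasta> -o <output>; check --help.
# ASSUMPTION: layout is <data_dir>/<accession>/genomic.gff plus one *.fna file.

# Command used to call the converter; override it to point at your own copy.
CONVERTER="${CONVERTER:-python3 convert_refseq_to_prokka_gff.py}"

convert_refseq_gffs() {
    local data_dir="$1" out_dir="$2"
    local acc_dir accession fasta
    mkdir -p "$out_dir"
    for acc_dir in "$data_dir"/*; do
        # Skip anything that is not an accession folder with an annotation file.
        [ -f "$acc_dir/genomic.gff" ] || continue
        accession="$(basename "$acc_dir")"
        # Use the first FASTA file in the folder as the matching genome sequence.
        set -- "$acc_dir"/*.fna
        fasta="$1"
        [ -f "$fasta" ] || { echo "No FASTA found in $acc_dir" >&2; continue; }
        # Name the output after the accession so files cannot overwrite each other.
        # $CONVERTER is left unquoted on purpose so "python3 script.py" splits in two.
        $CONVERTER -g "$acc_dir/genomic.gff" -f "$fasta" -o "$out_dir/${accession}.gff"
    done
}
```

With the paths from this thread, a call might look like convert_refseq_gffs resources/public_genomes/vcholerae/data data/panaroo_test. Writing the converted files under the accession name covers the renaming and the conversion in one pass.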

After reformatting the GFF files, Panaroo ran successfully.
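
For reference, a Panaroo call on the combined folder might look like the sketch below. The folder names are hypothetical, and the options shown (strict cleaning, core-gene alignment with MAFFT) are one reasonable choice rather than the settings actually used in this issue. The PANAROO variable is an assumption added here so the command can be swapped out for a dry run.

```shell
#!/usr/bin/env bash
# Sketch only: run Panaroo on every GFF in one folder and ask for a core-gene alignment.
# ASSUMPTION: the options below are illustrative, not the settings used in this issue.

# Command used to call Panaroo; set PANAROO=echo to print the arguments instead.
PANAROO="${PANAROO:-panaroo}"

run_panaroo() {
    local gff_dir="$1" out_dir="$2"
    # Create the output folder up front in case the installed version expects it.
    mkdir -p "$out_dir"
    # -i takes the Prokka-style GFF files; -a core requests the core-gene alignment.
    $PANAROO -i "$gff_dir"/*.gff -o "$out_dir" --clean-mode strict -a core --aligner mafft
}
```

For example, run_panaroo data/panaroo_test results/panaroo would process every .gff file in data/panaroo_test.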