UPHL-BioNGS / Grandeur

UPHL's Reference Free Pipeline
GNU General Public License v3.0

Pipeline Test files #216

Open arunbodd opened 1 month ago

arunbodd commented 1 month ago

Hello Developer,

Could you please provide at least a test.config with test fasta files, so I can run this pipeline and understand the output?

Thank you.

erinyoung commented 1 month ago

My apologies for my late reply!

Generally we use Grandeur with fasta files for two things:

  1. QC and species estimation from long-read assembly
  2. Phylogenetic analysis

I don't have this built into Grandeur (it's a long story, but many sites, such as the ENA, are blocked locally).

For phylogenetic analysis, this is what we use for testing with GitHub Actions (I'm assuming you're curious about the phylogenetic analysis):

Step 1. Get fasta files for the same species (they need to share 1500 genes)

mkdir fastas
cd fastas
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/013/783/245/GCA_013783245.1_ASM1378324v1/GCA_013783245.1_ASM1378324v1_genomic.fna.gz && gzip -d GCA_013783245.1_ASM1378324v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/026/626/185/GCA_026626185.1_ASM2662618v1/GCA_026626185.1_ASM2662618v1_genomic.fna.gz && gzip -d GCA_026626185.1_ASM2662618v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/020/808/985/GCA_020808985.1_ASM2080898v1/GCA_020808985.1_ASM2080898v1_genomic.fna.gz && gzip -d GCA_020808985.1_ASM2080898v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/904/863/225/GCA_904863225.1_KSB1_6J/GCA_904863225.1_KSB1_6J_genomic.fna.gz && gzip -d GCA_904863225.1_KSB1_6J_genomic.fna.gz
cd ../
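
Before moving on, it can help to confirm that every download was decompressed and actually contains sequences. The following is a minimal sketch, not part of Grandeur: `count_fasta_records` is a hypothetical helper name, and it assumes plain-text FASTA where each record starts with `>`.

```shell
# Print "<file><TAB><number of sequences>" for every file in a directory.
# Hypothetical helper; assumes uncompressed FASTA (records start with ">").
count_fasta_records() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            # A leftover .gz means gzip -d did not run for that file.
            *.gz) printf '%s\tstill compressed\n' "$f" ;;
            *)    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")" ;;
        esac
    done
}
```

Calling `count_fasta_records fastas` from the directory that contains `fastas` should list four `.fna` files, each with a non-zero record count.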

Step 2A. Then run the workflow

nextflow run . -profile docker,msa --fastas fastas

OR

Step 2B. Create the list of fastas and then run the workflow

Instead of pointing the workflow to a directory, you can give it a list of fasta files. This option is required when using cloud resources.

Creating the fasta list

ls fastas/* > fastas.txt

fastas.txt should then contain:

fastas/GCA_013783245.1_ASM1378324v1_genomic.fna
fastas/GCA_026626185.1_ASM2662618v1_genomic.fna
fastas/GCA_020808985.1_ASM2080898v1_genomic.fna
fastas/GCA_904863225.1_KSB1_6J_genomic.fna
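
A quick sanity check on the list can catch typos before the workflow starts. This is a sketch under assumptions: `check_fasta_list` is a hypothetical helper, and it only makes sense for local paths (it would not apply to cloud URIs).

```shell
# Report entries in a fasta list that are missing or empty.
# Hypothetical helper for local paths only; returns non-zero on any failure.
check_fasta_list() {
    list="$1"
    status=0
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue          # skip blank lines
        if [ ! -s "$path" ]; then
            printf 'missing or empty: %s\n' "$path" >&2
            status=1
        fi
    done < "$list"
    return "$status"
}
```

For example, `check_fasta_list fastas.txt && echo "list looks fine"` prints the confirmation only when every listed file exists and is non-empty.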

Running the workflow

nextflow run . -profile docker,msa --fasta_list fastas.txt

Step 3. Looking at results:

This gives a summary file with 1-2 key results from each analysis.

sample  file    version per_core_genome_genes   warnings    amrfinder_genes_(per_cov/per_ident) predicted_organism  mlst_matching_pubmlst_scheme    mlst_st fastani_top_organism    fastani_top_reference   fastani_top_ani_estimate    fastani_top_total_query_sequence_fragments  fastani_top_fragments_aligned_as_orthologous_matches    mash_reference  mash_mash-distance  mash_p-value    mash_matching-hashes    mash_organism   plasmidfinder_plasmid_(identity)    kleborate_virulence_score   kleborate_resistance_score
GCA_013783245.1_ASM1378324v1_genomic    GCA_013783245.1_ASM1378324v1_genomic.fna    4.5.24184   84.42   Multiple FastANI hits,Low core genes,   ['arsA (100.00/98.63)', 'arsB (100.00/98.83)', 'arsC (100.00/99.29)', 'arsD (100.00/90.83)', 'arsR (100.00/98.28)', 'blaSHV-11 (100.00/100.00)', 'emrD (100.00/99.49)', 'fieF (100.00/100.00)', 'fosA (100.00/99.28)', 'oqxA (100.00/100.00)', 'oqxB (100.00/100.00)', 'pcoA (100.00/100.00)', 'pcoB (100.00/100.00)', 'pcoC (100.00/100.00)', 'pcoD (100.00/99.68)', 'pcoE (100.00/97.92)', 'pcoR (100.00/100.00)', 'pcoS (100.00/99.57)', 'pmrB_R256G (100.00/99.45)', 'silA (100.00/98.85)', 'silB (100.00/97.91)', 'silC (100.00/99.35)', 'silE (100.00/92.31)', 'silF (100.00/99.15)', 'silP (99.64/94.53)', 'silR (100.00/98.23)', 'silS (100.00/98.78)'] Klebsiella_pneumoniae   klebsiella  37  Klebsiella_pneumoniae   Klebsiella_pneumoniae_GCF_000240185.1.fna.gz    99.124  1653    1791    refseq-NZ-1328379-PRJNA224116-SAMN02138587-GCF_000567645.1-.-Klebsiella_pneumoniae_MGH_47.fna   0.00305472  0   883/1000    Klebsiella_pneumoniae   ['Col440I (92.73)', 'IncFIB(K) (98.93)', 'IncFII(K) (100.0)']   0   0
GCA_020808985.1_ASM2080898v1_genomic    GCA_020808985.1_ASM2080898v1_genomic.fna    4.5.24184   84.62   Multiple FastANI hits,Low core genes,   ['blaSHV-11 (100.00/100.00)', 'emrD (100.00/99.49)', 'fieF (100.00/100.00)', 'fosA (100.00/100.00)', 'fosA7 (65.71/91.30)', 'oqxA (100.00/100.00)', 'oqxB (100.00/99.81)']  Klebsiella_pneumoniae   klebsiella  1017    Klebsiella_pneumoniae   Klebsiella_pneumoniae_GCF_022869665.1.fna.gz    99.0812 1579    1779    refseq-NZ-1438805-PRJNA224116-SAMN02581266-NZ_JJNJ-.-Klebsiella_pneumoniae_UCI_60.fna   0.00823165  0   726/1000    Klebsiella_pneumoniae   ['IncFIB(pKPHS1) (99.46)']  0   0
GCA_026626185.1_ASM2662618v1_genomic    GCA_026626185.1_ASM2662618v1_genomic.fna    4.5.24184   82.57   Multiple FastANI hits,Low core genes,   "['aac(3)-IVa (100.00/100.00)', 'aadA1 (100.00/100.00)', 'aadA2 (100.00/100.00)', 'aadA2 (100.00/100.00)', ""aph(3'')-Ib (100.00/100.00)"", ""aph(3'')-Ib (100.00/100.00)"", ""aph(3'')-Ib (100.00/99.63)"", ""aph(3')-IIa (100.00/100.00)"", ""aph(3')-Ia (100.00/100.00)"", 'aph(4)-Ia (100.00/100.00)', 'aph(6)-Id (100.00/100.00)', 'aph(6)-Id (100.00/100.00)', 'aph(6)-Id (100.00/100.00)', 'armA (100.00/100.00)', 'blaCTX-M-14 (100.00/100.00)', 'blaDHA-1 (100.00/100.00)', 'blaSHV-25 (100.00/100.00)', 'blaTEM-1 (100.00/100.00)', 'ble (61.11/96.20)', 'cmlA1 (100.00/100.00)', 'dfrA12 (100.00/100.00)', 'emrD (100.00/99.49)', 'fieF (100.00/100.00)', 'floR (100.00/99.75)', 'fosA (99.28/100.00)', 'fosA3 (100.00/100.00)', 'gyrA_S83I (100.00/99.77)', 'mph(A) (100.00/100.00)', 'mph(E) (100.00/100.00)', 'msr(E) (100.00/100.00)', 'oqxA (100.00/100.00)', 'oqxB (100.00/100.00)', 'parC_S80I (98.84/99.41)', 'qacE (82.61/95.79)', 'qacEdelta1 (100.00/100.00)', 'qacL (100.00/100.00)', 'qnrB4 (100.00/100.00)', 'qnrS1 (100.00/100.00)', 'rmtB1 (100.00/100.00)', 'sul1 (100.00/100.00)', 'sul1 (100.00/100.00)', 'sul2 (100.00/100.00)', 'sul3 (100.00/100.00)', 'terB (100.00/100.00)', 'terC (100.00/99.13)', 'terD (100.00/98.96)', 'terE (100.00/99.48)', 'tet(A) (100.00/99.75)', 'tmexC (100.00/99.74)', 'tmexD (100.00/99.90)', 'toprJ1 (100.00/100.00)']"    Klebsiella_pneumoniae   klebsiella  789 Klebsiella_pneumoniae   Klebsiella_pneumoniae_GCF_000240185.1.fna.gz    99.1677 1665    1861    refseq-NZ-573-PRJNA224116-SAMN02777842-GCF_000739495.1-.-Klebsiella_pneumoniae.fna  0.0083078   0   724/1000    Klebsiella_pneumoniae   ['Col(pHAD28) (91.6)', 'Col440I (91.23)', 'IncFIB(pNDM-Mar) (99.32)', 'IncHI1B(pNDM-MAR) (100.0)', 'IncR (100.0)', 'IncX1 (98.4)']  0   1
GCA_904863225.1_KSB1_6J_genomic GCA_904863225.1_KSB1_6J_genomic.fna 4.5.24184   83.44   Multiple FastANI hits,Low core genes,   "[""aac(6')-Ib-cr5 (100.00/100.00)"", ""aph(3'')-Ib (100.00/100.00)"", 'aph(6)-Id (100.00/100.00)', 'arsA (100.00/100.00)', 'arsB (100.00/100.00)', 'arsC (100.00/100.00)', 'arsD (100.00/91.67)', 'arsR (100.00/100.00)', 'blaCTX-M-15 (100.00/100.00)', 'blaOXA-1 (100.00/100.00)', 'blaSHV-1 (100.00/100.00)', 'blaTEM-1 (100.00/100.00)', 'catB3 (70.00/100.00)', 'clpK (97.89/99.25)', 'crcB (100.00/100.00)', 'dfrA14 (100.00/100.00)', 'emrD (100.00/99.49)', 'fieF (100.00/100.00)', 'fosA (100.00/99.28)', 'fosA7 (100.00/91.43)', 'hsp20 (100.00/100.00)', 'oqxA (100.00/100.00)', 'oqxB19 (100.00/100.00)', 'pcoA (100.00/100.00)', 'pcoB (100.00/100.00)', 'pcoC (100.00/100.00)', 'pcoD (100.00/99.68)', 'pcoE (100.00/94.44)', 'pcoR (100.00/100.00)', 'pcoS (100.00/99.14)', 'qnrB1 (100.00/100.00)', 'silA (100.00/98.85)', 'silB (100.00/97.91)', 'silC (100.00/100.00)', 'silE (100.00/91.61)', 'silF (100.00/99.15)', 'silP (99.64/94.18)', 'silR (100.00/100.00)', 'silS (100.00/100.00)', 'sul2 (100.00/100.00)', 'tet(A) (100.00/100.00)']"   Klebsiella_pneumoniae   klebsiella  323 Klebsiella_pneumoniae   Klebsiella_pneumoniae_GCF_000240185.1.fna.gz    99.0536 1616    1825    refseq-NZ-573-PRJNA224116-SAMEA2602936-NZ_CCGN-.-Klebsiella_pneumoniae.fna  0.000434439 0   982/1000    Klebsiella_pneumoniae   ['Col(pHAD28) (100.0)', 'IncFIB(K) (98.93)', 'IncFII(K) (95.95)']   0   1
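
The summary is wide, so pulling out a few columns by name makes it easier to read. The sketch below assumes the file is tab-separated with a single header row; `select_columns` is a hypothetical helper, not something Grandeur provides.

```shell
# Print selected columns of a tab-separated table, chosen by header name.
# Usage: select_columns <file> <col1,col2,...>
# Hypothetical helper; a name not found in the header yields "NA" in every row.
select_columns() {
    awk -F '\t' -v want="$2" '
        BEGIN { OFS = "\t"; n = split(want, names, ",") }
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i }   # map name -> position
        {
            line = ""
            for (k = 1; k <= n; k++) {
                field = (names[k] in idx) ? $(idx[names[k]]) : "NA"
                line = line (k > 1 ? OFS : "") field
            }
            print line
        }
    ' "$1"
}
```

With the summary saved as, say, `summary.tsv` (the file name here is an assumption), `select_columns summary.tsv sample,predicted_organism,mlst_st,fastani_top_ani_estimate` would show one short row per sample.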

There is also a newick file generated with iqtree2:

(GCA_020808985.1_ASM2080898v1_genomic:0.0038882302,(((GCA_013783245.1_ASM1378324v1_genomic:0.0035602662,GCA_026626185.1_ASM2662618v1_genomic:0.0030635049)67.8/75:0.0004092349,Klebsiella_pneumoniae_GCF_000240185.1:0.0032600322)100/100:0.0009262356,Klebsiella_pneumoniae_GCF_022869665.1:0.0046128026)99.6/99:0.0005900806,GCA_904863225.1_KSB1_6J_genomic:0.0039588357);
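
To check which samples and references ended up in the tree, the tip names can be listed with standard tools. This is a rough sketch, assuming a single-line Newick string with unquoted names like the one above; `newick_leaves` is a hypothetical helper.

```shell
# List the leaf (tip) names of a Newick tree, one per line.
# Hypothetical helper; assumes unquoted names without "(", ")", ",", ":" or ";".
newick_leaves() {
    sed -E 's/\)[^,():;]*//g' "$1" |   # drop each ")" plus any support label after it
        tr '(,;' '\n\n\n' |            # split the remaining tokens onto separate lines
        cut -d: -f1 |                  # keep the name, strip branch lengths
        grep -v '^[[:space:]]*$'       # drop empty tokens
}
```

On the tree above this should print six names: the four input assemblies and the two Klebsiella_pneumoniae reference genomes.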

A SNP matrix generated with snp-dists:

snp-dists 0.8.2,GCA_020808985.1_ASM2080898v1_genomic,GCA_013783245.1_ASM1378324v1_genomic,Klebsiella_pneumoniae_GCF_022869665.1,GCA_026626185.1_ASM2662618v1_genomic,Klebsiella_pneumoniae_GCF_000240185.1,GCA_904863225.1_KSB1_6J_genomic
GCA_020808985.1_ASM2080898v1_genomic,0,26202,26340,24554,25128,24777
GCA_013783245.1_ASM1378324v1_genomic,26202,0,26648,21896,22221,26669
Klebsiella_pneumoniae_GCF_022869665.1,26340,26648,0,26209,26246,26393
GCA_026626185.1_ASM2662618v1_genomic,24554,21896,26209,0,20967,25626
Klebsiella_pneumoniae_GCF_000240185.1,25128,22221,26246,20967,0,25609
GCA_904863225.1_KSB1_6J_genomic,24777,26669,26393,25626,25609,0
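
A matrix like this can be scanned for the closest pair of genomes. The sketch below assumes a square, comma-separated matrix whose rows are in the same order as its columns; `closest_pair` is a hypothetical helper.

```shell
# Print the two names with the smallest off-diagonal distance, then the distance.
# Hypothetical helper; assumes row order matches column order.
closest_pair() {
    awk -F ',' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
        {
            # Column NR holds the diagonal for this row, so start one past it
            # and visit only the upper triangle.
            for (j = NR + 1; j <= NF; j++) {
                if (best == "" || $j + 0 < best + 0) {
                    best = $j; a = $1; b = name[j]
                }
            }
        }
        END { if (best != "") print a, b, best }
    ' "$1"
}
```

In the matrix above, the smallest off-diagonal value is 20967, between GCA_026626185.1_ASM2662618v1_genomic and Klebsiella_pneumoniae_GCF_000240185.1.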

And more.

More information can be found on our wiki pages https://github.com/UPHL-BioNGS/Grandeur/wiki/Phylogenetic-Analysis, https://github.com/UPHL-BioNGS/Grandeur/wiki/USAGE#fasta-files, and https://github.com/UPHL-BioNGS/Grandeur/wiki/phylogenetic_analysis.

erinyoung commented 2 days ago

Did this work for you?