ammaraziz / ctgap

Chlamydia trachomatis Genome Assembly Pipeline
3 stars 2 forks source link

add phylogeny output #7

Open ammaraziz opened 9 months ago

ammaraziz commented 9 months ago

Create phylogeny for each sample.

ammaraziz commented 9 months ago

@gokeson this one is for you!

gokeson commented 9 months ago

Put all of your reference sequences in a folder (all have to be fasta)

create the list of input files for k-SNP4

MakeKSNP4infile -indir (folder) -outfile myInfile

find the ideal length of k-mer that will be used for the alignment:

Kchooser4 -in myInfile

run ksnp

kSNP4 -in myInfile -outdir name_of_output_directory -k 13 -core -ML -vcf

Core will give you the shared SNPs between all samples, in addition to all SNPs.

Vcf will output VCFs for your SNPs: what is found where, and what the original base should be and what it is instead.

ML will run a maximum likelihood tree. This is what we want, since maximum parsimony generates a tree based on minimum distances.

The output will be multiple files, but we care about: SNPs_all_matrix (the alignment file), and the ML.tre (the tree file).

After kSNP4, fix the tree using ClonalFrameML to mask recombinant regions:

Why run ClonalFrameML? Masking recombinant regions is customary practice in generating trees, since they bias the way the tree is generated. Therefore, ClonalFrameML will use your tree and sequences to identify heavily-recombinant regions and branches, note the rho/theta ratio, and re-draw your tree.

Use the output of k-SNP4 as the input to ClonalFrameML:

ClonalFrameML tree.SNPs_all.ML.tre SNPs_all_matrix.fasta final_clinical_clonalframe

(first argument is the tree, second argument is the alignment file, the third argument is the prefix to your files)

The output files are listed here: https://github.com/xavierdidelot/clonalframeml/wiki but what we care about is the tree (.newick). Look at the (em.txt) file if you want to know the rho/theta values, branch lengths (based on recombination distance) etc.

Visualization:

FigTree on windows to visualize. It is not necessarily the best, but it’s a very simple GUI

ammaraziz commented 9 months ago

I think we should avoid computational heavy like ClonalFrameML, how long does it take to run?

gokeson commented 9 months ago

I think we should avoid computational heavy like ClonalFrameML, how long does it take to run?

220 minutes for 125 samples to run clonalframeML after already running ksnp for 3 hours for same samples.

This is where gubins may be advantageous. it requires a ref genome but Ithink that is okay. Everyone uses strain D for gubbins so why not

gokeson commented 9 months ago

This is where gubins may be advantageous. it requires a ref genome but Ithink that is okay. Everyone uses strain D for gubbins so why not

And gubbins is really fast. If we stick with our one sample one tree approach, gubbins makes sense. Anyone needing to do comparative analyses for multiple samples can use ksnp+clonalframeML

gokeson commented 9 months ago
  • Use KSNP for generating reference free snps

How to run gubbins my way:

more info at: #running gubbins: https://github.com/nickjcroucher/gubbins/blob/master/docs/gubbins_manual.md

mamba create --name myenvname gubbins

mamba activate gubbins_env

generate_ska_alignment.py --reference seq_X.fa --input input.list --out test.aln

Where input.list is a tab-delimited file with one row per isolate. The first column should be the isolate name, and the subsequent entries on the same row should contain the corresponding sequence data

run_gubbins.py --prefix gubbins_out test.aln --tree-builder raxml

ammaraziz commented 9 months ago

On your 125 sample tree with clonalframeML, can you extract the snps which are flagged as recombinant?

gokeson commented 9 months ago

On your 125 sample tree with clonalframeML, can you extract the snps which are flagged as recombinant?

Nope. I tried but can't. Do we need it though?