annamtown / gene8940

Repository for GENE 8940 projects
0 stars 0 forks source link

Project Plan #10

Open annamtown opened 6 years ago

annamtown commented 6 years ago

My project will analyze Salmonella enterica subsp. enterica serovar Enteritidis (S. enteritidis) genomes from a variety of international locations. I think the paper given below would be a good study to replicate: https://www.nature.com/articles/ng.3644.pdf In the supplementary materials, they provide the accession numbers for 674 isolates of S. enteritidis. One issue regarding these isolates is that they were collected over 63 years. I'm sure sure how much the DNA degraded for older samples. All of the data was sequenced using Illumina HiSeq 2000. I would prefer to work with recent isolates, but this will also decrease the same size.

Depending on how closely I want to replicate this paper, I could repeat their phylogenetic analyses using HierBAPS (and potentially with BEAST), but analyze other aspects of how the clades could be related aside from location of origin and sequence similarity. I've worked on a small CRISPR project before, so maybe I could analyze the CRISPR loci and compare spacers of each isolate within and between clades?

cbergman commented 6 years ago
cbergman commented 6 years ago
annamtown commented 6 years ago

Supplementary Data: https://www.nature.com/articles/ng.3644#supplementary-information

cbergman commented 6 years ago
annamtown commented 6 years ago

Yes, all samples are paired with two fastq reads

_quast, mummer, and prokka did not run correctly - fixed this issue by removing these commands because they are unneeded for this project

RAxML specifies GTR + I + G model

# run RAxML GTR with + I + G model and 100 bootstrap pseudoreplicate analyses of the alignment data
#-T is number of threads
#-f a is rapid Bootstrap analysis and search for best­scoring ML tree in one program run 
#-G enables the ML­based evolutionary placement algorithm heuristics by specifying a threshold value of 0.1 (10% of branches considered for thorough insertions)
#-I is a posteriori bootstopping analysis
#-x specifies an integer number (random seed) and turn on rapid bootstrapping
#-p specifies a random number seed for the parsimony inferences
#-m is the model
#-n is the name of the new files
#-s is the input file
raxmlHPC-PTHREADS -T 6 -f a -G 0.1 -x 12345 -p 4523 -m GTRGAMMA ­-I autoFC -n enteritidis -s allconsensus.fasta -# 100
cbergman commented 6 years ago

To convert your multiple consensus.fa files: 1) rename each fasta header:

sed "s/>.*/>$i/g" 

2) concatenate into one multi-fasta file

cat /path/to/consensus/*.consenus.fa

3) optional?: convert multi-fasta to phylip using EMBOSS

seqret -sequence fasta::input.fna -outseq phylip::output.phy
annamtown commented 6 years ago

Fasta parsing error, RAxML expects an alignment. the sequence before taxon >ERR338265 : seems to have a different length



Will try to convert to .phy and run RAxML again
annamtown commented 6 years ago

RAxML ran successfully after converting my multi-fasta file to phylip format:

# convert allconsensus.fasta to .phy format
/usr/local/emboss/latest/bin/seqret -sequence fasta::allconsensus.fasta -outseq phylip::allconsensus.phy

Successfully ran RAxML with phylip file:

/usr/local/raxml/latest/raxmlHPC-PTHREADS -T 6 -f a -G 0.1 -x 12345 -p 4523 -m GTRGAMMA ­-I autoFC --no-bfgs -n enteritidis -s allconsensus.phy -# 100

However, I am uncertain if these exact options were used in the analyses completed in the paper. The paper states, "A maximum-likelihood phylogenetic tree was then built from the alignments of the isolates using RAxML (version 7.0.4) with a GTR + I + G model." I read the manual and guessed which parameter to use for the -I and -G flags.

cbergman commented 6 years ago
annamtown commented 6 years ago

I can also post images/PDFs of the trees from FigTree here, if you want me to.

I just submitted a new job for 50 isolates to see if my new wget command works for downloading the isolates from ENA. I will post trees from this job when it is finished. Job did not run correctly. I got a bunch of "core.____" files in my directory. What does this mean? This is the script I used: 4f70f93 I assume this is a core dump, so I increased my thread count from four to six.

cbergman commented 6 years ago
annamtown commented 6 years ago
cbergman commented 6 years ago
annamtown commented 6 years ago

Do the timestamps for the two RAxML files indicate that the job is still running correctly but very slowly? Should I cancel the job and only rerun RAxML with six threads?

cbergman commented 6 years ago
annamtown commented 6 years ago

Looks like only 29 bootstrap trees have been generated:

wc -l RAxML_bootstrap.enteritidis 
29 RAxML_bootstrap.enteritidis
cbergman commented 6 years ago
annamtown commented 6 years ago

Here is my fasttree script: c60d597

I've tried to run it three times, but it does not execute properly. This is in my error file:

/var/spool/uge/the_zcluster/compute-13-20/job_scripts/182208: line 10: 10616 Segmentation fault      (core dumped) FastTree -nt -n 10 < /escratch4/s_11/s_11_Aug_17/project/allconsensus.phy > fasttree_file
cbergman commented 6 years ago
annamtown commented 6 years ago
cbergman commented 6 years ago
annamtown commented 6 years ago
annamtown commented 6 years ago
cbergman commented 6 years ago
annamtown commented 6 years ago

Here are the 10 sample trees:

My 50-sample run executed well.

The outgroup, ERR338264, is the same for both the 10 sample run and then 50 sample run.

annamtown commented 6 years ago

I ran my all sample job this morning and there is still something wrong with a command.

I omitted my curl command to get the meta file from my script. Before I ran the job, I used the curl command via command line so that my file would already be in the directory. I'm assuming that the command to define "i" is the problem because it is not using the ERR#s as the file names for the fastqs.

screen shot 2017-12-01 at 10 29 41 am
cbergman commented 6 years ago
annamtown commented 6 years ago

The "run_accession" files were not generated from my previous jobs. I'm assuming that the

cut -f6 meta.tvs

command is inserting the header of the column in addition to the ERR#s. Will this affect my allconsensus files?

cbergman commented 6 years ago
annamtown commented 6 years ago