Sequence genomes of CS sub-strains in Colomb & Brembs (2014)

cbergman commented 10 years ago

To test whether groups of CS strains exhibiting behavioural differences have a corresponding genetic basis, sequence the genomes of the 5 sub-strains of CS in Colomb & Brembs (2014): http://f1000research.com/articles/3-176/v1

cbergman commented 10 years ago

DNA samples prepped in Brembs Lab using methods detailed here: https://github.com/brembslab/cs_buri/blob/master/doc/methods/dna_prep.md
DNA samples received by Bergman Lab 28-10-14 and submitted for whole genome shotgun sequencing
Aim is to generate ~50x coverage 100bp paired-end short-read data on an Illumina HiSeq
Libraries will be prepped using Illumina Nextera kit.
Estimated date of receipt of sequences 15 Dec 2014

brembs commented 9 years ago

Any news on the sequences?

cbergman commented 9 years ago

Yup, good news. Sequences came in a little before the holidays, but I haven't had a chance to look at them till today. It looks like the samples were run over 4 lanes and thus there are 4*5 pairs of fastq files. I've written an initial script to QC, map and call SNPs for these samples here: https://github.com/brembslab/cs_buri/blob/master/src/run_cs_buri_qc_map_vcf.pl. I've just launched this and will post an update when the data flow through the cluster. Data processing thus far has been:

cp -rp /path/to/ngsdata/Analysis/2014/hiseq/141111_SN700511R_0263_BC5R1AACXX_analysis/CaseyBergman/fastqs/ /path/to/my/scratch/cs_buri
perl cs_buri/src/run_cs_buri_qc_map_vcf.pl /path/to/my/scratch/cs_buri/

cbergman commented 9 years ago

Run 1 failed because of no explicit path to $home. Added $home variable to https://github.com/brembslab/cs_buri/blob/master/src/run_cs_buri_qc_map_vcf.pl and relaunched. Invoked run 2 as:

perl $HOME/cs_buri/src/run_cs_buri_qc_map_vcf.pl $HOME $HOME/scratch/cs_buri

cbergman commented 9 years ago

Run 2 worked. Data look good. FastQC "per base pair quality" metrics for all fastq read files is PASS:

grep "Per base sequence quality" */summary.txt
BS_CGTACTAG-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L001_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L001_R2_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L002_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L002_R2_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L003_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L003_R2_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L004_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   BS_CGTACTAG-TAGATCGC_L004_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L001_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L001_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L002_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L002_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L003_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L003_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L004_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   HS_TCCTGAGC-TAGATCGC_L004_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L001_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L001_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L002_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L002_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L003_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L003_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L004_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   JC_AGGCAGAA-TAGATCGC_L004_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L001_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L001_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L002_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L002_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L003_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L003_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L004_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TP_TAGGCATG-TAGATCGC_L004_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L001_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L001_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L002_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L002_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L003_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L003_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L004_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS    Per base sequence quality   TZ_GGACTCCT-TAGATCGC_L004_R2_001.fastq.gz

I will go ahead and submit these data to ENA now.

cbergman commented 9 years ago

Preliminarily, it looks like there are substantial genomic differences between these wild-type CS strains, as shown in the following genome browser screen shot:

hgt_genome_euro_2878_2256b0

This figure was made by generating variant call format (VCF) files for each of the lines (VCF files are here: http://bergmanlab.smith.man.ac.uk/data/genomes/), then making a UCSC session with custom tracks (http://genome-euro.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=caseybergman&hgS_otherUserSessionName=cs_buri_v1). Variants are called vs the reference genome, so if all lines share differences with the reference it will appear as a tick mark in all genomes, which gives the false visual impression of difference when all lines are in fact the same. This can be improved in later iterations.

VCF files were generated and transferred as follows:

for i in `ls *vcf`; do j=`basename $i | cut -d"_" -f1`; sed 's:/mnt/fls01-home01/xxxxx:/$HOME:' $i > $j.vcf; done
for i in `ls *vcf | grep -v _`; do bgzip $i; done
for i in `ls *vcf.gz`; do tabix -p vcf $i; done
scp *vcf.gz* xxxx@bergmanlab.ls.manchester.ac.uk:/data/genomes/

UCSC track definitions were generated as follows:

for i in `ls *vcf.gz`; do printf "track type=vcfTabix name=\"$i\" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/$i\n"; done

If this session expires, tracks can be re-loaded using these track definitions:

track type=vcfTabix name="BS.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/BS.vcf.gz
track type=vcfTabix name="HS.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/HS.vcf.gz
track type=vcfTabix name="JC.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/JC.vcf.gz
track type=vcfTabix name="TP.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/TP.vcf.gz
track type=vcfTabix name="TZ.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/TZ.vcf.gz

cbergman commented 9 years ago

@brembs - can you verify that the information in this sample sheet is correct: https://github.com/brembslab/cs_buri/blob/master/data/genomics/cs_buri_ena_sample_sheet.tsv? The headers are standard headers needed for the ENA submission. It's the provenance and nomenclature that I'd like you to double check.

brembs commented 9 years ago

Everything is factually correct, but I'm not sure if we shouldn't use the names of the people in the acronyms, rather than those from where we can trace back the stock?

cbergman commented 9 years ago

OK, how about e.g. for CS_TP something like "...(source: Tim Tully Lab via Thomas Préat)"

brembs commented 9 years ago

Perfect!

cbergman commented 9 years ago

Updated, can you double check https://github.com/brembslab/cs_buri/blob/master/data/genomics/cs_buri_ena_sample_sheet.tsv again?

brembs commented 9 years ago

Could not find any mistakes.

cbergman commented 9 years ago

Excellent. I'll use this file for the ENA submission then!

brembs commented 9 years ago

Yes!

cbergman commented 9 years ago

I summarized the mapping and fastQC results in a table here: https://github.com/brembslab/cs_buri/blob/master/data/genomics/cs_buri_qc_map.tsv. File was generated as follows:

perl $HOME/cs_buri/src/summarize_cs_buri_qc_map.pl $HOME $HOME/scratch/cs_buri > cs_buri_qc_map.tsv

This analysis revealed that CS_HS is likely infected with Wolbachia, but that this is the only strain that is likely to be infected. So Wolbachia infection may explain HS difference from other CS strains, but it will not explain clustering of strains as seen http://f1000research.com/articles/3-176/v1.

This analysis also revealed that the raw sequence data contain up to 10% of reads with Nextera adapter sequences in them (see discussion here: http://seqanswers.com/forums/showthread.php?t=49702). This may explain the relatively low mapping rate (75-85% of reads are mapped). My preference is to submit the data as it is, report this in the paper and trim/filter the data in future analyses. Reads with adapters can easily be identified, but trimming and filtering prior to submission prevents others from processing the data in alternative ways in the future. Are you OK with this plan?

brembs commented 9 years ago

I'm absolutely ok with your plan!

brembs commented 9 years ago

I'll let Henrike know about her infection...

cbergman commented 9 years ago

We need to input some metadata for the ENA submission about the project. I suggest the following:

short name for the study
- Drosophila Canton-S genome sequences
short descriptive title for the study
- Drosophila melanogaster Canton-S laboratory strain genome sequencing project
abstract to describe the study in detail
- Whole genome shotgun sequences for multiple Canton-S wild-type laboratory strains of Drosophila melanogaster. Strains selected for sequencing are described in Colomb and Brembs (2014) Sub-strains of Drosophila Canton-S differ markedly in their locomotor behavior. F1000Research 3:176 (doi: 10.12688/f1000research.4263.1).

brembs commented 9 years ago

I'd suggest to use sub-strain instead f strain to avoid confusion, as they're all called Canton-S in the literature. There shouldn't be, in principle, different Canton S strains, but there can be different Canton S sub-strains.

What do you think?

cbergman commented 9 years ago

Agreed. This makes nomenclature consistent with Colomb and Brembs (2014) as well.

cbergman commented 9 years ago

How about this:

Drosophila Canton-S genome sequences
Drosophila melanogaster Canton-S sub-strain genome sequencing project
Whole genome shotgun sequences for multiple sub-strains of the wild-type Canton-S laboratory strain of Drosophila melanogaster. Sub-strains selected for sequencing are described in: Colomb and Brembs (2014) Sub-strains of Drosophila Canton-S differ markedly in their locomotor behavior. F1000Research 3:176 (doi: 10.12688/f1000research.4263.1).

cbergman commented 9 years ago

Data are submitted to EBI using metadata in https://github.com/brembslab/cs_buri/issues/2#issuecomment-71207198. Final sample and fastq .tsv sheets are in https://github.com/brembslab/cs_buri/tree/master/data/genomics. Script to generate fastq .tsv file is here: https://github.com/brembslab/cs_buri/blob/master/src/submit_cs_buri_md5_csv.pl. Invoked as follows:

perl /path/to/cs_buri/src/submit_cs_buri_md5_csv.pl path/to/ngsdata//Analysis/2014/hiseq/141111_SN700511R_0263_BC5R1AACXX_analysis/CaseyBergman/fastqs ~/scratch/cs_buri/

The submitted reads to ENA FTP site as follows:

cd ~/scratch/cs_buri/
wget --no-check-certificate https://raw.githubusercontent.com/bergmanlab/blogscripts/master/ena_submit.exp
expect ena_submit.exp cs_buri.fofn $USER $PASS

Confirmation from ENA about accession numbers is as follows:

Once public, the study will be available at: http://www.ebi.ac.uk/ena/data/view/PRJEB8324 usually within 24 hours.

Accession Summary
---------------
Study accession Study unique name
PRJEB8324 ena-STUDY-UOM-27-01-2015-15:28:30:248-1010

Sample accession Secondary accession Sample unique name
ERS646564 SAMEA3220746 CS_TP
ERS646565 SAMEA3220747 CS_TZ
ERS646566 SAMEA3220748 CS_JC
ERS646567 SAMEA3220749 CS_BS
ERS646568 SAMEA3220750 CS_HS

Study accession Sample accession Experiment accession Run accession
PRJEB8324 ERS646565 ERX688215 ERR744533
PRJEB8324 ERS646565 ERX688216 ERR744534
PRJEB8324 ERS646565 ERX688217 ERR744535
PRJEB8324 ERS646565 ERX688218 ERR744536
PRJEB8324 ERS646564 ERX688219 ERR744537
PRJEB8324 ERS646564 ERX688220 ERR744538
PRJEB8324 ERS646564 ERX688221 ERR744539
PRJEB8324 ERS646564 ERX688222 ERR744540
PRJEB8324 ERS646566 ERX688223 ERR744541
PRJEB8324 ERS646566 ERX688224 ERR744542
PRJEB8324 ERS646566 ERX688225 ERR744543
PRJEB8324 ERS646566 ERX688226 ERR744544
PRJEB8324 ERS646568 ERX688227 ERR744545
PRJEB8324 ERS646568 ERX688228 ERR744546
PRJEB8324 ERS646568 ERX688229 ERR744547
PRJEB8324 ERS646568 ERX688230 ERR744548
PRJEB8324 ERS646567 ERX688231 ERR744549
PRJEB8324 ERS646567 ERX688232 ERR744550
PRJEB8324 ERS646567 ERX688233 ERR744551
PRJEB8324 ERS646567 ERX688234 ERR744552

cbergman commented 9 years ago

deleting QC scratch directory too make space. All data are in ENA and all QC results can be regenerated from scripts detailed above.

brembslab / cs_buri

Sequence genomes of CS sub-strains in Colomb & Brembs (2014) #2