Closed cbergman closed 8 years ago
Any news on the sequences?
Yup, good news. Sequences came in a little before the holidays, but I haven't had a chance to look at them till today. It looks like the samples were run over 4 lanes and thus there are 4*5 pairs of fastq files. I've written an initial script to QC, map and call SNPs for these samples here: https://github.com/brembslab/cs_buri/blob/master/src/run_cs_buri_qc_map_vcf.pl. I've just launched this and will post an update when the data flow through the cluster. Data processing thus far has been:
cp -rp /path/to/ngsdata/Analysis/2014/hiseq/141111_SN700511R_0263_BC5R1AACXX_analysis/CaseyBergman/fastqs/ /path/to/my/scratch/cs_buri
perl cs_buri/src/run_cs_buri_qc_map_vcf.pl /path/to/my/scratch/cs_buri/
Run 1 failed because of no explicit path to $home. Added $home variable to https://github.com/brembslab/cs_buri/blob/master/src/run_cs_buri_qc_map_vcf.pl and relaunched. Invoked run 2 as:
perl $HOME/cs_buri/src/run_cs_buri_qc_map_vcf.pl $HOME $HOME/scratch/cs_buri
Run 2 worked. Data look good. FastQC "per base pair quality" metrics for all fastq read files is PASS:
grep "Per base sequence quality" */summary.txt
BS_CGTACTAG-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L001_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L001_R2_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L002_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L002_R2_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L003_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L003_R2_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L004_R1_001.fastq.gz
BS_CGTACTAG-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS Per base sequence quality BS_CGTACTAG-TAGATCGC_L004_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L001_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L001_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L002_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L002_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L003_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L003_R2_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L004_R1_001.fastq.gz
HS_TCCTGAGC-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS Per base sequence quality HS_TCCTGAGC-TAGATCGC_L004_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L001_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L001_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L002_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L002_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L003_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L003_R2_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L004_R1_001.fastq.gz
JC_AGGCAGAA-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS Per base sequence quality JC_AGGCAGAA-TAGATCGC_L004_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L001_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L001_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L002_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L002_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L003_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L003_R2_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L004_R1_001.fastq.gz
TP_TAGGCATG-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS Per base sequence quality TP_TAGGCATG-TAGATCGC_L004_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L001_R1_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L001_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L001_R2_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L001_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L002_R1_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L002_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L002_R2_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L002_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L003_R1_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L003_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L003_R2_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L003_R2_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L004_R1_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L004_R1_001.fastq.gz
TZ_GGACTCCT-TAGATCGC_L004_R2_001_fastqc/summary.txt:PASS Per base sequence quality TZ_GGACTCCT-TAGATCGC_L004_R2_001.fastq.gz
I will go ahead and submit these data to ENA now.
Preliminarily, it looks like there are substantial genomic differences between these wild-type CS strains, as shown in the following genome browser screen shot:
This figure was made by generating variant call format (VCF) files for each of the lines (VCF files are here: http://bergmanlab.smith.man.ac.uk/data/genomes/), then making a UCSC session with custom tracks (http://genome-euro.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=caseybergman&hgS_otherUserSessionName=cs_buri_v1). Variants are called vs the reference genome, so if all lines share differences with the reference it will appear as a tick mark in all genomes, which gives the false visual impression of difference when all lines are in fact the same. This can be improved in later iterations.
VCF files were generated and transferred as follows:
for i in `ls *vcf`; do j=`basename $i | cut -d"_" -f1`; sed 's:/mnt/fls01-home01/xxxxx:/$HOME:' $i > $j.vcf; done
for i in `ls *vcf | grep -v _`; do bgzip $i; done
for i in `ls *vcf.gz`; do tabix -p vcf $i; done
scp *vcf.gz* xxxx@bergmanlab.ls.manchester.ac.uk:/data/genomes/
UCSC track definitions were generated as follows:
for i in `ls *vcf.gz`; do printf "track type=vcfTabix name=\"$i\" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/$i\n"; done
If this session expires, tracks can be re-loaded using these track definitions:
track type=vcfTabix name="BS.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/BS.vcf.gz
track type=vcfTabix name="HS.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/HS.vcf.gz
track type=vcfTabix name="JC.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/JC.vcf.gz
track type=vcfTabix name="TP.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/TP.vcf.gz
track type=vcfTabix name="TZ.vcf.gz" bigDataUrl=http://bergmanlab.smith.man.ac.uk/data/genomes/TZ.vcf.gz
@brembs - can you verify that the information in this sample sheet is correct: https://github.com/brembslab/cs_buri/blob/master/data/genomics/cs_buri_ena_sample_sheet.tsv? The headers are standard headers needed for the ENA submission. It's the provenance and nomenclature that I'd like you to double check.
Everything is factually correct, but I'm not sure if we shouldn't use the names of the people in the acronyms, rather than those from where we can trace back the stock?
OK, how about e.g. for CS_TP something like "...(source: Tim Tully Lab via Thomas Préat)"
Perfect!
Updated, can you double check https://github.com/brembslab/cs_buri/blob/master/data/genomics/cs_buri_ena_sample_sheet.tsv again?
Could not find any mistakes.
Excellent. I'll use this file for the ENA submission then!
Yes!
I summarized the mapping and fastQC results in a table here: https://github.com/brembslab/cs_buri/blob/master/data/genomics/cs_buri_qc_map.tsv. File was generated as follows:
perl $HOME/cs_buri/src/summarize_cs_buri_qc_map.pl $HOME $HOME/scratch/cs_buri > cs_buri_qc_map.tsv
This analysis revealed that CS_HS is likely infected with Wolbachia, but that this is the only strain that is likely to be infected. So Wolbachia infection may explain HS difference from other CS strains, but it will not explain clustering of strains as seen http://f1000research.com/articles/3-176/v1.
This analysis also revealed that the raw sequence data contain up to 10% of reads with Nextera adapter sequences in them (see discussion here: http://seqanswers.com/forums/showthread.php?t=49702). This may explain the relatively low mapping rate (75-85% of reads are mapped). My preference is to submit the data as it is, report this in the paper and trim/filter the data in future analyses. Reads with adapters can easily be identified, but trimming and filtering prior to submission prevents others from processing the data in alternative ways in the future. Are you OK with this plan?
I'm absolutely ok with your plan!
I'll let Henrike know about her infection...
We need to input some metadata for the ENA submission about the project. I suggest the following:
I'd suggest to use sub-strain instead f strain to avoid confusion, as they're all called Canton-S in the literature. There shouldn't be, in principle, different Canton S strains, but there can be different Canton S sub-strains.
What do you think?
Agreed. This makes nomenclature consistent with Colomb and Brembs (2014) as well.
How about this:
Data are submitted to EBI using metadata in https://github.com/brembslab/cs_buri/issues/2#issuecomment-71207198. Final sample and fastq .tsv sheets are in https://github.com/brembslab/cs_buri/tree/master/data/genomics. Script to generate fastq .tsv file is here: https://github.com/brembslab/cs_buri/blob/master/src/submit_cs_buri_md5_csv.pl. Invoked as follows:
perl /path/to/cs_buri/src/submit_cs_buri_md5_csv.pl path/to/ngsdata//Analysis/2014/hiseq/141111_SN700511R_0263_BC5R1AACXX_analysis/CaseyBergman/fastqs ~/scratch/cs_buri/
The submitted reads to ENA FTP site as follows:
cd ~/scratch/cs_buri/
wget --no-check-certificate https://raw.githubusercontent.com/bergmanlab/blogscripts/master/ena_submit.exp
expect ena_submit.exp cs_buri.fofn $USER $PASS
Confirmation from ENA about accession numbers is as follows:
Once public, the study will be available at: http://www.ebi.ac.uk/ena/data/view/PRJEB8324 usually within 24 hours.
Accession Summary
---------------
Study accession Study unique name
PRJEB8324 ena-STUDY-UOM-27-01-2015-15:28:30:248-1010
Sample accession Secondary accession Sample unique name
ERS646564 SAMEA3220746 CS_TP
ERS646565 SAMEA3220747 CS_TZ
ERS646566 SAMEA3220748 CS_JC
ERS646567 SAMEA3220749 CS_BS
ERS646568 SAMEA3220750 CS_HS
Study accession Sample accession Experiment accession Run accession
PRJEB8324 ERS646565 ERX688215 ERR744533
PRJEB8324 ERS646565 ERX688216 ERR744534
PRJEB8324 ERS646565 ERX688217 ERR744535
PRJEB8324 ERS646565 ERX688218 ERR744536
PRJEB8324 ERS646564 ERX688219 ERR744537
PRJEB8324 ERS646564 ERX688220 ERR744538
PRJEB8324 ERS646564 ERX688221 ERR744539
PRJEB8324 ERS646564 ERX688222 ERR744540
PRJEB8324 ERS646566 ERX688223 ERR744541
PRJEB8324 ERS646566 ERX688224 ERR744542
PRJEB8324 ERS646566 ERX688225 ERR744543
PRJEB8324 ERS646566 ERX688226 ERR744544
PRJEB8324 ERS646568 ERX688227 ERR744545
PRJEB8324 ERS646568 ERX688228 ERR744546
PRJEB8324 ERS646568 ERX688229 ERR744547
PRJEB8324 ERS646568 ERX688230 ERR744548
PRJEB8324 ERS646567 ERX688231 ERR744549
PRJEB8324 ERS646567 ERX688232 ERR744550
PRJEB8324 ERS646567 ERX688233 ERR744551
PRJEB8324 ERS646567 ERX688234 ERR744552
To test whether groups of CS strains exhibiting behavioural differences have a corresponding genetic basis, sequence the genomes of the 5 sub-strains of CS in Colomb & Brembs (2014): http://f1000research.com/articles/3-176/v1