biospectrabysequencing / gbs_moa

Workflow for GBS in moa template format

Quality control of GBS data #16

Open mdavy86 opened 9 years ago

mdavy86 commented 9 years ago

This is a placeholder to discuss what we are doing in terms of quality control of GBS data.

Plant and Food Research

We have some perl scripts, knitr Rmarkdown scripts, and a shiny application looking at quality control aspects of GBS restriction sites for bam alignments.

The shiny application does some exploratory analysis summarizing 96 wells * 2 bam files for ~1.5 million restriction sites/tags in real time: it checks the sampled yield distributions against the known population of restriction sites for each sample, and investigates coverage depth and fragment size distribution before SNP discovery is considered.

The perl script sanity-checks restriction fragments (probably unnecessary) and summarises sites in the following form:

```
$ perl gbsSites.pl
NAME
    gbsSites.pl - BAM to location terminal ends

DESCRIPTION
    Process a bam file for GBS restriction sites

SYNOPSIS
     gbsSites.pl [options]

    Where options and [defaults] are:

     -bam <BAM file>       Path to a bam file. Multiple options allowed   []
     -enzyme <Enzyme name> Which restriction enzyme? BamHI, ApeKI etc     [BamHI]
     -format <narrow|wide> Options: 'wide' or 'narrow' formats            [wide]
     -out <output file>    Filename for tab delimited report              [report.txt]
```

## Example output

```
Sample      Chromosome  cutSite  Count  fwdCount  revCompCount
[BAMFile]   1           8312     1      0         1
[BAMFile]   1           17201    340    340       0
[BAMFile]   1           33026    2      0         2
[BAMFile]   1           35031    1      1         0
[BAMFile]   1           50458    54     0         54
```
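As a minimal sketch of how the wide-format report could be consumed downstream (the column names come from the example output above; `summarise` and the inlined `REPORT` string are hypothetical, not part of gbsSites.pl):

```python
import csv
import io

# Example rows copied from the report above (sample name substituted).
REPORT = """Sample\tChromosome\tcutSite\tCount\tfwdCount\trevCompCount
BAMFile\t1\t8312\t1\t0\t1
BAMFile\t1\t17201\t340\t340\t0
BAMFile\t1\t33026\t2\t0\t2
BAMFile\t1\t35031\t1\t1\t0
BAMFile\t1\t50458\t54\t0\t54
"""

def summarise(report_text):
    """Total read count and cut sites whose reads all come from one strand."""
    rows = list(csv.DictReader(io.StringIO(report_text), delimiter="\t"))
    total = sum(int(r["Count"]) for r in rows)
    # A site is one-sided when either strand contributes zero reads,
    # as happens in every row of the example output.
    one_sided = [r["cutSite"] for r in rows
                 if int(r["fwdCount"]) == 0 or int(r["revCompCount"]) == 0]
    return total, one_sided

total, one_sided = summarise(REPORT)
print(total, len(one_sided))  # 398 reads, 5 one-sided sites
```

Strand-imbalanced sites like these are exactly the sort of thing the QC plots would flag before SNP discovery.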
rbrauning commented 9 years ago

To enable biologists and lab staff to contribute to QC efforts I've put together a set of questions of interest to ask of a GBS run. Technical details are left out to draw in non-bioinformaticians.

  1. Fastq
    • Did we get per lane what’s promised in terms of output?
    • What does the sequence quality look like?
    • How pure is the data (adapters, other species)? What are contaminants?
  2. Barcodes
    • How many reads have recognizable barcodes?
    • What are the reads without barcodes?
    • Are all barcodes represented equally?
    • Are negative controls blank?
  3. Mapping
    • How many reads can be mapped to a reference?
    • What does the mapping quality look like?
    • How much of the genome gets covered by reads?
    • What does the coverage depth distribution look like?
    • What does the theoretical fragment size distribution look like? Contrast to observed fragment size distribution.
    • How many reads do we see per fragment? Are there fragments that absorb most of the reads?
    • Do the reads map within 100bp of the fragment ends?
    • What do the start and end sequences of fragments look like theoretically, and what is actually observed?
  4. SNPs
    • How many SNPs do we see per sample?
    • Do GBS SNP calls agree with SNP chip data / WGS data?
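The theoretical fragment size distribution asked about under "Mapping" can be derived by an in silico digest of the reference. A minimal sketch, assuming the BamHI recognition site GGATCC from the script's default (`theoretical_fragments` and the toy sequence are illustrative, not from the repo):

```python
import re

def theoretical_fragments(seq, site="GGATCC"):
    """In silico digest: distances between successive restriction sites.

    For a size *distribution* the exact cut offset within the
    recognition site barely matters, so site start positions are used.
    """
    starts = [m.start() for m in re.finditer(site, seq.upper())]
    return [b - a for a, b in zip(starts, starts[1:])]

# Toy chromosome with three BamHI sites spaced 10 bp and 20 bp apart.
toy = "A" * 5 + "GGATCC" + "A" * 4 + "GGATCC" + "A" * 14 + "GGATCC" + "A" * 3
print(theoretical_fragments(toy))  # [10, 20]
```

Running this per chromosome over the real reference gives the theoretical distribution to contrast with the observed fragment sizes from the bam files.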
mdavy86 commented 9 years ago

That's good; many of the questions cover more detail than the last meeting minutes did.

We have some code investigating post-alignment QC: fragment size distributions modelled as an exponential decay (where applicable), size selection bias relative to the population of known tag sites, depth distributions, and a REML mixed model analysis of 96 technical samples across 6 genotypes.
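For the exponential-decay check mentioned above, the maximum-likelihood rate for an exponential model is simply the reciprocal of the sample mean, so a fit needs no optimiser. A hedged sketch (the function name and data are illustrative, not the repo's code):

```python
def fit_exponential(sizes):
    """MLE rate parameter for an exponential fit to fragment sizes.

    For an exponential distribution the MLE of the rate is 1/mean,
    so the fitted density is rate * exp(-rate * size).
    """
    mean = sum(sizes) / len(sizes)
    return 1.0 / mean

# Toy fragment sizes in bp; mean is 200, so the fitted rate is 0.005.
sizes = [120, 250, 80, 400, 150]
rate = fit_exponential(sizes)
print(rate)  # 0.005
```

Deviations of the observed size histogram from this fitted curve would then point at size selection bias.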

lranjard commented 9 years ago

Link to fastq_screen, a utility that subsamples reads from fastq files to check for contamination against a configurable set of Bowtie2 genome indexes: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/