dhimmel / fratjuice

Uncovering the microbes of fraternity basements
Creative Commons Zero v1.0 Universal

Extracting species abundance from uBiome FASTQ outputs #9

Open dhimmel opened 6 years ago

dhimmel commented 6 years ago

We now have sequencing data from uBiome as both raw FASTQ sequences and taxonomy-level abundance summaries. We'd like to compute taxonomic abundance from the FASTQ data using an open-source pipeline, as part of this repository. I haven't done anything like this before.

@eliesbik mentioned that the programs mothur (mothur/mothur) or QIIME could be useful. Here's a comparison blog post. Note that QIIME version 1 (qiime/qiime) will soon be replaced by QIIME 2 (qiime2/qiime2). @eliesbik also mentioned that someone familiar with the process at uBiome may be interested in contributing.

There's a ubiome-opensource/microbiome-tools repository by @richardsprague, which contains information on analyzing uBiome data. However, it looks like this project currently focuses on analyzing the taxonomic summaries rather than the raw FASTQ data.

On Biostars, there was a question about processing uBiome FASTQ data, but the answers are inconclusive.

Anyways, we'll use this discussion for coordinating how to process the FASTQ data.

dhimmel commented 6 years ago

For a single sequencing run (ssr_300909.zip in this example), there are multiple FASTQ files:

ssr_300909__R1__L001.fastq.gz
ssr_300909__R1__L002.fastq.gz
ssr_300909__R1__L003.fastq.gz
ssr_300909__R2__L001.fastq.gz
ssr_300909__R2__L003.fastq.gz
ssr_300909__R2__L004.fastq.gz

In ssr_300921.zip there are:

ssr_300921__R1__L002.fastq.gz
ssr_300921__R1__L003.fastq.gz
ssr_300921__R1__L004.fastq.gz
ssr_300921__R2__L001.fastq.gz
ssr_300921__R2__L002.fastq.gz
ssr_300921__R2__L003.fastq.gz
ssr_300921__R2__L004.fastq.gz

So what do the multiple parts of these file names refer to?

My understanding is that R1/R2 refers to paired-end sequencing, such that R1__L002 and R2__L002 contain the respective ends of the same sequences? However, there's not always a complement (not every R1 has a matching R2 for the same lane), which confuses me.

Does L001 through L004 refer to the sequencing lane? Multiple samples are sequenced in the same lane, but the demultiplexing has already been performed, so only sequences for the listed sample are part of each FASTQ file?
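
To make the missing-mate problem concrete, here's a rough Python sketch that parses the file names and reports which lanes lack either R1 or R2. The `run__read__lane` pattern is my assumption, inferred only from the file names listed above rather than from any uBiome documentation, and `unpaired_lanes` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred from the listed files:
# <run>__<R1|R2>__<L00N>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<run>ssr_\d+)__(?P<read>R[12])__(?P<lane>L\d{3})\.fastq\.gz$"
)


def unpaired_lanes(filenames):
    """Map each lane that lacks R1 or R2 to the reads that are present."""
    reads_by_lane = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # skip anything that does not fit the assumed pattern
        reads_by_lane[match["lane"]].add(match["read"])
    return {
        lane: sorted(reads)
        for lane, reads in sorted(reads_by_lane.items())
        if reads != {"R1", "R2"}
    }


files_300909 = [
    "ssr_300909__R1__L001.fastq.gz",
    "ssr_300909__R1__L002.fastq.gz",
    "ssr_300909__R1__L003.fastq.gz",
    "ssr_300909__R2__L001.fastq.gz",
    "ssr_300909__R2__L003.fastq.gz",
    "ssr_300909__R2__L004.fastq.gz",
]

print(unpaired_lanes(files_300909))
```

For ssr_300909, this flags L002 as R1-only and L004 as R2-only, so only L001 and L003 have both ends.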

dhimmel commented 6 years ago

Additional References