caporaso-lab / mockrobiota

A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.
http://mockrobiota.caporasolab.us
BSD 3-Clause "New" or "Revised" License

Demultiplexing Reverse Reads #78

Closed nearinj closed 6 years ago

nearinj commented 6 years ago

I'm working with a few mock communities (mock-8 and mock-9) and was easily able to demultiplex the forward reads using

split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz --rev_comp_mapping_barcodes

I was unable to demultiplex the reverse reads using this command. I have looked around for instructions on how to do this but I haven't been able to find anything.

Do the reverse reads have the same barcodes? Or is there a separate index file that I am missing somewhere?

nbokulich commented 6 years ago

Hi @nearinj, thank you for reporting this issue!

The same index files should be used for both forward and reverse reads. I have mostly used only forward reads with these datasets, so I am unfamiliar with what could be causing this issue.

Are you receiving an error message when you attempt to demultiplex, or are you just receiving empty outputs? Would you mind posting your error messages and/or outputs here so I can debug from there? Thanks!
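
As an illustration of why one index file can serve both read directions, here is a minimal sketch. It assumes the usual three-file layout, where the i-th index record describes the i-th record of whichever read file it is paired with. The barcodes, sample names, and reads below are hypothetical (not taken from the mock-8 mapping file), the function names are made up for this example, and the lookup is a plain exact match rather than QIIME's own implementation.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def assign_samples(index_reads, reads, barcode_to_sample,
                   rev_comp_mapping_barcodes=False):
    """Pair each read with its index read by position and look up the sample.

    Reads whose barcode is not in the mapping are dropped, loosely mirroring
    the "Barcode not in mapping file" category (exact matches only here).
    """
    if rev_comp_mapping_barcodes:
        # Reverse-complement the mapping-file barcodes before matching.
        barcode_to_sample = {
            reverse_complement(bc): sample
            for bc, sample in barcode_to_sample.items()
        }
    assigned = []
    for barcode, read in zip(index_reads, reads):
        sample = barcode_to_sample.get(barcode)
        if sample is not None:
            assigned.append((sample, read))
    return assigned


# Hypothetical data: one index file, two read files in the same record order.
mapping = {"AACC": "SampleA", "GGTA": "SampleB"}
index_reads = ["GGTT", "TACC", "CCCC"]
forward_reads = ["ACGTAC", "TTGACA", "GGGCCC"]
reverse_reads = ["GTACGT", "TGTCAA", "GGGCCC"]

fwd = assign_samples(index_reads, forward_reads, mapping, True)
rev = assign_samples(index_reads, reverse_reads, mapping, True)
print(fwd)
print(rev)
```

In this sketch both calls assign the first two records to the same samples and drop the third, because the assignment depends only on the index reads and the record order, not on which read file is supplied.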

nearinj commented 6 years ago

Hey @nbokulich thanks for the reply. I did not receive any errors when I ran the command. The exact command I ran for the mock-8 community was:

split_libraries_fastq.py -i mock-reverse-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz --rev_comp_mapping_barcodes

The log file output is as follows:

Input file paths
Mapping filepath: sample-metadata.tsv (md5: 17f0900e2b4d1c102549cefa7afbd7bd)
Sequence read filepath: mock-reverse-read.fastq.gz (md5: 2a1f2128790efd62b428779337de8bd1)
Barcode read filepath: mock-index-read.fastq.gz (md5: 076354045394e815a1e097b79d8957cf)

Quality filter results
Total number of input sequences: 78490160
Barcode not in mapping file: 73720706
Read too short after quality truncation: 335101
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 4434353

Result summary (after quality filtering)
Median sequence length: nan
Even3 0
Even2 0
Even1 0

Total number seqs written 0
---

The rest of the output files are blank. Thanks for the help, I really appreciate it.
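
For readers following along, the counts in the log can be tallied to see where the reads went. The short sketch below only re-adds the numbers reported in the log; the variable names are illustrative. The discard categories sum exactly to the total number of input sequences, which is consistent with zero sequences being written.

```python
# Counts copied from the split_libraries_fastq.py log posted in this thread.
total_input = 78490160
discarded = {
    "Barcode not in mapping file": 73720706,
    "Read too short after quality truncation": 335101,
    "Count of N characters exceeds limit": 0,
    "Illumina quality digit = 0": 0,
    "Barcode errors exceed max": 4434353,
}

# Every input read is either discarded for one reason or written out.
written = total_input - sum(discarded.values())

for reason, count in discarded.items():
    print(f"{reason}: {count} ({count / total_input:.2%} of input)")
print(f"Sequences written: {written}")
```

The "Read too short after quality truncation" category is the only one that the quality parameters of the script can directly recover reads from, since those reads already matched a barcode in the mapping file.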

nbokulich commented 6 years ago

Hi @nearinj, thanks again for reporting this issue. The problem is probably just that the high-quality portion of the reverse reads is relatively short (i.e., quality drops off partway through the reverse reads, rendering the remainder low-quality). The key clue is the "Read too short after quality truncation" count in your log: 335101 sequences are demultiplexed but are too short to pass the default quality filter (by default, 75% of the total read length must be high-quality sequence). You can adjust the -p parameter in split_libraries_fastq.py to fix this; e.g., try running:

split_libraries_fastq.py \
    -i mock-reverse-read.fastq.gz \
    -o split_libraries \
    -m sample-metadata.tsv \
    -b mock-index-read.fastq.gz \
    --rev_comp_mapping_barcodes \
    -p 0.5

You can also fiddle with the -q and -r parameters if you want to try to squeeze out more sequences; see here for more details.
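
To make the interaction of -p, -q, and -r more concrete, here is a simplified sketch of this kind of filter. It is an approximation written for illustration, not QIIME's code: it assumes a base call counts as low quality when its Phred score is at or below q, that a read is truncated at the start of the first run of more than r consecutive low-quality calls, and that the read is kept only if the remaining length is at least p times the original length. The function names and the example read are hypothetical.

```python
def high_quality_prefix_length(quals, q=3, r=3):
    """Length kept after truncating at the first run of more than r
    consecutive base calls with Phred score <= q (assumed semantics)."""
    run = 0
    for i, score in enumerate(quals):
        if score <= q:
            run += 1
            if run > r:
                # Truncate at the position where the bad run started.
                return i - run + 1
        else:
            run = 0
    return len(quals)


def passes_length_filter(quals, p=0.75, q=3, r=3):
    """Keep the read if the truncated length is at least p * original length."""
    kept = high_quality_prefix_length(quals, q=q, r=r)
    return kept >= p * len(quals)


# Hypothetical 100-base reverse read: 60 good calls, then quality collapses.
read_quals = [30] * 60 + [2] * 40

kept = high_quality_prefix_length(read_quals)
default_ok = passes_length_filter(read_quals)          # p = 0.75
relaxed_ok = passes_length_filter(read_quals, p=0.5)   # p = 0.5
print(kept, default_ok, relaxed_ok)
```

Under these assumptions the example read is truncated to 60 bases, which fails the 0.75 threshold but passes at 0.5, the same effect that lowering -p has in the command above.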

I must admit I have not worked with the reverse reads for most mockrobiota datasets; for many of these, the forward/reverse reads are too short to merge and the forward reads are generally high quality. Others have reported quality issues with some reverse reads in mockrobiota, e.g., here, so I think this may be related to that issue. The reverse reads are lower quality than we would like for some datasets, unfortunately!

Please let me know if that solves your problem. I do not think this is an issue with the mockrobiota data itself, but I will leave this issue open for now in case you continue to have problems even after adjusting the quality parameters for split_libraries_fastq.py. Thanks!

nearinj commented 6 years ago

After looking into it and adjusting the quality parameters, it did work. However, as you mentioned, the reverse reads in a lot of datasets are not very useful.

Are the 100% expected sequences for the mock communities posted anywhere? I only see 99% OTUs for GG and SILVA, but I would like to know the exact sequences expected in mock-8 and mock-9 (if they are known, that is).

nbokulich commented 6 years ago

Thanks for confirming! I will close this issue.

Are the 100% expected sequences for the mock communities posted anywhere?

Yes! The expected sequences (if they exist for a given dataset) are located in the source directory, e.g., here are the expected sequences for mock-9. There are none for mock-8; these files exist only if they were provided by the contributor of that dataset.