caporaso-lab / mockrobiota

A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.
http://mockrobiota.caporasolab.us
BSD 3-Clause "New" or "Revised" License

Error preprocessing mock 7 and 8 : Failed qual conversion #76

Closed MathieuCharles closed 7 years ago

MathieuCharles commented 7 years ago

As reported in #57, I ran into trouble using mock 7 and 8.

I am using QIIME 1.9.1.


split_libraries_fastq.py -i mock7-forward-read.fastq -o split_libraries_M7 -m mock7_sample-metadata.tsv -b mock7-index-read.fastq.gz --rev_comp_mapping_barcodes

Traceback (most recent call last):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: ILLUMINA_0275:2:1101:1357:1952#ATAGGCGATCNN. This may be because you passed an incorrect value for phred_offset.


split_libraries_fastq.py -i mock8-forward-read.fastq.gz -o split_librariesM8 -m mock8_sample-metadata.tsv -b mock8-index-read.fastq.gz --rev_comp_mapping_barcodes --phred_offset 64

Traceback (most recent call last):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: ILLUMINA_0258:3:1101:1184:1974#NNNNNNNNNNNN. This may be because you passed an incorrect value for phred_offset.

It seems that the Phred quality score is not valid.

The indicated sequence contains the quality character "i", which corresponds to ASCII 105 and falls outside the classical score scale.

Are you seeing the same problem? What can I do to solve it?

A final question: do the sequences still contain primer sequences?

nbokulich commented 7 years ago

Thanks @MathieuCharles for noting this issue.

It looks like "i" is a valid character in the phred64 (old Illumina) scale. The problem is instead the ":" character in the barcode quality scores, which is only valid in phred33.

However, we can ignore the barcode qual scores, since these are artificial. The sequencing run used an old Illumina format where barcodes are provided in the header line without phred scores, in lieu of a separate barcode fastq. Hence, artificial barcode fastq files were generated to support processing in qiime, but these were mistakenly made with phred33 artificial scores.
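The offset arithmetic behind this diagnosis can be checked with a short Python sketch. It is only an illustration: the helper names `phred_scores` and `is_plausible` are hypothetical, and the validity rule used here (every decoded score must be non-negative) is a simplifying assumption, not the check that QIIME or scikit-bio actually performs.

```python
def phred_scores(qual, offset):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def is_plausible(qual, offset):
    # Assumption: a quality string is rejected only if it decodes to a
    # negative score. Real parsers may also enforce an upper bound.
    return all(score >= 0 for score in phred_scores(qual, offset))


# Show each character's ASCII code and its score under both offsets.
for ch in ("i", ":", "K"):
    print(ch, ord(ch), phred_scores(ch, 33)[0], phred_scores(ch, 64)[0])
```

Under offset 64, "i" decodes to 41 and "K" to 11, while ":" decodes to -6, which is why the artificial barcode scores fail with `--phred_offset 64`.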

I am updating the mock-7 and mock-8 raw barcode files, but in the meantime you can fix these files yourself with the following command:

gunzip -c barcodes.fastq.gz | sed 's/::::::::::::/KKKKKKKKKKKK/g' | gzip -c > barcodes2.fastq.gz

The fixed files should work in the example commands that you provided.
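If you prefer not to run a substitution over every line of the file, a rough Python alternative to the sed command is sketched below. It rewrites only the quality line of each record. The function name `fix_barcode_quals` is hypothetical, and the sketch assumes strict four-line FASTQ records (no wrapped sequence or quality lines).

```python
import gzip


def fix_barcode_quals(src_path, dst_path, old=":", new="K"):
    """Copy a gzipped FASTQ, replacing `old` with `new` on quality lines only."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line_number, line in enumerate(src):
            # Assumption: records are exactly 4 lines, so the quality
            # string is every 4th line (0-based index 3, 7, 11, ...).
            if line_number % 4 == 3:
                line = line.replace(old, new)
            dst.write(line)


# Example call (file names taken from the sed command above):
# fix_barcode_quals("barcodes.fastq.gz", "barcodes2.fastq.gz")
```

Unlike the sed one-liner, this leaves header lines untouched even if they happen to contain runs of colons.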

> A final question: do the sequences still contain primer sequences?

No, the primer sequences do not appear to be present in the sequence reads.