biocore / qiime

Official QIIME 1 software repository. QIIME 2 (https://qiime2.org) has succeeded QIIME 1 as of January 2018.
GNU General Public License v2.0

split_libraries_fastq.py issues and requests #531

Closed pturnbaugh closed 10 years ago

pturnbaugh commented 11 years ago
  1. The quality score threshold is stated incorrectly in the help file. It is not the minimum acceptable but actually the maximum unacceptable score. The script only accepts scores > the cutoff, not greater than or equal to.
  2. Where is the option for a barcode in the header? This is the format here and that used in previous generations.
  3. Why can't we merge paired ends anymore?
gregcaporaso commented 11 years ago

Hi @pturnbaugh,

The quality score threshold is stated incorrectly in the help file. It is not the minimum acceptable but actually the maximum unacceptable score. The script only accepts scores > the cutoff, not greater than or equal to.

Thanks for pointing this out - I'll check into it.

Update: fixed in #1245. (merged)
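
To make the off-by-one concrete, here is a minimal Python sketch (not QIIME's actual implementation) that decodes a quality string and counts the leading bases that pass under each reading of the cutoff. The function names and the Phred+33 offset are assumptions made for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    The Phred+33 offset is an assumption; older Illumina files used +64.
    """
    return [ord(ch) - offset for ch in quality]


def leading_passing_bases(quality, cutoff, inclusive=False):
    """Count leading bases whose score passes the cutoff.

    inclusive=False: pass when score > cutoff (cutoff = maximum unacceptable score).
    inclusive=True:  pass when score >= cutoff (cutoff = minimum acceptable score).
    """
    count = 0
    for score in phred_scores(quality):
        passed = score >= cutoff if inclusive else score > cutoff
        if not passed:
            break
        count += 1
    return count


# Under Phred+33, 'I' decodes to Q40, '5' to Q20, and '#' to Q2.
quality = "II5I#"
strict = leading_passing_bases(quality, cutoff=20)
lenient = leading_passing_bases(quality, cutoff=20, inclusive=True)
```

Under the strict reading, the Q20 base fails a cutoff of 20, so asking for "Q20 and better" means passing 19 as the cutoff.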

Where is the option for a barcode in the header? This is the format here and that used in previous generations.

Do you have barcodes in the headers of your fastq files? If so, could you post ~10 records worth of data here? We've never supported that format in fastq, but have in some of the other formats (e.g., see here).

Update: fixed in #1236.

Why can't we merge paired ends anymore?

This has never been supported in QIIME, but support likely will be added in QIIME 1.6.0-dev. You can find some discussion of how to do this here and here.

Update: fixed in #1216. (merged)

pturnbaugh commented 11 years ago

Sure, here's an example of our latest output format:

@ILLUMINA:371:D1G7GACXX:5:1101:1471:2066 1:N:0:GCCAATATCTCG
GTTAGTTTGAGACATTGAGATTGTAGGCCAATGATGTTTCAGGATGGGGAGAGTGTATGGTAAATTGTATCGGTGGAGTTGCCAGAACCAGAAGTGGTTT
+
B@BFFFFFGHHHHJJIGHIJJIJHGJJJJJIJJIJIFGIIIIJFGGIJJDG@F8BFBGIJ=FEHJEIIGHHIHCDDDB;AC;ACD=;ABBDBC?>CC3<9
@ILLUMINA:371:D1G7GACXX:5:1101:1426:2075 1:N:0:CAGATCATCCCG
AAAGAGCTTGGAAATCCAATGGGGTATCTATTGTTTTTGAAAAATGTACATAGTTTTCATGCCCAGCCTGCAAGCGTCCAACTTGACCGATCACCGCCAG
+
@CCFFFFFHHHHHJJJJJEGIJJJ<CGHIJJJJFGIIHFGIGHIIJDGIJJJJB@FHGIIIIJIIIIGEEG9CHEBDDDBEEEDCCDDDBBBBCD@<B<B
@ILLUMINA:371:D1G7GACXX:5:1101:1326:2085 1:N:0:TNNNNNNNNNNN
GAAAGCCAGCGAGGCCTGGAAGGTGAAACGGGAGACGACGTTGACGAAGGTGACGAGCACGATCNCCATGANNGAGNCGNNNNCCACCNANTCTTCGAAC
+
CCCFFFFFHGHGDIGHIEGHDIICDHIIGIJJEHEIIGGHACHEEDFFDC>AACBBD6=B########################################
@ILLUMINA:371:D1G7GACXX:5:1101:1459:2096 1:N:0:TGACCAATCTCG
CTCGAGAGAAACCACGGTCACGTGACCACAAACAGATTGCATTCAAACTTGGACTTGAGGAACATACGCCATTGTACGAGTACAAGGTAGGATATCTCAC
+
CCCFFFFFGHGHGGGG>HEGIJEGGIGIJIIJIEEHIAGHGGGHIGIJJJJJIIIJIJJJJHHHHHGFFBBCACCCDDDD??CCACD>@CBCCCDC>ACD
@ILLUMINA:371:D1G7GACXX:5:1101:1353:2100 1:N:0:TTGGGNATGTAG
CTTATTGTTTCTCTGCAATGGGCTTCTACGCGATTTGACGCCTTGGATGGCGGCGCTCCAGGAGCATCTACGGACATCCTCGGAGTTGCCTCGATCTTCG
+
=?@DDD>2AFACFIBG@?E?E@@FGHCDFGHB:?DGGGI0DF:BA3==F@EB@@B8=C?02<?C@:4@@()5@A?>@33092(4::@@BBCBBC38
@ILLUMINA:371:D1G7GACXX:5:1101:1385:2104 1:N:0:TGACCAATCTCG
GATCGCTTGGGAGATCCGCCACAGCAGATCTTCGCGCAGATTGGGCCGCAGCCGAGCCCAAGAGGTTTCCTCGAAGGCGGTGCGGGCCGCTTTGACCGCA
+
CBCFFFFFGHHGHJJJJJJJJJJJIIJGIBGGIJJJIJJEHJBHIEIHDDDDDDDDBDDDDDDDD4<@CCCCDB<?9ABDBBDDDDDDD>>BDACABB9
@ILLUMINA:371:D1G7GACXX:5:1101:1410:2104 1:N:0:CGATGTATCTCG
GGCTCCAAAAGAACTTGAATCGTACACGACGATTGAGGGAGAAGGACCGAAGGTCAAAGAGGGTCAGAAAGTAGCGGTCCAGTATTCGGGATGGCTGTGG
+
@C@FDDFDHGHHHIIIJGIGGI@GHIHIIGIFGH=?FGIHIIGBHDGIJEGBB@CAEDECDDD??ACCCCDADDDBB@BBDDCDC@ACDDBB0<CCBBCD
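
For readers unfamiliar with this layout, the barcode sits in the last colon-separated field after the space in each header. The following Python sketch pulls the header apart; the field names follow the common CASAVA 1.8-style interpretation (read number, filter flag, control number, index sequence), which is an assumption here, and `IlluminaHeader` and `parse_header` are hypothetical names rather than QIIME functions.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    read_id: str         # instrument:run:flowcell:lane:tile:x:y
    read_number: str     # assumed meaning: 1 for forward, 2 for reverse
    is_filtered: str     # assumed meaning: Y if the read failed the filter, else N
    control_number: str  # assumed meaning: 0 when no control bits are set
    barcode: str         # index sequence reported by the sequencer


def parse_header(header_line):
    """Split '@<read id> <read>:<filtered>:<control>:<barcode>' into fields."""
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    read_id, _, description = header_line[1:].strip().partition(" ")
    fields = description.split(":")
    if len(fields) != 4:
        raise ValueError("expected 4 colon-separated fields after the space")
    return IlluminaHeader(read_id, *fields)


header = parse_header("@ILLUMINA:371:D1G7GACXX:5:1101:1471:2066 1:N:0:GCCAATATCTCG")
```

Note that a barcode such as `TNNNNNNNNNNN` parses without error; deciding whether to discard such reads is left to the demultiplexing step.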

pturnbaugh commented 11 years ago

I'll look into process_iseq.py. We were previously using split_libraries_illumina.py with the "--barcode_in_header" flag.

As for the paired ends, I didn't actually mean assembling them. In split_libraries_illumina.py (QIIME 1.3.0) you could input the forward and reverse reads using the "-5" and "-3" flags, and the result would be the two reads of each pair joined together. Will this be added back to "split_libraries_fastq.py" in QIIME 1.6?

gregcaporaso commented 11 years ago

I generally run split_libraries_fastq.py twice: once for the 5' reads and once for the 3' reads, and you'll get an output fasta file for each. In the future we'll add stitching of overlapping reads together.

pturnbaugh commented 11 years ago

Okay, process_iseq.py works, but it would require us to convert from fastq -> iseq -> fastq. Would it be possible to add support for our fastq format (above)? According to our core facility, it is the latest Illumina default format.

gregcaporaso commented 11 years ago

We can work on getting that in place, but it definitely won't be in the QIIME 1.6.0 release (we're too close now). Note that the sequencing center should be able to output the barcodes in a separate file - the sequencing centers that we work with are doing that.

the latest Illumina default format

If they would only stop changing their default!!!

pturnbaugh commented 11 years ago

Ok, sounds good!

douginator2000 commented 11 years ago

Definitely interested in this topic as well. I'm a developer with the Qiime website and we're starting to see a number of barcoded fastq files come through. Currently I'm attempting to strip the barcodes into a separate file and run the two files as a pair.

gregcaporaso commented 11 years ago

@douginator2000, is the format that you're seeing the same as the one that @pturnbaugh pasted above? If you're seeing others as well can you paste an example of ~3 fastq records on this thread? Thanks!

gregcaporaso commented 11 years ago

One possible way to handle this, if we are seeing the barcodes in different places in the fastq (header vs. beginning of reads), would be to have, e.g., a process_barcoded_fastq.py script (I believe @walterst has been working on something like this, which isn't in QIIME) that generates the paired barcode/read fastq files that split_libraries_fastq.py expects. This would be similar to the current process_iseq.py and process_qseq.py, which is how we've handled the different formats in the past.

If the barcodes are currently always in the same place and likely to stay that way (which seems unlikely given Illumina's history of lack of standards in data delivery) then this is over-complicated. But if they're likely to continue to move around then this is a much better solution relative to integrating all of this in the split_libraries_fastq.py interface as there is great benefit in keeping that interface simple and understandable.
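
A separate conversion step of the kind described above could look roughly like the following Python sketch. It is not the script under discussion: `read_fastq` and `split_header_barcodes` are hypothetical names, the input is assumed to be strict four-line FASTQ with the barcode as the last colon-separated header field, and because the header carries no barcode qualities, a placeholder quality character is written for the barcode records.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def split_header_barcodes(reads_in, reads_out, barcodes_out, placeholder_quality="I"):
    """Write matching read and barcode FASTQ records for header-barcoded input."""
    for header, sequence, quality in read_fastq(reads_in):
        # Assumption: the barcode is everything after the last ':' in the header.
        barcode = header.rsplit(":", 1)[-1]
        reads_out.write(f"{header}\n{sequence}\n+\n{quality}\n")
        # No real barcode qualities exist in this format, so a placeholder is used.
        barcodes_out.write(
            f"{header}\n{barcode}\n+\n{placeholder_quality * len(barcode)}\n"
        )


# Tiny synthetic input (made-up reads, not taken from the data posted above).
source = io.StringIO(
    "@r1 1:N:0:ACGT\nTTGACC\n+\nIIIIII\n"
    "@r2 1:N:0:GGCA\nCCATGA\n+\nHHHHHH\n"
)
reads, barcodes = io.StringIO(), io.StringIO()
split_header_barcodes(source, reads, barcodes)
```

Keeping this in its own step leaves the demultiplexing interface unchanged: if the barcode location moves again, only the converter needs to change.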

bstamps commented 11 years ago

For our part, we have no plan to move our barcodes. Having them integrated into the front of the read, à la 454, has served us well; the only changes have been to the barcode length, plus some experiments to deal with the diversity problem in amplicon sequencing on the Illumina platform. Is this still in process for the 1.6.0-dev cycle, or are we looking at 1.7.0? We are still in favor of something that can parse the first n bp of a read (according to the mapping file with barcode and linkerprimer) and split the libraries accordingly.
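
As a rough illustration of parsing the first n bp of a read, the Python sketch below slices off an inline barcode, checks for the primer that follows it, and looks the barcode up in a mapping. Everything here is an assumption for illustration: the function names are hypothetical, the primer match is exact (no degenerate IUPAC bases or mismatches), and the dictionary stands in for a parsed mapping file.

```python
def split_inline_barcode(sequence, quality, barcode_length, primer=""):
    """Return (barcode, trimmed_sequence, trimmed_quality), or None if unusable.

    The barcode is taken as the first `barcode_length` bases; the read is
    rejected when it is too short or when the primer does not follow the barcode.
    """
    barcode = sequence[:barcode_length]
    if len(barcode) < barcode_length:
        return None
    if not sequence[barcode_length:].startswith(primer):
        return None
    start = barcode_length + len(primer)
    return barcode, sequence[start:], quality[start:]


def assign_sample(barcode, barcode_to_sample):
    """Look up the sample for a barcode; returns None for unknown barcodes."""
    return barcode_to_sample.get(barcode)


# Made-up example: 4 bp barcode, 7 bp primer, 5 bp of remaining read.
barcode_to_sample = {"ACGT": "sample1"}
result = split_inline_barcode(
    "ACGTGTGCCAGTTTGA", "IIIIHHHHHHHGGGGG", barcode_length=4, primer="GTGCCAG"
)
sample = assign_sample(result[0], barcode_to_sample)
```

A real implementation would also need to tolerate barcode errors and variable-length barcodes, which this sketch does not attempt.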

gregcaporaso commented 11 years ago

We have to move this to the 1.7.0-dev cycle, but this is high priority.