caporaso-lab / mockrobiota

A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.
http://mockrobiota.caporasolab.us
BSD 3-Clause "New" or "Revised" License

phenotype data #27

Closed naarkhoo closed 8 years ago

naarkhoo commented 8 years ago

Is there any mock data for which you have measured metabolites or phenotypes, for benchmarking supervised machine learning methods?

nbokulich commented 8 years ago

No, we do not have such mock communities currently.


mdeleeuw commented 8 years ago

Hello,

I see there is work going on to reassess the mockrobiota datasets, so this may be a good time to share a few notes I made over the last two weeks whilst working with the datasets that have no outstanding GitHub issues. I'm available for additional explanations if needed.

Marcel de Leeuw GeneCreek

mock-6/Turnbaugh1

1) Sample Even3 is not found in the single-end read files. It looks like split_libraries_fastq.py bailed out before processing the whole file, even though it reported statistics. Are the data files a (failed) concatenation of two lanes with separate libraries?

File "/gcaws/python/2.7.11/lib/python2.7/site-packages/qiime/parse.py", line 37, in is_casava_v180_or_later
"Non-header line passed as input. Header must start with '@'."

File "/gcaws/python/2.7.11/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: HWI-ST753_83:8:1101:1908:2030#0/1. This may be because you passed an incorrect value for phred_offset.
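Since the second error points at phred_offset, one way to narrow it down is to look at the range of quality characters actually present in the file. The following is only a heuristic sketch: `qual_range` is a hypothetical helper name, and it assumes strict four-line fastq records (no wrapped sequence lines) with quality characters in the printable ASCII range.

```shell
# Hypothetical helper: print the lowest and highest ASCII codes found
# among the quality characters of a gzipped fastq file.
# Assumes four lines per record; every 4th line is taken as a quality line.
qual_range() {
    gzip -dc "$1" | awk '
        BEGIN {
            # Build a character -> ASCII code lookup for printable characters.
            for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
            min = 127; max = 0
        }
        NR % 4 == 0 {
            n = length($0)
            for (i = 1; i <= n; i++) {
                c = ord[substr($0, i, 1)]
                if (c < min) min = c
                if (c > max) max = c
            }
        }
        END { print min, max }'
}
```

As a rough rule of thumb, a minimum below 59 is only consistent with phred-33 encoding, while a minimum of 64 or more suggests phred-64. A file whose records mix both ranges would be consistent with the concatenation suspected above, although this check alone cannot prove it.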

mock-11/L18S-1

1) Reported to the group that the metadata is potentially incorrect: https://github.com/caporaso-lab/mockrobiota/issues/34

2) SeqPrep is not happy with the index read fastq headers, similar to the report at https://groups.google.com/forum/#!msg/qiime-forum/z3DhLeO8ZyA/YVrXojcJml4J We needed to apply the following correction:

zcat s_8_2_sequences.fastq.gz | sed 's|#0/2$|#0/1|g' | gzip -c - > s_8_2_sequences_corrected.fastq.gz
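The sed substitution above is applied to every line, so in principle it could also rewrite a quality line that happens to end in `#0/2`. A slightly more conservative variant, sketched here under the assumption of strict four-line records, restricts the change to header lines. The function name `fix_read_suffix` is made up for illustration.

```shell
# Hypothetical helper: rewrite a trailing "#0/2" to "#0/1" on header lines only.
# $1: input fastq.gz, $2: output fastq.gz
# Assumes four lines per record, so headers are lines 1, 5, 9, ...
fix_read_suffix() {
    gzip -dc "$1" |
        awk 'NR % 4 == 1 { sub(/#0\/2$/, "#0/1") } { print }' |
        gzip -c > "$2"
}
```

Usage would mirror the original command, for example `fix_read_suffix s_8_2_sequences.fastq.gz s_8_2_sequences_corrected.fastq.gz`.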

mock-9/RDBW

1) The forward reads are not in fastq format, although the reverse reads are. The forward reads can be converted with:

process_iseq.py -i RDBW_fwd.txt.gz -o . --barcode_length 12 --barcode_in_header && gzip RDBW_fwd.fastq

2) There is a (resulting?) difference in the fastq headers between forward/barcodes (^@@ILLUMINA) and reverse (^@ILLUMINA). The reverse headers can be brought in line with:

zcat RDBW_rev.txt.gz | awk '/^@ILLUMINA/{print"@"$0;getline;print;print "+";getline;getline;print}' | gzip -c - > RDBW_rev.fastq.gz

3) The forward reads start 55bp upstream of the ITS sequences in Unite. As a consequence, forward reads fail closed-reference picking and are underrepresented in taxonomic assignments for open-reference and de novo picking.

4) The original publication seems to state correctly that the samples are composed of 12 strains, as reflected by the GitHub metadata, whereas the mockrobiota preprint mentions 16.
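To confirm the header mismatch described in item 2 before and after a conversion, the header prefixes of each file can be tallied. This is a minimal sketch: `header_prefix_counts` is a hypothetical name, and the function again assumes four-line records.

```shell
# Hypothetical helper: count header lines starting with "@@", with a single "@",
# or with anything else, in a gzipped fastq file.
# Assumes four lines per record, so headers are lines 1, 5, 9, ...
header_prefix_counts() {
    gzip -dc "$1" | awk '
        NR % 4 == 1 {
            if (/^@@/)     n_double++
            else if (/^@/) n_single++
            else           n_other++
        }
        END { printf "single=%d double=%d other=%d\n", n_single, n_double, n_other }'
}
```

Running it on the forward, barcode, and reverse files should give the same kind of prefix for all three once the headers agree. A non-zero `other` count would suggest the file is not in four-line fastq layout at all.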

On 30 Aug 2016, at 19:20, Greg Caporaso notifications@github.com wrote:

Closed #27 https://github.com/caporaso-lab/mockrobiota/issues/27.


jairideout commented 8 years ago

cc @nbokulich

nbokulich commented 8 years ago

Thank you @mdeleeuw for reporting these issues. We have corrected these files and new raw data links are available. In addition, to respond to your comments on mock-9:

3) Thank you for noting this issue. We link to the raw files in their rawest forms for the most flexible possible use. It is up to users to modify these data to meet the requirements of their software, reference databases, and other needs, which should include primer trimming, etc. We have made a note of this issue and quote your observations on the README.md page for the mock-9 dataset.

4) This mock community contains 12 species but 16 strains. The 16 strains are listed in the “source.tsv” file, as well as in the original publication.

mdeleeuw commented 7 years ago

Thank you for the clarifications. I doubt that trimming 55bp off the 101nt forward reads will make them useful with any pipeline though, so the only option would be to find and use an ITS database containing the missing 55nt upstream of ITS1. I'm therefore using the reverse reads as single-end reads instead. This of course requires changing the fastq headers in the index reads, which is needed anyway because there are two additional issues with the mock-9 index reads:

1) The index reads lack the barcode sequence in the fastq header; these must be added prior to processing with split_libraries_fastq.py (the same issue is found with mock-5).

2) Also, the fake qualities entered for the index reads are phred-33, whereas the sequence qualities are phred-64.

Both problems can be corrected with:

gunzip -c mock-index-read.fastq.gz | paste - - - - | awk '{print $1"#"$2"/2" ; print $2"\n+\nYYYYYYYYYYYY"}' | gzip -c > mock-index-read.corrected.fastq.gz

Note that I am using /2 here to make the index reads fit for use with the reverse reads. For mock-10, it seems a different sequencing primer was used. The command line above needs to be changed to

gunzip -c mock-index-read.fastq.gz | paste - - - - | awk '{print $1"#"$2"/1" ; print $2"\n+\nYYYYYYYYYYYY"}' | gzip -c > mock-index-read.corrected.fastq.gz

in order to prep the index reads for split_libraries_fastq.py of the forward reads.
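The two command lines above differ only in the read number appended to the header, so they can be folded into one parameterised function. The sketch below uses a hypothetical name, `fix_index_reads`, and deviates from the original commands in two labelled ways: it splits fields on tabs only (so a header containing spaces is kept whole), and it writes one `Y` per barcode base instead of a fixed run of 12.

```shell
# Hypothetical helper: add the barcode to each index-read header and replace
# the qualities with "Y" (ASCII 89, i.e. Q25 under phred-64).
# $1: input index-read fastq.gz
# $2: output fastq.gz
# $3: read number to append (2 to pair with reverse reads, 1 for forward reads)
# Assumes four lines per record.
fix_index_reads() {
    gzip -dc "$1" | paste - - - - |
        awk -F '\t' -v mate="$3" '{
            q = $2
            gsub(/./, "Y", q)             # one Y per barcode base
            print $1 "#" $2 "/" mate      # header + barcode + read number
            print $2                      # barcode sequence
            print "+"
            print q                       # phred-64 placeholder qualities
        }' | gzip -c > "$2"
}
```

For example, `fix_index_reads mock-index-read.fastq.gz mock-index-read.corrected.fastq.gz 2` would correspond to the mock-9 command, and passing `1` as the last argument to the mock-10 variant.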