Pooled data - Githubissues

jdeck88 commented 6 years ago

Discussion online about how best to accept pooled data. Here is an issue to track comments on pooled data.

The solution for accepting pooled data could be UI changes, changes to existing drop-down boxes, new metadata fields, or simply changes in documentation describing how this should be handled, e.g. adding a new section to the Help doc entitled "How do I handle pooled data?"

jdeck88 commented 6 years ago

Comment From Chris Bird:

All we need to do is accept a demultiplex file that equates DNA sequences to sample ID. We all discussed this a while back, and I remember devising a way to get GeOMe to accept it. Here are the categories that we need: Read1 Barcode, Read2 Barcode, Sample ID. These can be added in a file or built into the metadata. Individuals or individual samples can be linked to the appropriate FASTQ files in the metadata. I think that all of this is possible without mods, but of course it would be better if it’s baked in.

Chris Meyer and I have gone back and forth a bit on accommodating lightly processed metabarcoding data (aligned read1 and read2, demultiplexed). I think the main outcome was that even for the same data type, sometimes different labs (his and my labs) handle the same type of data differently and we may need to add some flexibility. I’m still of the opinion that it can’t get any easier than uploading your raw FASTQs and the demultiplex info.
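For illustration only, the demultiplex file described above (Read1 Barcode, Read2 Barcode, Sample ID) could be read into a lookup table along these lines. This is a minimal sketch, not anything GeOMe implements: the tab-separated layout, the column names `read1_barcode`, `read2_barcode`, and `sample_id`, the barcode sequences, and the function name `load_demux_table` are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical demultiplex ("decode") table. The column names and the
# barcode sequences are invented for illustration.
DEMUX_TSV = (
    "read1_barcode\tread2_barcode\tsample_id\n"
    "ACGTAC\tTTGCAA\tSample_001\n"
    "GGATCC\tTTGCAA\tSample_002\n"
    "ACGTAC\tCCATGG\tSample_003\n"
)


def load_demux_table(handle):
    """Map (read1_barcode, read2_barcode) -> sample_id.

    Raises ValueError if the same barcode pair is listed twice, since
    one pair cannot identify two different samples.
    """
    lookup = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["read1_barcode"], row["read2_barcode"])
        if key in lookup:
            raise ValueError(f"barcode pair {key} is listed more than once")
        lookup[key] = row["sample_id"]
    return lookup


demux = load_demux_table(io.StringIO(DEMUX_TSV))
print(demux[("GGATCC", "TTGCAA")])
```

Keying on the barcode pair rather than on Read1 alone reflects the three categories listed above; a single-barcode protocol would need only one of the two columns.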

mgaither commented 6 years ago

Indeed but that "sample" needs to be described as well. For instance, to properly describe the sample we need to know how many individuals were pooled and the geographic range over which those individuals were collected. Were there 30 individuals collected across the island of Oahu or 30 individuals collected from the same rock?

cbird808 commented 6 years ago

I think I’m not understanding something

mgaither commented 6 years ago

My comment is null and void if each individual is assigned a separate sample ID with its own metadata.

jdeck88 commented 6 years ago

From Michelle Gaither: I think the stickiest issue is the form in which data is uploaded to GeOMe.

Right now GeOMe is requesting demultiplexed fastq files, one for each individual. In the case of pooled ezRAD data that would equate to one fastq file per pool of individuals. Chris has suggested that we have a separate entry (and sample ID) for each individual in the pool, with all those sample IDs pointing to the same fastq file. I'm wondering if we then need to link those sample IDs in a way that ensures users understand the pooled nature of the data, so that if a user downloads the metadata for one individual in the pool, ALL the metadata for that pool is automatically downloaded.

Alternatively, it has been suggested that GeOMe allow for the uploading of raw fastq files. In this case each sample ID would also include barcode information to allow for demultiplexing, with all relevant sample IDs pointing to the same fastq file [this could be the case for traditional (individually barcoded) RADSeq as well], but this gets quite complex for pools of pools.
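As a rough sketch of the linking idea above (several sample IDs pointing to the same fastq file, with a request for one individual returning the metadata for the whole pool), something like the following could work. The record layout, the field names `sample_id`, `fastq`, and `locality`, the file names, and the helper `pool_members` are hypothetical and are not part of GeOMe.

```python
from collections import defaultdict

# Hypothetical sample records: A1 and A2 share one fastq file (a pool),
# while B1 has a fastq file of its own.
records = [
    {"sample_id": "A1", "fastq": "pool1.fastq.gz", "locality": "Oahu, site 1"},
    {"sample_id": "A2", "fastq": "pool1.fastq.gz", "locality": "Oahu, site 2"},
    {"sample_id": "B1", "fastq": "indiv_B1.fastq.gz", "locality": "Maui"},
]


def pool_members(records, sample_id):
    """Return every record that shares a fastq file with `sample_id`.

    For an unpooled sample this is a list holding just that one record.
    """
    by_fastq = defaultdict(list)
    fastq_of = {}
    for rec in records:
        by_fastq[rec["fastq"]].append(rec)
        fastq_of[rec["sample_id"]] = rec["fastq"]
    return by_fastq[fastq_of[sample_id]]


for rec in pool_members(records, "A1"):
    print(rec["sample_id"], rec["locality"])
```

Grouping on the shared file name means no separate pool identifier is needed, although an explicit pool ID field would make the relationship more visible to users.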

cbird808 commented 6 years ago

Pools of pools would be uncommon for rad

mgaither commented 6 years ago

I sequence 96 individuals in a lane. With ezRAD I'm guessing you pool several "pools"... that's what I mean by pools of pools. The most unmolested fastq file would then be from a single lane of sequencing and thus a pool of "pools".

cbird808 commented 6 years ago

With ezrad, there are no barcodes. So every pool has its own fastq or pair of fastq

For ddrad, we have barcoded individuals also. That's not a pool though.

I suppose that somebody out there has barcoded pools in ddRAD, I'll ruminate on that

mgaither commented 6 years ago

"With ezrad, there are no barcodes. So every pool has its own fastq or pair of fastq"

but you said earlier

"I’m still of the opinion that it can’t get any easier than uploading your raw FASTQs and the demultiplex info."

I call that a pool of pools that hasn't been demultiplexed. I assume when you say raw FASTQs you mean straight off the Illumina with little processing and no demultiplexing.

Personally I'm in favor of demultiplexed files that represent either an individual or a pool of individuals, as in ezRAD.

cbird808 commented 6 years ago

We can work this out on Monday.

But I can't help myself 😁

Ezrad data is not demultiplexed, except by the sequencer.

Ddrad data does need demultiplexing.

I support accepting the raw fastq files. Indeed, it can't get easier than that.

Thus, I envision each of the individuals barcoded in a pair of fastq files to have their own Geome entry. Each of these entries point to the pair of raw fastq files and a decode (demultiplex) file.

I'm wary of accepting demultiplexed data because it involves decisions that won't be the same from person to person. It often involves some sort of trimming, quality filtering, or merging of read pairs (pearing), all of which remove data.

mgaither commented 6 years ago

I'm catching your vibe

"Ezrad data is not demultiplexed, except by the sequencer": the pools are distinguished by Illumina indices only (not user-designed barcodes), which are parsed by the Illumina software, so the "raw fastq" is at the level of the pool.

"Ddrad data does need demultiplexing": yes. Each Illumina index will also contain individuals with barcodes.

"Thus, I envision each of the individuals barcoded in a pair of fastq files to have their own Geome entry. Each of these entries point to the pair of raw fastq files and a decode (demultiplex) file."

Yep, your point is well taken... we will discuss at the meeting.

jdeck88 commented 6 years ago

After a call on Tuesday Oct. 23rd we decided to handle pooled data by adopting count values in the dwc:individualCount field. Any dwc:individualCount > 1 for a single materialSample is pooled. This way, we can handle any pooled data samples. This comment is, for now, closing this particular issue which was just directed towards coming up with a short-term pooled data strategy.
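The rule above (any dwc:individualCount > 1 for a single materialSample means the sample is pooled) can be written as a small check. This is only a sketch under assumptions: the dictionary layout, the key names `materialSampleID` and `individualCount`, and the function name `is_pooled` are chosen for illustration, and a missing or blank count is treated here as "not pooled", which the thread does not specify.

```python
def is_pooled(record):
    """Return True when the record's individualCount is greater than 1.

    A missing or blank count is treated as not pooled; that choice is
    an assumption of this sketch.
    """
    count = record.get("individualCount")
    if count is None or str(count).strip() == "":
        return False
    return int(count) > 1


# Hypothetical sample records.
samples = [
    {"materialSampleID": "S1", "individualCount": 1},
    {"materialSampleID": "S2", "individualCount": "30"},
    {"materialSampleID": "S3"},
]

pooled_ids = [s["materialSampleID"] for s in samples if is_pooled(s)]
print(pooled_ids)
```

Note that the count alone does not capture the geographic range over which the pooled individuals were collected, which was raised earlier in the thread.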

jdeck88 commented 5 years ago

How to deal with pooled-seq data has come up again, and we realized that the synopsis here did not adequately clarify how to represent pooled-seq data. I'm re-opening this thread to come up with a more satisfactory conclusion, in particular guidance on pooled-seq data for our FAQ document at:

https://docs.google.com/document/d/1tEFpclCyJ6aLnypmtdfdjLVhiWQ-rYhGqu5eGhq3s5s/edit#heading=h.9jdy98irwwtj

I'll ping Chris Bird and Michelle Gaither on this again since they were both on original thread.

biocodellc / geome-db

Pooled data #9