I think you are exactly correct in your analysis of the situation Koray. Is your idea to have an audit that warns us that we have experiments that appear to be replicates based on shared attributes and then query the submitter as to whether there should be either experiment_relationships or experiment_sets created using the experiments in question? This would be in place of trying to create automatically generated experiment sets of replicates that we discussed the other day (at least for replicates). I think this makes sense.
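To make the idea concrete, here is a rough sketch of what such an audit might look like. All field names here are illustrative, not the actual fourfront schema:

```python
from itertools import combinations

# Hypothetical attribute names; the real fourfront schema properties may differ.
SHARED_ATTRIBUTES = ("biosample", "digestion_enzyme", "experiment_type")

def possible_replicate_warnings(experiments):
    """Warn when two experiments share all key attributes but declare no
    relationship and belong to no common experiment set."""
    warnings = []
    for exp_a, exp_b in combinations(experiments, 2):
        same_attrs = all(exp_a.get(f) == exp_b.get(f) for f in SHARED_ATTRIBUTES)
        related = exp_b.get("accession") in exp_a.get("experiment_relationships", [])
        shared_set = set(exp_a.get("experiment_sets", [])) & set(exp_b.get("experiment_sets", []))
        if same_attrs and not related and not shared_set:
            warnings.append(
                f"{exp_a.get('accession')} and {exp_b.get('accession')} look like "
                "replicates; ask the submitter whether an experiment_relationship "
                "or experiment_set should be created."
            )
    return warnings
```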
A useful way to keep track of some of this stuff as we work to get the Rao et al. data in shape as it gets loaded would be the 'notes' field.
From Rao et al 2014: In this paper, we refer to both “technical replicates” and “biological replicates.” Two Hi-C libraries are “technical replicates” if the cells were crosslinked together and identical Hi-C protocols were applied to two aliquots. Two samples are “biological replicates” if the cells were not crosslinked together; more specifically, the underlying cell populations were different due to additional passaging.
From an analysis standpoint, technical replicates (usually different sequencing runs from the same sample) are most likely merged at some point (either before or after alignment), but biological replicates are not. It would be good to have that distinction when the data is submitted.
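To illustrate the analysis-side consequence (file names and the `bio_rep` tag below are made up for the example):

```python
from collections import defaultdict

# Hypothetical records: each fastq is tagged with the biological replicate
# it came from; technical replicates share a bio_rep id.
fastqs = [
    {"file": "run1.fastq", "bio_rep": 1},
    {"file": "run2.fastq", "bio_rep": 1},  # technical replicate of run1
    {"file": "run3.fastq", "bio_rep": 2},  # separate biological replicate
]

merge_groups = defaultdict(list)
for f in fastqs:
    merge_groups[f["bio_rep"]].append(f["file"])

# Technical replicates end up in the same merge group; biological
# replicates stay separate and are never merged.
print(dict(merge_groups))  # {1: ['run1.fastq', 'run2.fastq'], 2: ['run3.fastq']}
```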
So in this situation the only thing that would possibly be different (but probably not in fact) in our current schema would be biosample.protocol_cell_culture.starting_amount. So do they consider this to be 2 different libraries or experiments? Using the rules that we have come up with, this would ideally be a single experiment, with the files generated from both aliquots being added to an associated 'technical replicate' FileSet. I guess the question is: did they present specific results for each of the replicates (or would someone)? If so, they need to be made into separate experiments and then we are back to the situation of having to have 'technical replicate' ExperimentSets.
Our rules from yesterday would say that since there are 2 different libraries they are different experiments; since 2 versions of the library were generated, they would be 2 biological replicates. They make a different definition: the same biosample with different cell preparations would be biological replicates, while the same biosample, the same cell prep, and the same protocol with separate executions (2 aliquots) would be technical replicates. Looking at their data it is not possible to distinguish 1) whether there was 1 library generated and used multiple times in the sequencer, or 2) whether separate libraries were generated from different aliquots and used in the sequencer. For them both cases are technical replicates; according to the rules from yesterday, 1) would be technical replicates and 2) would be biological replicates.
So basically, there are three types of replicates:
- Same biosource (everyone calls this biological)
- Same biosample (Rao et al calls it technical; what does encode do?)
- Same library (encode calls this technical)
Koray, can you query the encode database to see if you can find different use cases?
One possible choice:
- Same biosource: biological
- Same biosample: technical
- Same library: don't even call them replicates; they are just different files of the same experiment.
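If we settle on this, it could translate into something like the following schema fragment (written here as a Python dict; the property name and enum values are illustrative only, not the actual fourfront schema):

```python
# Illustrative only; actual fourfront property names and values may differ.
replicate_vocabulary = {
    "relationship_type": {
        "type": "string",
        "enum": [
            "biological replicate of",  # same biosource, different biosample
            "technical replicate of",   # same biosample, different library
            # same library: no replicate relation at all; the files are just
            # different files of the same experiment
        ],
    }
}
```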
Or ignore my suggestion to query the database if it is too much work. We can just show the encode wranglers our nomenclature and ask for their input.
Same library: The files may have different quality metrics and some of them may be dropped. We could call them something like sequencing replicates... (though I think 'technical replicates' is a common term for these things).
File quality metrics are different from experiment quality metrics, so Soo's point may not present a problem. I still have a comment on the schema doc that we have not resolved yet, which asks whether we need experiment quality metrics at all or whether metrics about an experiment can all fall under the 'summary statistics' schema(s).
If they're not treated as replicates, how are they treated? Would they be considered the same experiment? I guess that would be fine as long as multiple different files / workflow qc results can be associated with a single experiment.
I think we can carry the file quality metrics vs. summary statistics question to a new issue.
@burakalver The same biosource with different biosamples has 2 versions: 1) where there are additions like modifications or treatments, and 2) where the only difference is the preparation date (let's call these biosample batches). So the same biosource should not dictate anything by itself.
If two experiments have 2 biosample batches or identical biosamples:
1) If they use different restriction enzymes, or differ in the Hi-C related fields like ligation volume, they cannot be replicates anymore. So having different batches of biosamples also does not dictate anything. If they used the same biosample but differed in the protocols, they are again different experiments with no replicate relations. In this case we could see them under the biosample's object page, under an "Experiments using this biosample" section (if we create such a section in biosamples).
2) If all the other fields in the protocol are the same (except the dates), then these are candidates for being biological replicates. We can audit this after submission and ask the user, or flag it as an issue for the wranglers. I would not automate this; there can be cases where the differing parameter is not recorded by us at the moment of submission. These should still be flagged and taken care of by us (contact the submitter to see why he/she thinks they are not biological replicates).
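A minimal sketch of the check described in 2), assuming we can enumerate the date fields to ignore (the field names are hypothetical):

```python
import copy

# Hypothetical names for the date fields we would ignore in the comparison.
DATE_FIELDS = ("date_created", "submission_date", "culture_start_date")

def strip_dates(experiment):
    exp = copy.deepcopy(experiment)
    for field in DATE_FIELDS:
        exp.pop(field, None)
    return exp

def is_replicate_candidate(exp_a, exp_b):
    """True if the two experiments are identical except for dates.
    This only flags candidates for the wranglers; it never creates the
    relation automatically, since the real difference may simply not be
    recorded in our metadata."""
    return strip_dates(exp_a) == strip_dates(exp_b)
```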
Soo: ENCODE considers even biological replicates to be a part of the same experiment. I prefer to draw the line at same library.
Koray: I agree, exactly.
Andy: summary statistics vs. QC metrics. You are probably right; the line is grey in some cases, and it might be better to put everything in summary stats.
On file quality metrics vs experiment quality metrics: we can run fastqc on each file, and for everything else merge all sequencing/technical replicates. So this distinction should be easy to manage.
Thinking some more, it could be interesting to double check that some summary stats (e.g. cis/trans, convergent/divergent vs distance) don't have a file dependence, which would basically validate that the different files are indeed the same library. In any case, the distinction between what QC we ran on individual files vs. merged files will be well defined.
After the discussions last week, I think this issue (replicates) is now resolved. I will close this issue when we implement the changes in the schemas and the examples. You might want to carry the QC discussion to another issue if it is not concluded.
Koray. Sounds good. What changes are needed in the schemas to resolve this? Is it more than changing the enums in the ExperimentSet types and Experiment relation type?
It will be the enums, adjusting the Rao et al. data, and also future audits.
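For concreteness, the enum changes might look roughly like this (the values are illustrative; the final lists would come out of this thread):

```python
# Illustrative only; the actual enum values and property names are to be
# settled when the schema changes are implemented.
experiment_set_types = ["biological replicates", "technical replicates", "analysis set"]
experiment_relation_types = ["biological replicate of", "technical replicate of"]
```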
Trying to decide on the grouping of experiments from Rao et al., I was a bit confused. Here is a vague proposal based on what we discussed before and what I think. Sorry for the long text.
I think for replicates the definition should be "a repetition that aims at keeping all controllable and observable parameters the same as the repeated experiment".
A Hi-C experiment (which starts with culturing cells and ends with fastq files from a sequencing machine) that is repeated with a different restriction enzyme or a different sequencer is expected to have differences based on these parameters. If not, it becomes an experiment to test the hypothesis that this parameter has no effect on the observed result. A scientist might treat these two experiments as replicates in the analysis pipeline; however, I think we should consider them as separate experiments and have audits in place to check these. In the experiment set we can offer appropriate grouping for experiments to be analyzed together.
When it comes to the distinction between biological and technical replicates: looking at the Hi-C community, the prepared DNA library is regarded as the ultimate biological material, and repeating protocol steps before this sample is obtained creates a biological replicate. Use of this sample in multiple flowcells of a sequencer produces technical replicates.
If we had 3 experiments with all fields being the same except the submission dates, how would we know whether they are replicates or not (or what kind of replicates)?
Before, I said that we took the Library>Biosample>Source structure of Encode and turned it into Biosample>Biosource, but now I take it back. We turned it into Experiment>Biosample>Biosource. Parameters that lead to a unique material are also defined outside of the biosample, like enzyme, ligation time, etc. I think we can simplify it like this: every experiment is a discrete entity unless stated otherwise. So if all the fields are the same between 3 experiments, instead of trying to judge whether a different batch of cells was used or not, we can simply have a flag/audit that asks about the relation when all fields are the same (except submission date). There might be parameters that are not recorded by us but are different between the experiments.
What is the information we miss? If they are technical replicates, we know that they are the same biological material analyzed twice. If they are biological replicates, we are not certain which steps were repeated. We have a cell culture date, but if it is spleens from two twin mice, we do not have this information.
Why do we want this information? In the end, the important information will be "the set of experiments that are analyzed together in the workflow".