ModellingWebLab / WebLab

Django-based front-end for the modelling Web Lab v2

Create ExperimentalDataset app with basic model & views #131

Open jonc125 opened 5 years ago

jonc125 commented 5 years ago

This may still get split into sub-issues...

MichaelClerx commented 5 years ago

We probably need many-to-many links between ExperimentalData and Protocol, and only once a fit is performed do you tie down exactly which were used? @MichaelClerx?

Not sure what you mean here! Would an ExperimentalData be a single time series? (Or a single matrix, more abstractly)?

jonc125 commented 5 years ago

ExperimentalData is the wet lab analogue of Prediction (currently called Experiment), i.e. a complete dataset with potentially multiple matrices (CSV files). So it's everything you need to compare your prediction against to do a complete fit.

Ideally I'm assuming you'd have one-to-many correspondence between a protocol version and experimental data (since you can repeat the experiment to get new datasets...) but realistically people are going to develop their own protocols to improve the match to the wet lab procedure, or clone existing ones, etc. Hence many-to-many.

Related to this, we need to think about versioning of datasets. Do we want ExperimentalData to have versions in the same way Prediction does (i.e. re-run exactly the same protocol on exactly the same model - results might still differ due to new code version or stochasticity)? Would versions differ only in metadata, or is changing files allowed? Or should new files require a new ExperimentalData object? (With a many-to-many mapping this isn't really a problem, providing you name things sensibly and/or have good search!)

mirams commented 5 years ago

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too. To force re-runs etc. with the refined protocol description to better match how the experiments were done.

But probably need to stop and think for a little while whether we want:

  • one FittingSpec to work with one ExperimentalData (which may include many protocols);
  • or whether we want one FittingSpec to work with multiple ExperimentalDatas that each are linked to one Protocol.

I think I would lean to the second case, so that we can keep a simple one-to-one between ExperimentalData and Protocol, not least to make it easy to think about. Lots of ExperimentalData can share the same Protocol, but I think they would only name one.

MichaelClerx commented 5 years ago

ExperimentalData is the wet lab analogue of Prediction (currently called Experiment), i.e. a complete dataset with potentially multiple matrices (CSV files). So it's everything you need to compare your prediction against to do a complete fit.

Would it be a better idea to have some kind of FittingData that's a view of a larger ExperimentalData set? For example, Kylie's data for a single cell contains multiple time series, and we use different ones in different fits. Even overlapping ones should be possible, so maybe we could have a FittingData just be a list of pointers to things inside our new combine data files?

At the workshop we talked about hierarchical data sets with annotations, and that's still something I'd like to implement in the combine format. (In fact, we have the prototype working, and have scheduled Wednesday to work on that further.) For example, Kylie's data set would have a whole bunch of meta data, but then there'd be 7 subdirectories that added protocol-specific meta data, and then 9 subdirectories in each that added cell-specific meta data (temperature and capacitance).

It would make sense to me to then have a FittingData or something that points to this massive bulk of data, instead of having lots of copies with their own meta data copies etc. etc.

Related to this, we need to think about versioning of datasets. Do we want ExperimentalData to have versions in the same way Prediction does (i.e. re-run exactly the same protocol on exactly the same model - results might still differ due to new code version or stochasticity)? Would versions differ only in metadata, or is changing files allowed? Or should new files require a new ExperimentalData object? (With a many-to-many mapping this isn't really a problem, providing you name things sensibly and/or have good search!)

Would say every run of an experiment is a new data set! Changing files would be allowed though, if we discover e.g. the pre-processing was wrong?

Related to that, I would love to have links between data (a semantic web of data sets) so that we can say things like "Data B by Gary is a processed version of Data A". I'm not saying we should have the web lab do the processing, just the ability to document these links would be incredibly useful. In many cases we'll even have "Data B, made available by a modeller, is a processed version of Data A, which has a publication but no online data set"

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too.

Not sure I follow this - where would we get this new experimental data?

But probably need to stop and think for a little while whether we want:

  • one FittingSpec to work with one ExperimentalData (which may include many protocols);
  • or whether we want one FittingSpec to work with multiple ExperimentalDatas that each are linked to one Protocol.

I think I would lean to the second case, so that we can keep a simple one-to-one between ExperimentalData and Protocol, not least to make it easy to think about. Lots of ExperimentalData can share the same Protocol, but I think they would only name one.

I think I'd also prefer the second! So 1 protocol = 1 time series, that's the way experimenters use the term so I strongly believe we should stick to that (even if multi-dimensional arrays are cleaner)

jonc125 commented 5 years ago

The way I see it is that a FittingSpec refers to exactly one each of ModelVersion mv, ProtocolVersion pv, and ExperimentalData ed. Running pv on mv with a given parameter set produces a Prediction p (when done as part of a fit, this isn't stored in the WL of course, only the overall result of a fit is).

A Prediction may include many outputs, because it's not just doing the time-series simulation, it's doing all the post-processing, and in general it may be the post-processed outputs you compare against post-processed ExperimentalData (though you can also fit to raw data). So a FittingSpec will always also need to state which outputs from Prediction p get compared to which specific series (or other data) from ed.

So ExperimentalData could have multiple datasets corresponding to multiple (even different) protocols within it, and the FittingSpec selects those of interest. This also matches the direction SED-ML is taking with linking to data.

Now, it's probably worth distinguishing at this point between what we want to get running for June 3rd and what it'll eventually look like.

  1. The June 3rd case can be simpler, since we're only trying to capture one use case. So it might well have ExperimentalData containing only one dataset for simplicity both of UI and implementation.
  2. Eventually we need to handle all the complexities you've been discussing. So we still need to discuss them now - we don't want to implement something quickly that will need to change completely, but an incremental step on the way is sensible. All the semantic links between data definitely come in the 'future' category (and since we won't have this semantic-level info for June, simplifying what ed contains for then will make life easier).

So 1 protocol = 1 time series, that's the way experimenters use the term so I strongly believe we should stick to that (even if multi-dimensional arrays are cleaner)

But our protocols are not the same as experimental protocols, since they incorporate post-processing, e.g. producing I-V curves.

MichaelClerx commented 5 years ago

A Prediction may include many outputs, because it's not just doing the time-series simulation, it's doing all the post-processing, and in general it may be the post-processed outputs you compare against post-processed ExperimentalData (though you can also fit to raw data).

Can we draw a diagram or write a short outline of this or something? What are we calling the result of running a "voltage protocol" on a cell? What are we calling post-processed data from such a protocol? And what are we calling a grouping of such results? Are you using ExperimentalData to mean an instance, or a class of things?

Sorry I'm getting really confused here

But our protocols are not the same as experimental protocols, since they incorporate post-processing, e.g. producing I-V curves.

That's fine! I'm sure no-one minds referring to a voltage step sequence + post processing method as a "protocol", e.g. an activation protocol, an IV-curve protocol, etc. I just don't like the idea of an "ExperimentalData (which may include many protocols)". Surely that would be a data set or something?

MichaelClerx commented 5 years ago

Looking back it's this bit already:

ExperimentalData is the wet lab analogue of Prediction (currently called Experiment), i.e. a complete dataset with potentially multiple matrices (CSV files). So it's everything you need to compare your prediction against to do a complete fit.

I would call that a set of predictions, compared against a set of experimental results

jonc125 commented 5 years ago

Names are annoying!

System diagram from Scrambler

I thought we'd managed to agree at Harmony last year :( The decision there was to have the following as Web Lab concepts (each is more like a class than an instance):

Are these at least the concepts we need, even if the names might need refining?

Then, since there are some commonalities, we have some base classes:

mirams commented 5 years ago

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too.

Not sure I follow this - where would we get this new experimental data?

No new data, just a new version of the entity containing the data that is linked to the updated protocol. So that anything that used that data can see that it has been updated, and needs to be re-run with the updated protocol!

jonc125 commented 5 years ago

OK, what Gary is on about on the other ticket is calling Result in the above diagram Fit!

I don't think so - Result in the above diagram has nothing to do with fitting, it's the data arising from wet lab experiments.

mirams commented 5 years ago

Oh, yeah, sorry, forget that!

jonc125 commented 5 years ago

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too.

Not sure I follow this - where would we get this new experimental data?

No new data, just a new version of the entity containing the data that is linked to the updated protocol. So that anything that used that data can see that it has been updated, and needs to be re-run with the updated protocol!

I was thinking of supporting this kind of thing by having a FittingResult reference the protocol version, model version, dataset, etc. used to create it. So if a new version of any of those is created, you could potentially flag to users that they might want to refit.

jonc125 commented 5 years ago

Decided to go with each ExperimentalDataset linking to at most one Protocol for now (possibly zero, @mirams @MichaelClerx?). But multiple datasets can link to the same protocol.

MichaelClerx commented 5 years ago

Sounds good!

(And yes possibly zero!)

jonc125 commented 5 years ago

I've updated the issue description with priorities for the workshop. @MichaelClerx @mirams any comments on these, particularly wrt ordering of the last 2 points and what the data compare UX flow should be?

mirams commented 5 years ago

Probably not a huge priority for next week. If the point above (data with predictions) let us include more than one dataset linked to the protocol, that would enable the same thing.

Again for now, all available data could pop up in the plot legend, maybe unselected, and then the user could click to view each one manually; that would be fine.

MichaelClerx commented 5 years ago

Looks good to me!