ModellingWebLab / WebLab

Django-based front-end for the modelling Web Lab v2

Create ExperimentalDataset app with basic model & views #131

Open jonc125 opened 5 years ago

jonc125 commented 5 years ago

This may still get split into sub-issues...

MichaelClerx commented 5 years ago

We probably need many-to-many links between ExperimentalData and Protocol, and only once a fit is performed do you tie down exactly which were used? @MichaelClerx?

Not sure what you mean here! Would an ExperimentalData be a single time series? (Or a single matrix, more abstractly)?

jonc125 commented 5 years ago

ExperimentalData is the wet lab analogue of Prediction (currently called Experiment), i.e. a complete dataset with potentially multiple matrices (CSV files). So it's everything you need to compare your prediction against to do a complete fit.

Ideally I'm assuming you'd have one-to-many correspondence between a protocol version and experimental data (since you can repeat the experiment to get new datasets...) but realistically people are going to develop their own protocols to improve the match to the wet lab procedure, or clone existing ones, etc. Hence many-to-many.

Related to this, we need to think about versioning of datasets. Do we want ExperimentalData to have versions in the same way Prediction does (i.e. re-run exactly the same protocol on exactly the same model - results might still differ due to new code version or stochasticity)? Would versions differ only in metadata, or is changing files allowed? Or should new files require a new ExperimentalData object? (With a many-to-many mapping this isn't really a problem, providing you name things sensibly and/or have good search!)

mirams commented 5 years ago

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too. To force re-runs etc. with the refined protocol description to better match how the experiments were done.

But probably need to stop and think for a little while whether we want:

  • one FittingSpec to work with one ExperimentalData (which may include many protocols);
  • or whether we want one FittingSpec to work with multiple ExperimentalDatas that each are linked to one Protocol.

I think I would lean to the second case, so that we can keep a simple one-to-one between ExperimentalData and Protocol, not least to make it easy to think about. Lots of ExperimentalData can share the same Protocol, but I think they would only name one.

MichaelClerx commented 5 years ago

ExperimentalData is the wet lab analogue of Prediction (currently called Experiment), i.e. a complete dataset with potentially multiple matrices (CSV files). So it's everything you need to compare your prediction against to do a complete fit.

Would it be a better idea to have some kind of FittingData that's a view of a larger ExperimentalData set? For example, Kylie's data for a single cell contains multiple time series, and we use different ones in different fits. Even overlapping ones should be possible, so maybe we could have a FittingData just be a list of pointers to things inside our new combine data files?

At the workshop we talked about hierarchical data sets with annotations, and that's still something I'd like to implement in the combine format. (In fact, we have the prototype working, and have scheduled Wednesday to work on that further.) For example, Kylie's data set would have a whole bunch of meta data, but then there'd be 7 subdirectories that added protocol-specific meta data, and then 9 subdirectories in each that added cell-specific meta data (temperature and capacitance).

It would make sense to me to then have a FittingData or something that points to this massive bulk of data, instead of having lots of copies with their own meta data copies etc. etc.

Related to this, we need to think about versioning of datasets. Do we want ExperimentalData to have versions in the same way Prediction does (i.e. re-run exactly the same protocol on exactly the same model - results might still differ due to new code version or stochasticity)? Would versions differ only in metadata, or is changing files allowed? Or should new files require a new ExperimentalData object? (With a many-to-many mapping this isn't really a problem, providing you name things sensibly and/or have good search!)

Would say every run of an experiment is a new data set! Changing files would be allowed though, if we discover e.g. the pre-processing was wrong?

Related to that, I would love to have links between data (a semantic web of data sets) so that we can say things like "Data B by Gary is a processed version of Data A". I'm not saying we should have the web lab do the processing, just the ability to document these links would be incredibly useful. In many cases we'll even have "Data B, made available by a modeller, is a processed version of Data A, which has a publication but no online data set"

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too.

Not sure I follow this - where would we get this new experimental data?

But probably need to stop and think for a little while whether we want:

  • one FittingSpec to work with one ExperimentalData (which may include many protocols);
  • or whether we want one FittingSpec to work with multiple ExperimentalDatas that each are linked to one Protocol.

I think I would lean to the second case, so that we can keep a simple one-to-one between ExperimentalData and Protocol, not least to make it easy to think about. Lots of ExperimentalData can share the same Protocol, but I think they would only name one.

I think I'd also prefer the second! So 1 protocol = 1 time series, that's the way experimenters use the term so I strongly believe we should stick to that (even if multi-dimensional arrays are cleaner)

jonc125 commented 5 years ago

The way I see it is that a FittingSpec refers to exactly one each of ModelVersion mv, ProtocolVersion pv, and ExperimentalData ed. Running pv on mv with a given parameter set produces a Prediction p (when done as part of a fit, this isn't stored in the WL of course, only the overall result of a fit is).

A Prediction may include many outputs, because it's not just doing the time-series simulation, it's doing all the post-processing, and in general it may be the post-processed outputs you compare against post-processed ExperimentalData (though you can also fit to raw data). So a FittingSpec will always also need to state which outputs from Prediction p get compared to which specific series (or other data) from ed.

So ExperimentalData could have multiple datasets corresponding to multiple (even different) protocols within it, and the FittingSpec selects those of interest. This also matches the direction SED-ML is taking with linking to data.

Now, it's probably worth distinguishing at this point between what we want to get running for June 3rd and what it'll eventually look like.

  1. The June 3rd case can be simpler, since we're only trying to capture one use case. So it might well have ExperimentalData containing only one dataset for simplicity both of UI and implementation.
  2. Eventually we need to handle all the complexities you've been discussing. So we still need to discuss them now - we don't want to implement something quickly that will need to change completely, but an incremental step on the way is sensible. All the semantic links between data definitely come in the 'future' category (and since we won't have this semantic-level info for June, simplifying what ed contains for then will make life easier).

So 1 protocol = 1 time series, that's the way experimenters use the term so I strongly believe we should stick to that (even if multi-dimensional arrays are cleaner)

But our protocols are not the same as experimental protocols, since they incorporate post-processing, e.g. producing I-V curves.

MichaelClerx commented 5 years ago

A Prediction may include many outputs, because it's not just doing the time-series simulation, it's doing all the post-processing, and in general it may be the post-processed outputs you compare against post-processed ExperimentalData (though you can also fit to raw data).

Can we draw a diagram or write a short outline of this or something? What are we calling the result of running a "voltage protocol" on a cell? What are we calling post-processed data from such a protocol? And what are we calling a grouping of such results? Are you using ExperimentalData to mean an instance, or a class of things?

Sorry I'm getting really confused here

But our protocols are not the same as experimental protocols, since they incorporate post-processing, e.g. producing I-V curves.

That's fine! I'm sure no-one minds referring to a voltage step sequence + post processing method as a "protocol", e.g. an activation protocol, an IV-curve protocol, etc. I just don't like the idea of an "ExperimentalData (which may include many protocols)". Surely that would be a data set or something?

MichaelClerx commented 5 years ago

Looking back it's this bit already:

ExperimentalData is the wet lab analogue of Prediction (currently called Experiment), i.e. a complete dataset with potentially multiple matrices (CSV files). So it's everything you need to compare your prediction against to do a complete fit.

I would call that a set of predictions, compared against a set of experimental results

jonc125 commented 5 years ago

Names are annoying!

System diagram from Scrambler

I thought we'd managed to agree at Harmony last year :( The decision there was to have the following as Web Lab concepts (each is more like a class than an instance):

Are these at least the concepts we need, even if the names might need refining?

Then, since there are some commonalities, we have some base classes:

mirams commented 5 years ago

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too.

Not sure I follow this - where would we get this new experimental data?

No new data, just a new version of the entity containing the data that is linked to the updated protocol. So that anything that used that data can see that it has been updated, and needs to be re-run with the updated protocol!

jonc125 commented 5 years ago

OK, what Gary is on about on the other ticket is calling Result in the above diagram Fit!

I don't think so - Result in the above diagram has nothing to do with fitting, it's the data arising from wet lab experiments.

mirams commented 5 years ago

Oh, yeah, sorry, forget that!

jonc125 commented 5 years ago

I think an update in the Protocol version that data is linked to should probably create a new version of the ExperimentalData too.

Not sure I follow this - where would we get this new experimental data?

No new data, just a new version of the entity containing the data that is linked to the updated protocol. So that anything that used that data can see that it has been updated, and needs to be re-run with the updated protocol!

I was thinking of supporting this kind of thing by having a FittingResult reference the protocol version, model version, dataset, etc. used to create it. So if a new version of any of those is created, you could potentially flag to users that they might want to refit.

jonc125 commented 5 years ago

Decided to go with each ExperimentalDataset linking to at most one Protocol for now (possibly zero, @mirams @MichaelClerx?). But multiple datasets can link to the same protocol.

MichaelClerx commented 5 years ago

Sounds good!

(And yes possibly zero!)

jonc125 commented 5 years ago

I've updated the issue description with priorities for the workshop. @MichaelClerx @mirams any comments on these, particularly wrt ordering of the last 2 points and what the data compare UX flow should be?

mirams commented 5 years ago

Probably not a huge priority for next week. If the point above (data with predictions) let us include more than one dataset linked to the protocol, that would enable the same thing.

Again for now, all available data could pop up in the plot legend, maybe unselected, and then the user could click to view each one manually; that would be fine.

MichaelClerx commented 5 years ago

Looks good to me!