Extend PlantTrial model with the data status enum column

Nuanda commented 8 years ago

Quoting @teatree1212 requirement (from #488):

Further, the state the data is in should be specified. Maybe you could make 3 tick boxes for 1) arithmetic mean, 2) harmonic mean, 3) raw data.

Before we go with the implementation, a set of questions to @teatree1212:

1) What would be a meaningful way (from the POV of users) to call that column? Is 'data_status' ok?
2) Out of curiosity: system-wise the two "mean" settings do not make any difference, right? The current idea is to have a different submission procedure for 'raw' data only, correct? So, currently, the two distinct 'mean' values (harmonic and arithmetic) are simply for other users (who browse submitted data) to know?

teatree1212 commented 8 years ago

After thinking and asking people about it, I think it is not possible to just have two "mean" boxes to tick with regard to processed data. I think we will have to have two options: 1) raw data and 2) processed data, for which the user needs to specify how it has been processed. Maybe there is a statistics ontology that we could draw from instead of making the user define another term.

teatree1212 commented 8 years ago

Here, a user uses REML, an adjusted mean based on a random model, with the statistical factors included in the equation in a certain way, I suppose. This is important metadata which probably needs to be included somehow, despite it being processed data.

Do you have any thoughts about this?

Nuanda commented 8 years ago

Hmm, are those 8 different data submissions, the first one being the "raw" data, and all the others being some kind of "processed" data?

Anyway, what you suggest makes sense: let users mark a submission as either a "raw data submission" or a "processed data submission", explain to them what we mean by "processed" data (by giving examples, like in the above screenshot), and also ask the user to give some hint about the nature/status of the submitted non-raw data, e.g. in the "Trial description" field (for instance, that it's an arithmetic mean).

teatree1212 commented 8 years ago

These (in this case) 8 different sheets of a single Excel file are what the experimentalists call a "database". Ideally the user should just upload the raw data and, in the future, perform subsequent analysis in the database. But ultimately it is down to the users, and they may upload processed data; there we need a mixture of ontology statistics terms (arithmetic mean, geometric mean, etc.) and free text, as seen with REML. Should those be two fields or can this be integrated into one? My suggestion would be two fields: one connected to the ontology (required field), Processed_data_status -> "adjusted mean random model", and one to specify anything else in free text (optional) -> " [analysis_run_number+ ....."
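A minimal sketch of the two-field idea, for illustration only. The field names (`processed_data_status`, `details`) and the small set of allowed terms are assumptions drawn from the examples in this thread, since no ontology has been chosen:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabulary; the thread has not settled on an ontology.
PROCESSED_DATA_TERMS = {
    "arithmetic mean",
    "geometric mean",
    "adjusted mean random model",
}


@dataclass
class ProcessedDataInfo:
    # Required field: must be one of the controlled terms above.
    processed_data_status: str
    # Optional field: free text for anything the vocabulary cannot express.
    details: Optional[str] = None

    def __post_init__(self) -> None:
        if self.processed_data_status not in PROCESSED_DATA_TERMS:
            raise ValueError(
                f"unknown processed data status: {self.processed_data_status!r}"
            )


# The REML case from this thread: a controlled term plus a free-text note.
info = ProcessedDataInfo("adjusted mean random model", details="REML")
```

Keeping the controlled term separate from the free text means the required part stays queryable, while details such as the analysis run can still be recorded.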

And maybe for now ignore my previous "sample size" suggestion to further describe mean data in #488 (point 7), as it is quite complicated to specify when technical and biological replicates are being submitted and some have been manually removed in some instances.

Nuanda commented 8 years ago

Do you have a specific ontology in mind? If the ontology also contains a term for "raw/unprocessed data", then we don't need the switch anymore.

teatree1212 commented 8 years ago

True, I will have a look for you. I did last night but haven't found anything suitable to suggest. I was hoping you would know something.

Nuanda commented 8 years ago

To be honest I've never heard about an ontology for that - though I could easily be wrong since I ceased any research work on ontologies several years ago :).

teatree1212 commented 8 years ago

I thought of a "statistics ontology", where they have terms like "arithmetic mean", "standard deviation" (not relevant to us), etc.

teatree1212 commented 8 years ago

Let's keep it like it is, without an ontology, for now.

Nuanda commented 8 years ago

The current version (done but not yet deployed) has a raw/processed switch inside the form of the 1st step of the submission. However, we do not record that fact in the PlantTrial object (which is created after the entire submission is successfully finalized).

Do you think we should record that fact in the database and show it to the users in the PlantTrials data table?

teatree1212 commented 8 years ago

Yes, good idea. The people submitting data soon will submit their data in two ways: raw (for the links with the pictures) and processed. It would be good to be able to distinguish between them in that way.

Actually, I think the plant trial name has to be unique. Can we use the API to attach raw and processed data to the same project?

Nuanda commented 8 years ago

The Plant Trial name is required to be unique for new submissions. The project name is not a DB table, it is just a column in the plant_trials table. Yes, you can submit many trials for the same project: there is a dropdown selection of existing projects in the 1st step of the trial submission (and you can also type in a new project name).

teatree1212 commented 8 years ago

Hmm, but it would still be the same Project really, as the data originates from the same project. And of course the same trial. The only difference is the data_status, which makes it raw or processed.

teatree1212 commented 8 years ago

In which table do you accommodate data_status?

Nuanda commented 8 years ago

Yes, the same project. You can quickly filter the Plant Trials data table, in the browse section, by project name.

The plant_trials table will be extended with data_status. We will also show it in the browse view.
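A rough sketch of what that extension could look like. It uses an in-memory SQLite database as a stand-in, not the project's actual schema or migration tooling. The `plant_trial_name` and `project_descriptor` columns appear in this thread; the CHECK-based enum, the `add_trial` helper and the sample trial name are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for the existing table (real columns are not listed here).
conn.execute(
    "CREATE TABLE plant_trials ("
    " id INTEGER PRIMARY KEY,"
    " plant_trial_name TEXT NOT NULL UNIQUE,"
    " project_descriptor TEXT)"
)

# Extend the table with an enum-like column restricted to two values.
conn.execute(
    "ALTER TABLE plant_trials"
    " ADD COLUMN data_status TEXT"
    " CHECK (data_status IN ('raw', 'processed'))"
)


def add_trial(name, project, data_status):
    """Hypothetical helper: insert one trial with its data status."""
    conn.execute(
        "INSERT INTO plant_trials"
        " (plant_trial_name, project_descriptor, data_status)"
        " VALUES (?, ?, ?)",
        (name, project, data_status),
    )


add_trial("example_trial_raw", "IMSORB", "raw")
```

With the constraint in place, inserting any status other than 'raw' or 'processed' is rejected by the database rather than by application code.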

teatree1212 commented 8 years ago

Shouldn't this be somewhere further "down", maybe in Plant_Scoring_Units? I suppose it depends on what a "Plant Trial" is defined as.

By a scientist's definition, a Project is, for example, the overall project, with a name like OREGIN and a universal purpose or question. Within that Project there can be multiple trials, executed in different places, each trying to answer the Project question in its own way by looking at different traits and hence scoring different things. Each trial generates raw data, but the data can also be analysed externally. This leaves us with two (or more) data types (depending on the type of analysis) for a single trial.

For an example of projects and trials, have a look in the database with SELECT project_descriptor, plant_trial_name FROM Plant_Trials WHERE project_descriptor ='IMSORB';
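To illustrate the one-project, many-trials relationship, here is a small self-contained sketch that runs the same query against an in-memory SQLite table. The table is reduced to the two columns named in the query, and the trial names are invented placeholders, not real database content:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced stand-in for Plant_Trials with only the two columns used in the query.
conn.execute(
    "CREATE TABLE Plant_Trials ("
    " plant_trial_name TEXT PRIMARY KEY,"
    " project_descriptor TEXT)"
)

# Invented placeholder rows: two trials in one project, one in another.
conn.executemany(
    "INSERT INTO Plant_Trials VALUES (?, ?)",
    [
        ("example_trial_1", "IMSORB"),
        ("example_trial_2", "IMSORB"),
        ("example_trial_3", "OREGIN"),
    ],
)


def trials_for_project(project):
    """Return (project_descriptor, plant_trial_name) rows for one project."""
    cursor = conn.execute(
        "SELECT project_descriptor, plant_trial_name"
        " FROM Plant_Trials"
        " WHERE project_descriptor = ?"
        " ORDER BY plant_trial_name",
        (project,),
    )
    return cursor.fetchall()
```

Calling `trials_for_project("IMSORB")` returns every trial row that shares that project descriptor.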

1) So I would locate the data_status somewhere in a table further "down". Maybe you could assign a data_status to each Scoring_unit, for example, which probably makes querying slow.

2) Or we just have to live with it and introduce a nomenclature for trials, where we add a "_raw" or "_processed" to the unique trial names users submit.

Nuanda commented 8 years ago

I understand we have agreed that a single BIP plant trial submission is for data of the same "state" - so if one has both raw and processed data for the same trial, s/he should perform two distinct trial submissions in BIP.

Now, let me check I understand you correctly. What you propose here is to let the user, for a new plant trial submission, choose an existing (i.e. already submitted) plant trial instead of creating a new one? So, after the second submission is successful, a new set of plant scoring units is added to the already existing plant trial object?

teatree1212 commented 8 years ago

You understand me correctly. I hadn't thought about the possibility of people submitting different data "states", even though we were actually looking at this possibility earlier (see the green picture above: it's all from the same trial). I know you have been developing this now, so maybe for now we should just go ahead with it. But see below:

I wasn't considering this earlier, but it may also be important: the data_status information needs to be attached to the single scoring values anyway, because at some point in the future people will perform on-the-fly statistical analysis on the data in the database. There, people may query not by trial but by trait_descriptor, so across multiple trials.
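A sketch of why a per-unit status matters for such cross-trial queries. Everything here is hypothetical: the two-table layout, the column names other than `data_status` and `trait_descriptor`, and the sample values are assumptions for illustration, not the actual BIP schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plant_scoring_units (
        id INTEGER PRIMARY KEY,
        plant_trial_name TEXT NOT NULL,
        data_status TEXT NOT NULL CHECK (data_status IN ('raw', 'processed'))
    );
    CREATE TABLE trait_scores (
        plant_scoring_unit_id INTEGER REFERENCES plant_scoring_units(id),
        trait_descriptor TEXT NOT NULL,
        score_value REAL
    );
    """
)

# Invented sample data: one trait scored in two trials, in mixed data states.
conn.executemany(
    "INSERT INTO plant_scoring_units VALUES (?, ?, ?)",
    [(1, "trial_a", "raw"), (2, "trial_a", "processed"), (3, "trial_b", "raw")],
)
conn.executemany(
    "INSERT INTO trait_scores VALUES (?, ?, ?)",
    [(1, "seed_yield", 4.0), (2, "seed_yield", 4.5), (3, "seed_yield", 6.0)],
)


def mean_score(trait_descriptor, data_status):
    """Average one trait across all trials, restricted to one data status."""
    row = conn.execute(
        "SELECT AVG(ts.score_value)"
        " FROM trait_scores ts"
        " JOIN plant_scoring_units psu ON psu.id = ts.plant_scoring_unit_id"
        " WHERE ts.trait_descriptor = ? AND psu.data_status = ?",
        (trait_descriptor, data_status),
    ).fetchone()
    return row[0]
```

Without the status on the scoring unit, a query by trait would mix raw values with already averaged ones and compute a mean over both.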

Nuanda commented 8 years ago

Ok, so the conclusion for now is:

1) Keep the flag in the 1st step of the submission (as it is important for further steps).
2) Do not store the flag in the DB table plant_trials.
3) For all plant_trials which will be submitted that way we have a clear migration route: all PSUs related to a "raw" trial are "raw", and all PSUs related to a "processed" trial are "processed" (so we can migrate data to the more detailed model in the future).
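The migration route described above could be sketched roughly as follows. This is only an illustration: the thread does not say where the per-trial status would come from once it is needed, so the `trial_status` mapping, the `plant_scoring_units` layout and the function name are all assumptions:

```python
import sqlite3


def migrate_status_to_units(conn, trial_status):
    """Copy each trial's status onto all of its plant scoring units (PSUs).

    trial_status is a hypothetical mapping of trial name to 'raw' or 'processed'.
    """
    for trial_name, status in trial_status.items():
        conn.execute(
            "UPDATE plant_scoring_units"
            " SET data_status = ?"
            " WHERE plant_trial_name = ?",
            (status, trial_name),
        )


conn = sqlite3.connect(":memory:")

# Simplified stand-in table; the status column starts out empty (NULL).
conn.execute(
    "CREATE TABLE plant_scoring_units ("
    " id INTEGER PRIMARY KEY,"
    " plant_trial_name TEXT,"
    " data_status TEXT)"
)
conn.executemany(
    "INSERT INTO plant_scoring_units (plant_trial_name) VALUES (?)",
    [("trial_raw",), ("trial_raw",), ("trial_processed",)],
)

migrate_status_to_units(
    conn, {"trial_raw": "raw", "trial_processed": "processed"}
)
```

Because one submission always holds data of a single state, the per-trial value can be applied to every PSU of that trial without ambiguity.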

Talking about that future, another quite important question is: are we able to set such a data status for all current (of CropStore origin) plant scoring units?

teatree1212 commented 8 years ago

Conclusion sounds good!

Future: that is a good question. I have to ask Graham King about this. The way people submitted raw data can be seen in #471.

Nuanda commented 8 years ago

Implemented as in 'conclusions'.

TGAC / brassica

Extend PlantTrial model with the data status enum column #518