teatree1212 opened this issue 8 years ago
@teatree1212 Annemarie - do you expect me to comment on that now, or do you plan to add more information to that thread, first?
If you feel that there is information missing, please let me know. I was done with the comment. But ask questions about things so I can clarify.
With regard to items 3 and 4 in @Nuanda's post, I would suggest reusing existing models to store block/pot data - specifically, repurposing DesignFactor, where we have five "generic" columns that could hold such values. Additionally, since some scoring units represent individual pots while others are aggregates calculated over several replicate measurements, perhaps we need an extra column in PlantScoringUnit describing what type of unit we're dealing with (single plant/average of several plants/etc.) - I'm thinking along the lines of an enumerative data type. Such a column would also enable us to correctly interpret the five design factor values in the attached DesignFactor object. What do you think?
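The enumerative-column idea could be sketched in plain Ruby roughly like this (an illustration only - the type names and integer codes below are assumptions, not the actual BIP schema; in Rails this would likely become an `enum` on the PlantScoringUnit model):

```ruby
# Hypothetical mapping for a scoring_unit_type column on PlantScoringUnit.
# The codes and labels are illustrative assumptions.
SCORING_UNIT_TYPES = {
  0 => :single_plant,    # one plant / one pot per scoring unit
  1 => :plant_average,   # mean over several replicate measurements
  2 => :plot_aggregate   # value aggregated over a whole block/plot
}.freeze

def scoring_unit_type(code)
  SCORING_UNIT_TYPES.fetch(code) do
    raise ArgumentError, "unknown scoring_unit_type #{code}"
  end
end

scoring_unit_type(1)   # => :plant_average
```

An enumerative column like this would let the parser decide how to interpret the five generic DesignFactor values attached to each unit.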
@nowakowski Regarding your second idea - it is already proposed by @teatree1212 that a PlantTrial will have this three-way mutex about the 'state of data' (see the first posting). This will indicate whether PSUs connected to a Trial represent single pots each, or patches of plants, with the adequate consequences (see the 'BUT...' remark).
Shouldn't there be a replicate_score_reading column in the TraitScores table? Would that help?
And you are right - I think there are the design factor column names, which were intended to be used for this. So maybe the relationship between the tables is already properly defined?
However, again, maybe we should not make this compulsory, as I think that it would be a pain to apply this to large-scale experiments. Ideally, people should upload their raw data and maybe in the future, people can do some minor statistical analysis in the BIP, where calculation settings are recorded automatically.
@Nuanda With regard to 'state of data' - glad to see we're thinking along the same lines. However, adding this column to plant_trials instead of plant_scoring_units is tantamount to declaring that, for a given plant trial, all scoring units are of the same type (block/pot or replicate). @teatree1212 - are you sure this requirement will always be fulfilled?
@teatree1212 Re. your previous comment - that column (replicate_score_reading) was removed in #123 (it was all '1' anyway). Now, we can reintroduce it, but perhaps we should do that on the PSU level? Then it will be submitted as an additional column in the scoring CSV file in the 3rd step of the submission.
Also: should the replicate column be of an integer type? And should it be made compulsory for 'raw data' submissions, but not required (or even present) for mean data submissions?
Regarding 5: Relating a PSU to both PlantLine and PlantVariety is somewhat suspect from the data integrity PoV, but I can add the necessary validations to ensure that no ambiguities arise. @teatree1212 - do you mean that a PSU must always be related to either a PlantLine or a PlantVariety, and that it can never be related to both at the same time?
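If the answer is yes, the mutual-exclusion check could look roughly like this (a plain-Ruby stand-in for what would be a model validation; the attribute names are assumptions for illustration):

```ruby
# Hypothetical XOR check: a PSU must reference exactly one of
# PlantLine or PlantVariety - never both, never neither.
def valid_psu_relation?(plant_line_id:, plant_variety_id:)
  [plant_line_id, plant_variety_id].compact.size == 1
end

valid_psu_relation?(plant_line_id: 7,   plant_variety_id: nil)   # => true
valid_psu_relation?(plant_line_id: 7,   plant_variety_id: 3)     # => false
valid_psu_relation?(plant_line_id: nil, plant_variety_id: nil)   # => false
```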
@nowakowski 'however adding this column to plant_trials instead of plant_scoring_units is tantamount to declaring that, for a given plant trial, all scoring units are of the same type (block/pot or replicate). @teatree1212 - are you sure this requirement will always be fulfilled?' Can you specify which field we are talking about? Are we talking about PSU?
@teatree1212 Sorry for not being precise enough - I meant the column mentioned in item 7 of your previous message ("Yes, these should be mutually exclusive and the character of the data displayed in an additional column.") - let's provisionally call it scoring_unit_type.
@nowakowski Ah, now I understand. I am not absolutely sure, but experimental trials are normally standardised and have no differing features within one single trial. As soon as you have values generated from a pot and a block experiment, these are already two different trials, as one has taken place in a greenhouse and one in the field. I would even expect that a trial consists of scoring units of a constant type, both in terms of block/pot and of replicate.
@Nuanda In response to your unnumbered points regarding additions to the PSU: I totally agree that this should be submitted in the sheet. This is generally the case in experimental field trial spreadsheets. replicate_score_reading is hence a technical replicate, I suppose, right? This is something you also see in the RIPR spreadsheets (e.g. the tocopherol sheet). However, how can you make sure that the user doesn't mix up the order? I would again suggest making the user create the template .csv after he has selected the traits, where all other compulsory columns are also named as headers. Then the user has to copy and paste the right values beneath the right header before uploading it. A different strategy would, I think, confuse the user even more.
2. Okay, sorry, that must have been confusing to reconstruct. Now I understand. Yes, I mean Units_of_measurements. But during the submission of a new trait, this should be made a compulsory field.
3. Sometimes there may not be a replicate despite it being raw data, so I wouldn't make it compulsory. For mean data, I don't think it needs to be present.
Haven't heard back from experimentalists yet.
I received an email which I want to share with you in this context about the RIPR data, which may also be useful for the development of the API (#481). I sent it to @Nuanda and @nowakowski. It is about the minimum requirements they think are important for a particular dataset, which was generated from a greenhouse experiment. It also shows that many columns of the dataset are not relevant for an outsider and don't need to be uploaded to the database.
But after talking to Wiktor, he is of the opinion that we should upload all the information in the spreadsheets. Maybe this email can be used as a basis for raw-data minimum requirements of what to visualise on the website. I will meet Wiktor on Wednesday to follow up on several threads.
@teatree1212 In the meantime, before you talk to Wiktor...
.csv file location: Yes, I actually discovered it today :) - looks good! Maybe you could move the "replace_it" to the front and call it "replace_example_sample_A_value_0". However, I somehow expected it to be in submission step 2, appearing underneath the "add new trait descriptor" button. Alternatively, it could appear first thing in submission step 3 so people notice it easily. I think it is rather hidden where it is right now.
Reintroduction of technical replicates: technical replicates should ideally be taken from the same sample, yes. Otherwise it would be a biological replicate. So this type of replicate option has been removed from BIP? As we see that technical replicates are being used in the RIPR project, we should probably reintroduce it, as it will be used in other studies, too.
I think we will need a kind of "2.5 step" - at the beginning of step 3 we will ask the user several questions and then generate a CSV template for her/him. And given the added complexity of the parsing process, we will probably require the input in the form of what was generated anyway - otherwise, there are too many different scenarios...
One such question would be "What number of technical replicates were scored?" (1+). Another one could be "Do you submit/register accession numbers for individual units?" (yes/no). Yet another, for raw data only, could be the set of checkboxes for "polytunnel", "rep", "block", "pot" so the user can tick only the ones that s/he has data for (for instance, I noticed OREGIN uses "plot" assignment - we can add that option too). We may also ask if individual rows have assigned varieties/cultivar names OR line names (requiring exactly one to be selected).
This "mini-wizard" will then generate a correct template file, with column names that will be easily recognizable to the BIP parser.
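A minimal sketch of what such a template generator might do (all column labels, parameter names, and defaults here are illustrative assumptions, not the actual BIP implementation):

```ruby
require 'csv'

# Hypothetical "step 2.5" wizard output: build the CSV header row from
# the user's answers. Labels and parameter names are assumptions.
def scoring_template_headers(traits:, technical_replicates: 1,
                             design_factors: [], accessions: false,
                             identify_by: :line)
  headers = [identify_by == :line ? 'Line' : 'Variety']
  headers << 'Accession number' if accessions
  headers += design_factors                 # e.g. %w[polytunnel rep block pot]
  traits.each do |trait|
    (1..technical_replicates).each { |i| headers << "#{trait} rep#{i}" }
  end
  headers
end

puts CSV.generate_line(scoring_template_headers(
  traits: ['seed glucosinolate content'],
  technical_replicates: 2,
  design_factors: %w[rep block pot]
))
```

Because the wizard controls the exact header names, the parser on the upload side only ever has to handle the templates it generated itself.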
(Moved year and plot overview discussion, points 1. and 8. in the above thread, to #499 for clarity).
2.5 step: sounds like a good idea.
Plot vs Pot: this is because OREGIN trials are only field trials, whereas the RIPR data we have for now is only from greenhouse trials, I think. Therefore, what we could do is ask for the trial type somewhere in step 1 and then, depending on the specification, present the respective "rep" "block" "plot" for field trial data and "polytunnel" "rep" "block" "pot" for greenhouse trials. There could also be growth chamber experiments, and I don't know how they are set up.
Reopening - wrong issue link in @nowakowski's PR. Sorry.
Plot vs Pot (continued) So maybe we could give the user a bit more freedom, and present a set of 5 checkboxes: [ ] polytunnel [ ] rep [ ] block [ ] pot [ ] plot, and ask them to pick the set of categories that represents their trial setup? So if one chooses options 2, 3, and 4, the resulting scoring template CSV will have the three "rep" "block" "pot" column headers, but no "polytunnel" or "plot" column headers?
Then, we will record that set, as a "/"-separated string, in the PlantTrial.design_factors column (see #502 for an example value in the current BIP DB), and record the three values (submitted under those three CSV headers in the uploaded file) in separate DesignFactor objects (one per PSU).
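A sketch of that storage scheme (the helper names are assumptions for illustration, not actual BIP code): the chosen factor set goes into the trial as one "/"-separated string, and each uploaded row's values are split out per PSU.

```ruby
# Hypothetical helpers for the "/"-separated factor set on PlantTrial
# and the per-PSU DesignFactor values. Names are illustrative.
def encode_design_factors(selected)
  selected.join('/')
end

def psu_design_factor_values(design_factors_string, csv_row)
  design_factors_string.split('/').map { |name| csv_row.fetch(name) }
end

trial_factors = encode_design_factors(%w[rep block pot])   # "rep/block/pot"
row = { 'rep' => '1', 'block' => '3', 'pot' => '102' }
p psu_design_factor_values(trial_factors, row)             # ["1", "3", "102"]
```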
@teatree1212 What do you think?
5 checkboxes sounds good, as we have design_factor_1 to _5 and I don't suspect more than 5 elements will come up.
I did some research into TrialDesigns, and it is unfortunately not that easy to just provide checkboxes, as sometimes a rep can be equal to a plot in statistical terms (see the experimental design fields in issue #502), and the design_type determines the statistical_factors, and hence the design_factors, within an experiment.
The reason for my hesitation is the future plan of applying analytics to these datasets. It is important to map the same elements to the corresponding fields in the database, in case the database is at some point supposed to do these automated statistical analyses. I have further contacted Graham King to explain the current setup of the database to me, as all the elements are in there, but I don't understand how an automated analysis could be achieved with this setup.
In other experiments the polytunnel may be a site for a field trial. I am trying to find out what is what, how many different possibilities there are, and what their overlap is. I think that, ultimately, the user needs to be able to map their experimental design to the corresponding fields in the database, as the column headers themselves can be misleading.
I think in general your approach is pretty clever, and maybe we should just go with it if GK's explanation of the fields does not follow our aims. This would mean that statistics would need to be done manually in the future (what I mean is, for example, manually selecting blocks or reps or polytunnels/fields for ANOVA). This means that raw data cannot automatically be turned into analysed data in the database and is therefore not available for download by default, unless these analyses have been performed on it.
So, after hearing back from the original database creator, Graham King (GK), I think we should try to stick to the schema, and to what we insert into the fields, as much as possible. The reason is that they are currently developing R-based analysis for raw data, and they were happy to share the code with us once it is done.
I see that in the design_factor_1 to _5 naming convention they keep the header a "design factor" and add the actual experimental design factor information into the fields beneath the headers, together with the respective identifier from the dataset: e.g. block_1, row102, ... Looking into the database, this has been more or less consistently followed.
So if we do this, maybe we can follow your tickbox approach, as this is easier for the user, but parse it so that the database stores it in the format, or rather naming convention, already in use ("<designFactorName>_<designFactorNumber/Identifier>" = block_1).
E.g. (visible text to the user is in double quotes):
"Please select the appropriate design factors of your trial design"
- the user ticks elements from an experimental design, sorted to be decreasing in size:
"[ ] field [ ] polytunnel [ ] rep [ ] block [ ] pot [ ] plot [ ] occasion [ ] rep=block [ ] ..."
... and all the other elements that are available already.
Once they are selected, they are used as headers in the 2.5-step spreadsheet, and the user can add their identifying numbers/identifiers (1, 102, ...) to the respective columns and upload the sheet again.
This then needs to be parsed in such a way that the designFactorName headers (at this stage) end up in each of the fields in the columns, in combination with the user's submitted numbers/identifiers (1, 102, ...), with design_factor_1 to _5 being the new headers.
design_factor_1 to _n of the experiment will then be stored in the DesignFactors table.
Back to the selected designFactorNames: they should be stored together, separated by "/", in the Plant_Trial.statistical_factors column.
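The naming convention described above could be parsed roughly like this (a sketch under the stated assumptions - the helper name and row format are hypothetical, not GK's actual code):

```ruby
# Combine each selected designFactorName header with the identifier the
# user typed under it, producing the "<name>_<identifier>" values that
# fill design_factor_1 .. design_factor_5. Names are illustrative.
def to_design_factor_fields(factor_names, row)
  factor_names.each_with_index.to_h do |name, i|
    ["design_factor_#{i + 1}", "#{name}_#{row.fetch(name)}"]
  end
end

p to_design_factor_fields(%w[block row], { 'block' => '1', 'row' => '102' })
# {"design_factor_1"=>"block_1", "design_factor_2"=>"row_102"}
```

This keeps the user-facing tickbox approach while writing values in the convention already present in the database.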
I see one problem creeping up as terms get added: ideally, in the Plant_Trial.statistical_factors column, the statistical factors should also be sorted to be decreasing in size. With the submission of new designFactorNames, they would by default be added at the bottom, I suppose. In the case of the tick-box approach, maybe it is possible to drag and drop the new designFactorName between the old ones, which make room for it (I don't know how hard this is to develop). In the case of the 5-field-dropdown approach, we would have to ask the user to select the designFactorNames in decreasing size. This is a natural thing for them to do, as this is also how they think in terms of their statistical units and how the spreadsheets are set up.
What do you think?
A 15-item checkbox block already looks a bit intimidating. I think the selector-based approach may suit the user better here. So, we can have it like that:
The rest stays exactly as you described. Is what I propose sensible?
sounds good.
I spotted a little issue here: some of the statistical factors are called "replicate", whereas others are called "rep". But rep = replicate. I suppose that, to list these fields as examples, you did a SELECT statistical_factors FROM Plant_Trials;
Can you correct the "replicate"s to "rep"? It is a common term everyone knows, and I can also define rep = replicate somewhere in an explanatory text.
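The requested cleanup could be sketched like this (a hypothetical helper over the "/"-separated statistical_factors strings; the function name is an assumption):

```ruby
# Normalise "replicate" to "rep" inside a "/"-separated factors string,
# leaving all other factor names untouched.
def normalise_factors(statistical_factors)
  statistical_factors.split('/')
                     .map { |f| f == 'replicate' ? 'rep' : f }
                     .join('/')
end

p normalise_factors('replicate/block/pot')   # => "rep/block/pot"
p normalise_factors('rep/block')             # => "rep/block"
```

Matching the whole factor name (rather than a substring replace) avoids mangling any future factor that merely contains "replicate".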
@kammerer Tomasz, what do you think about my proposition two comments before? Anything missing or not possible?
@Nuanda I don't see any obvious obstacles. An idea: in the last select field we may limit the user to a smaller factor than the one selected before (we may show unavailable options in a disabled state).
Have you thought about some way of moderating the submitted factors to make sure that the order makes sense?
Your idea to disable (grey out) factors bigger than the one just selected is very good. This should prevent the user from mixing the sizes when using the "select" part of the widget.
No additional validations for the free-text part are planned. We simply rely on the user reading the instruction text. I'll add the "implementation" task for that soon.
@teatree1212 Have you perhaps been able to write down that explanation for the BIP user which I mentioned in bullet 1) several comments ago? In case you've posted it somewhere already, can you remind me where to look for it?
No, I haven't, but here it is:
Please select the appropriate statistical design factors for your experimental setup from the (drop-down?) fields. Start by selecting the highest design factor first and proceed with the next ones in decreasing size. If your factors are not present in the list provided, please add them manually. While creating new factors, please make sure you comply with the rule of submitting them in decreasing size, i.e. with the highest missing factor (e.g. treatment) first, down to the lowest missing factor (e.g. plot).
@teatree1212 Do you think we can close this thread? Are there any outstanding things here, not listed as separate GH issues, which we should address?
Based on a meeting today, I have the minimum requirements which should be made compulsory in the trial submission: year, (besides the country, also the) place name, units, block/pot number, replicate, line/cultivar/variety, plus its genetic status. Optional internal accession fields would be desirable (which the public doesn't necessarily need to see, but which are important for internal analysis). Further, the state the data is in should be specified; maybe you could make 3 tick boxes for 1) arithmetic mean, 2) harmonic mean, 3) raw data.
BUT: all of these should only be made compulsory if the data is raw data. If it is submitted as a mean, then block/pot number and replicate are unnecessary.
Also, the plot overview would be desirable. I think, though, that this may need to be included in an additional development step (?) - have a think. I have attached an image of what it looks like. However, these are always part of the Excel spreadsheet they use for the field trial setup (see screenshot).