TGAC / brassica

Brassica Information Portal
GNU General Public License v3.0

minimum requirements trial submission #488

Open teatree1212 opened 8 years ago

teatree1212 commented 8 years ago

Based on a meeting today, I have the minimum requirements which should be made compulsory in the trial submission: Year, Place name (in addition to the country), units, Block/pot number, replicate, line/cultivar/variety and its genetic status. Optional internal accession fields would be desirable (the public doesn't necessarily need to see them, but they are important for internal analysis). Further, the state the data is in should be specified. Maybe you could make 3 tick boxes for 1) arithmetic mean, 2) harmonic mean, 3) raw data.

BUT: All these should only be made compulsory, if the data is raw data. If it is submitted based on a mean, then block/pot number and replicate are unnecessary.
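As a rough illustration of this conditional rule, here is a minimal Python sketch. It is not BIP code: the field names and the `missing_fields` helper are hypothetical, and it assumes the reading that block/pot number and replicate are compulsory only for raw data while the remaining fields are always compulsory.

```python
# Hypothetical field names; the real BIP submission columns may differ.
ALWAYS_REQUIRED = {"year", "place_name", "units", "line_or_variety", "genetic_status"}
RAW_ONLY_REQUIRED = {"block_or_pot_number", "replicate"}
DATA_STATES = {"arithmetic_mean", "harmonic_mean", "raw_data"}


def missing_fields(record, data_state):
    """Return the sorted names of compulsory fields that are absent or empty."""
    if data_state not in DATA_STATES:
        raise ValueError(f"unknown data state: {data_state!r}")
    required = set(ALWAYS_REQUIRED)
    # Assumption: block/pot number and replicate only matter for raw data.
    if data_state == "raw_data":
        required |= RAW_ONLY_REQUIRED
    return sorted(name for name in required if record.get(name) in (None, ""))
```

Calling `missing_fields(record, "raw_data")` on a record without block/pot and replicate values would report both, while the same record passes for either kind of mean.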

Also, the Plot overview would be desirable. I think, though, that this may need to be included in an additional development step (?); have a think. I have attached an image of what it looks like. However, these overviews are always part of the Excel spreadsheet they use for the field trial setup (see screenshot).

[Image: screen shot 2016-04-25 at 16 54 15]

Nuanda commented 8 years ago

@teatree1212 Annemarie - do you expect me to comment on that now, or do you plan to add more information to that thread, first?

teatree1212 commented 8 years ago

If you feel that there is information missing, please let me know. I was done with the comment. But ask questions about things so I can clarify.

Nuanda commented 8 years ago
  1. Year. See our google docs discussions - one in the Representation of “time” section, and then another one in the rep of “time” alternative bullet below. Please note my comment about PSU's 'date_planted' column (data type: date). Tell me, if we should make any changes regarding that or simply make the Year field required.
  2. Units - I assume you mean "Units of measurements" field in the new Trait Descriptor form.
  3. Block/pot number - this is a whole different issue and we need to discuss it further. For instance, see the DesignFactor model in the current schema - should we reuse it somehow? I think that could be the original intent of the CS creator.
  4. Replicate - it's something that is not represented now in the submission. Should we extend the current PSU model for an additional (optional) column called 'replicate'?
  5. A PSU is related to a PlantLine through a PlantAccession. A PlantLine may be related to a PlantVariety (currently most are not - see #100). PlantLines are created during the PlantPopulation submission process (so, currently, there is no possibility, other than through the API, to manually add PlantLines without adding a PlantPopulation). We need to devise a policy on how the submission system should react to cultivar names that are not in the system. Even further - how to add a PlantVariety (for an unknown, new cultivar name), then add a related PlantLine, then add a PlantAccession and then relate it to the PSU in question? Or, maybe, we should drop some of those tables/relations? Maybe the current CS/BIP model is simply too complex? Or the relations are wrong (and, for instance, a PSU should be directly related to a PlantLine or even a PlantVariety)?
  6. Genetic status - it's a PlantLine column, we may require it as mandatory during a PlantPopulation submission.
  7. Data status - the tick boxes and a corresponding DB column could be added (I assume you mean the three options are mutually exclusive?). However, please also take a look at these PSU columns: number_units_scored, scoring_unit_sample_size, scoring_unit_frame_size - perhaps that could be important for cases of recording mean values?
  8. Plot overview - I saw your comment in the google doc. Is an image upload sufficient? If so, is one image per PlantTrial enough?
nowakowski commented 8 years ago

With regard to items 3 and 4 in @Nuanda 's post, I would suggest reusing existing models to store block/pot data - specifically, repurposing DesignFactor where we have five "generic" columns that could hold such values. Additionally, since some scoring units represent individual pots while others are aggregates calculated over several replicate measurements, perhaps we need an extra column in PlantScoringUnit describing what type of unit we're dealing with (single plant/average of several plants/etc.) - I'm thinking along the lines of an enumerative data type. Such a column would also enable us to correctly interpret the five design factor values in the attached DesignFactor object. What do you think?

Nuanda commented 8 years ago

@nowakowski Regarding your second idea - it is already proposed by @teatree1212 that a PlantTrial will have this three-way mutex about the 'state of data' (see the first posting). This will indicate whether PSUs connected to a Trial represent single pots each, or patches of plants, with the adequate consequences (see the 'BUT...' remark).

teatree1212 commented 8 years ago

Shouldn't there be a replicate_score_reading column in the TraitScores table? Would that help?

[Image: screen shot 2016-05-02 at 21 44 09]

And you are right, I think there are the design factor column names, which were intended to be used for this. So maybe the relationship is already properly defined between the tables?

[Image: screen shot 2016-05-02 at 21 48 17]

teatree1212 commented 8 years ago
  1. I will hear back from experimentalists about this.
  2. Correct. Bear in mind that this will also be a component of the trait variable itself (just as a reminder, see below the structure for a trait based on cropOntology). But I think it is a column to keep. However, the information should probably be taken from the trait chosen or defined, rather than making the user submit it again in the sheet as a separate column. [Image: screen shot 2016-04-29 at 15 13 51]
  3. As in my previous comment, I agree that this was probably the intention of that column. Therefore, maybe the relationships are already conveniently defined, but I am interested to hear about the feasibility. Bear in mind here that there are 1) plots, but also 2) measurement replicates in the case of raw data submission, which may need to be accommodated, too. Maybe these design_factor fields would be helpful there, too? I would suggest we also use the "plot" for greenhouse experimental designs, where the plot would be a "pot". The same principle applies.
  4. Whether this is dealt with like the "plot" issue in 3. or by creating another column in the PSU table: whichever requires less computational time and power when the content gets displayed.
  5. I agree that this is too complex and, as I mentioned in #481, experimentalists think that this is an unnecessary relationship which does not render meaningful information to everyone.
    a. "A PlantLine may be related to a PlantVariety (currently most are not - see #100)." PlantLines are not necessarily associated with PlantVarieties in reality, so them not always being related to a Variety is valid.
    b. "We need to devise a policy on how the submission system should react to cultivar names that are not in the system. Even further - how to add a PlantVariety (for an unknown, new cultivar name), then add a related PlantLine, then add a PlantAccession and then relate it to the PSU in question?" Option 1: You mentioned the possibility of submitting Cultivar names during trial submission. Option 2: Cultivar sets are submitted like diversity sets (which actually are a list of cultivars known to have many differing traits, so good for trait scoring). This could just be another option beneath submitting a plant line list in the PlantPopulation Submission process.
    c. "so, currently, there is no possibility, other than through the API, to manually add PlantLines without adding a PlantPopulation" I think that is fine and can remain like this. However, this could be the reason why there were so many Populations registered in the database, even though the lines were only subsets of a few populations (see issue #444). It may be of use to upload the big populations with all lines from scratch and kick out all the subpopulations. That way people may not necessarily have to upload new lines, as these experimentalists often work with the same lines. I am working on identifying the full lines in crossing/mutagenised populations, and cultivars in diversity foundations and diversity fixed foundations.
    d. "Or, maybe, we should drop some of those tables/relations?" I suggest having the PlantAccession as standalone information attached to the PSU for each user, connected to their personal scoring data, and not as a connecting element between PlantLine and PSU.
    e. "Or the relations are wrong (and, for instance, a PSU should be directly related to a PlantLine or even a PlantVariety)?" A consequence of my last statement is that PSUs would be directly related to PlantLines and PlantVarieties. The relationship between PSU and Lines/varieties in reality can be quite convoluted and I think the CropStore schema accommodated that by actually relating PSUs to PlantLines and then PlantVarieties. In reality, PSUs, depending on the experiment, would be related to either Lines or Varieties. Some Lines will never be associated with a variety, but a variety has most likely been derived from an experimental line. I don't know whether the CropStore database was ever intended to host data that comes directly from varieties, which is probably why they did not see a direct relationship between PSU and PlantVariety. But in the future, and you see it in the example of the RIPR data, plant varieties will need to be directly associated with PSUs (Alesi is for example a variety name and is part of the mock-sheets I sent to you). Can you remind me, just so my train of thought above makes sense: I assumed that the PSUs are the actually measured or mean trait scores, right? They are not another element between the line/variety and the actual measured score, like the accession!?
  6. I have to hear back with Experimentalists about this. I don't know how universal the genetic status of lines is, when submitting a PlantPopulation or whether they would normally need to be specified for each line. On that note, when submitting a Population, e.g. 172 single Lines may need to be submitted. This means for now in the wizard submission, that the user needs to manually create all these 172 lines AND add a taxonomy term to them. a few issues arise from there, which I will mention in a separate issue as it may require some work thinking about it.
  7. Yes, these should be mutually exclusive and the character of the data displayed in an additional column. However, looking into the future, when raw data is submitted, maybe averages can be automatically generated by BIP and stored in the database, so they should only be mutually exclusive during submission. I suppose when submitting mean values, one could be asked to add information like number_units_scored, scoring_unit_sample_size, scoring_unit_frame_size for people who want to understand how the mean is generated. However, I do not understand the meaning of the three columns and there is no information about them. An important aspect, I think, is to get information on how many values are used to create the mean. What makes it tricky here is that sometimes people remove some values when calculating means, which means there is no single number of samples used to derive a mean trait score. Similarly, the means may not have been derived from the same number of measurements. So I guess that scoring_unit_sample_size is the number of samples used, i.e. the number of biological replicates, and number_units_scored is the number of measurements performed on a single scoring unit, i.e. the number of technical replicates per biological replicate. Very convoluted. I would then identify these as fields that could come up as non-compulsory fields to be filled in when submitting mean trait scoring data. [Image: screen shot 2016-05-03 at 10 54 59]

    However, again, maybe we should not make this compulsory, as I think that it would be a pain to apply this to large-scale experiments. Ideally, people should upload their raw data and maybe in the future, people can do some minor statistical analysis in the BIP, where calculation settings are recorded automatically.

  8. The plot overview is useful when Plots are made compulsory (which they should be) to see, in the case of raw data, whether any plot locations skew the results so unnaturally that they should basically be excluded from subsequent analysis. Good question. I wonder whether one image per trial per year is enough, as plot layouts may differ between years. But maybe a different layout then makes it a new trial altogether. This ties in with 1. and I hope I will hear back from them soon.
nowakowski commented 8 years ago

@Nuanda With regard to 'state of data' - glad to see we're thinking along the same lines, however adding this column to plant_trials instead of plant_scoring_units is tantamount to declaring that, for a given plant trial, all scoring units are of the same type (block/pot or replicate). @teatree1212 - are you sure this requirement will always be fulfilled?

Nuanda commented 8 years ago

@teatree1212 Re. your previous comment - that column (replicate_score_reading) was removed in #123 (it was all '1' anyway). Now - we can reintroduce it, but perhaps we should do that on the PSU level? Then, it will be submitted as an additional column in the scoring csv file in the 3rd step of the submission.

Also:

  1. This is some kind of misunderstanding. In the OP you mentioned "units" as a compulsory field. I tried to clarify what you meant - and I assume you meant the "Units of measurements" dual selection/free text box inside the new Trait Descriptor form, during the 2nd (not the 3rd) step of the submission (the selector is for using units already present in BIP, the free text is for introducing new units). I never suggested an additional trait scoring column for units and I agree the units should be kept in the TraitDescriptor table, as it is now.
  2. Ok, so it seems we have concluded, that we are going to keep the replicate information either inside the PSU record, or inside the (optional) DesignFactor record related to the PSU, and not inside individual TraitScores related to the PSU (of which there may be many). Should this replicate column be of an integer type? And should it be made compulsory for 'raw data' submissions and should not be required (or even present) for mean data submissions?
  3. I leave it to you and to @nowakowski to sort that one out ;), but regarding your question in the last paragraph - no, PSU are not individual scores, TraitScores are. A PSU may have 0 or more TraitScores. I always assumed a PSU is an individual plant, for raw data, or could perhaps represent a set of plants, for average/mean data. And yes - PSUs are related to PlantAccessions. So, if you want to traverse from a TraitScore to the related PlantLine (if present), you do that through PSU -> PlantAccession. But that actually makes sense, doesn't it - if a PSU represents an actual plant, it should be (optionally) related to a PlantLine/PlantVariety in the data, right?
  4. So I guess this also kind of replies to the concern raised by @nowakowski in the last comment - since an individual PlantTrial/Submission is exclusively about raw data, or about mean data, we may have the three-way exclusive 'state of data' switch on the PlantTrial level, not on the PSU level.

1., 5., 8. Waiting for further comments from the experimentalists.
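
To make the traversal described above concrete (TraitScore to PSU to PlantAccession to PlantLine), here is a small Python sketch. The class names mirror the model names mentioned in this thread, but the attributes and the `plant_line_for` helper are assumptions for illustration only and do not reproduce the actual BIP schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlantLine:
    name: str


@dataclass
class PlantAccession:
    # The link to a PlantLine is optional in this sketch.
    plant_line: Optional[PlantLine] = None


@dataclass
class PlantScoringUnit:
    plant_accession: Optional[PlantAccession] = None
    # A PSU may have 0 or more TraitScores.
    trait_scores: List["TraitScore"] = field(default_factory=list)


@dataclass
class TraitScore:
    value: str
    plant_scoring_unit: PlantScoringUnit


def plant_line_for(score):
    """Walk TraitScore -> PSU -> PlantAccession -> PlantLine, or return None."""
    accession = score.plant_scoring_unit.plant_accession
    return accession.plant_line if accession is not None else None
```

The helper returns `None` whenever a link in the chain is missing, which matches the "(if present)" caveat above.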
nowakowski commented 8 years ago

Regarding 5: Relating PSU to both PlantLine and PlantVariety is somewhat suspect from the data integrity PoV, but I can add the necessary validations to ensure that no ambiguities arise. @teatree1212 - do you mean that a PSU must always be related to either a PlantLine or a PlantVariety and that it can never be related to both at the same time?
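
If the answer turns out to be "exactly one of the two", the validation could look roughly like the following Python sketch. The function name and the id arguments are hypothetical; the real check would live in the application's model layer.

```python
def check_psu_parent(plant_line_id, plant_variety_id):
    """Return an error message, or None when exactly one parent is set."""
    has_line = plant_line_id is not None
    has_variety = plant_variety_id is not None
    if has_line and has_variety:
        return "a PSU cannot be related to both a PlantLine and a PlantVariety"
    if not has_line and not has_variety:
        return "a PSU must be related to a PlantLine or a PlantVariety"
    return None
```

Comparing against `None` rather than relying on truthiness keeps an id of 0 from being mistaken for a missing relation.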

teatree1212 commented 8 years ago

@nowakowski 'however adding this column to plant_trials instead of plant_scoring_units is tantamount to declaring that, for a given plant trial, all scoring units are of the same type (block/pot or replicate). @teatree1212 - are you sure this requirement will always be fulfilled?' Can you specify which field we are talking about? Are we talking about PSU?

nowakowski commented 8 years ago

@teatree1212 Sorry for not being precise enough - I meant the column mentioned in item 7 of your previous message ("Yes, these should be mutually exclusive and the character of the data displayed in an additional column.") - let's provisionally call it scoring_unit_type.

teatree1212 commented 8 years ago

@nowakowski Now I understand. I am not absolutely sure, but experimental trials are normally standardised and do not mix different features within a single trial. As soon as you have values generated from a pot and a block experiment, these are already two different trials, as one has taken place in a greenhouse and one in the field. I would even think that within a trial all scoring units are of the same type, both in terms of block/pot AND replicate.

teatree1212 commented 8 years ago

@Nuanda in response to your unnumbered columns regarding adding to the PSU, I totally agree that this should be submitted in the sheet. This is generally the case in experimental field trial spreadsheets. replicate_score_reading is hence a technical replicate I suppose, right? This would hence be something like [Image: screen shot 2016-05-06 at 14 55 50], which you also see in the RIPR spreadsheets (e.g. the tocopherol sheet). However, how can you make sure that the user doesn't mix up the order? I would again suggest letting the user create the template .csv after they have selected the traits, with all other compulsory columns also named as headers. Then the user has to copy and paste the right values beneath the right header before uploading it. But I think a different strategy would confuse the user even more?

2. Okay, sorry, that must have been confusing to reconstruct then. Now I understand. Yes, I mean Units_of_measurements. But during the submission of a new trait, this should be made a compulsory field.

3. Sometimes there may not be a replicate despite it being raw data, so I wouldn't make it compulsory. For mean data, I don't think it needs to be present.

  1. A PSU should ALWAYS be related to a PlantLine/PlantVariety. Otherwise the meaning of the data gets lost. It needs to be traceable back to a Line or Variety. This is the whole point of the experiments: to find out what traits can be associated with different PlantLines/PlantVarieties. Yes, I suppose plant_scoring_units in BIP are the sample_ids in the RIPR data (I was trying to map the RIPR data against the BIP fields). Is that what you mean when saying individual plant?

Haven't heard back from experimentalists yet.

teatree1212 commented 8 years ago

I received an email about the RIPR data which I want to share with you in this context, and which may also be useful for the development of the API #481. I sent it to @Nuanda and @nowakowski. It is about the minimum requirements they think are important for a particular dataset, which was generated from a greenhouse experiment. It also shows that many columns of the dataset are not relevant for an outsider and don't need to be uploaded to the database.

teatree1212 commented 8 years ago

But after talking to Wiktor, he is of the opinion, that we should upload all information in the spreadsheets. Maybe this email can be used as basis for raw data minimum requirements of what to visualise on the website. I will meet Wiktor on Wednesday to follow up on several threads.

Nuanda commented 8 years ago

@teatree1212 In the meantime, before you talk to Wiktor...

teatree1212 commented 8 years ago

.csv file location: Yes, I have discovered it today actually (: looks good! Maybe you could move the "replace_it" to the front and call it "replace_example_sample_A_value_0" . However, I somehow expected it to be in submission step 2, appearing underneath the add new trait descriptor button. Alternatively, it could appear first thing in submission step 3 so people notice it easily. I think it is rather hidden where it is right now.

Reintroduction of technical replicates: Technical replicates should ideally be taken from the same sample, yes. Otherwise it would be a biological replicate. So this type of replicate option has been removed from BIP? As we see that technical replicates are being used in the RIPR project, we should probably reintroduce it, as it will be used in other studies, too.

Nuanda commented 8 years ago

I think we will need a kind of "2.5 step" - at the beginning of step 3 we will ask the user several questions and then we will generate a CSV template for her/him. And given the added complexity of the parsing process, we will probably have to require the input in exactly the form that was generated anyway - otherwise there are too many different scenarios...

One such question would be "What number of technical replicates were scored?" (1+). Another one could be "Do you submit/register accession numbers for individual units?" (yes/no). Yet another, for raw data only, could be the set of checkboxes for "polytunnel", "rep", "block", "pot" so the user can tick only the ones that s/he has data for (for instance, I noticed OREGIN uses "plot" assignment - we can add that option too). We may also ask if individual rows have assigned varieties/cultivar names OR line names (requiring exactly one to be selected).

This "mini-wizard" will then generate a correct template file, with column names that will be easily recognizable to the BIP parser.
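
The mini-wizard idea can be sketched in Python as follows. This is only an illustration under assumptions: the column names (`sample_id`, `line_name`, `variety_name`, `accession`, the `_repN` suffix) and both function names are invented here, and the real template layout would be whatever the BIP parser expects.

```python
import csv
import io

# Design factor checkboxes mentioned in the thread.
FACTOR_OPTIONS = ("polytunnel", "rep", "block", "pot", "plot")


def template_header(trait_names, factors=(), technical_replicates=1,
                    with_accessions=False, identify_by="line"):
    """Build the column headers for the scoring template from the wizard answers."""
    if technical_replicates < 1:
        raise ValueError("at least one technical replicate is required")
    if identify_by not in ("line", "variety"):
        raise ValueError("rows must be identified by line or by variety")
    unknown = [name for name in factors if name not in FACTOR_OPTIONS]
    if unknown:
        raise ValueError(f"unknown design factors: {unknown}")

    header = ["sample_id", f"{identify_by}_name"]
    if with_accessions:
        header.append("accession")
    header.extend(factors)
    for trait in trait_names:
        if technical_replicates == 1:
            header.append(trait)
        else:
            # One column per technical replicate of each trait.
            header.extend(f"{trait}_rep{n}" for n in range(1, technical_replicates + 1))
    return header


def template_csv(header):
    """Serialise the header row as a one-line CSV document."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(header)
    return buffer.getvalue()
```

Generating the header from the answers, and then accepting only files with that exact header, keeps the number of scenarios the parser has to handle small.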

Nuanda commented 8 years ago

(Moved year and plot overview discussion, points 1. and 8. in the above thread, to #499 for clarity).

teatree1212 commented 8 years ago

2.5 step: sounds like a good idea.

Plot vs Pot: this is because OREGIN trials are only field trials, whereas the RIPR data we have for now is only from greenhouse trials, I think. Therefore, what we could do is ask for the trial type somewhere in step 1 and then, depending on the answer, present the respective "rep" "block" "plot" for field trial data and "polytunnel" "rep" "block" "pot" for greenhouse trials. There could also be growth chamber experiments and I don't know how they are set up.

Nuanda commented 8 years ago

Reopening - wrong issue link in @nowakowski's PR. Sorry.

Nuanda commented 8 years ago

Plot vs Pot (continued): So maybe we could give the user a bit more freedom, and present a set of 5 checkboxes: [ ] polytunnel [ ] rep [ ] block [ ] pot [ ] plot and ask them to pick the set of categories that represent their trial setup? So if one chooses options 2, 3, and 4, the resulting scoring template CSV will have the three "rep" "block" "pot" column headers, but no "polytunnel" or "plot" column headers?

Then, we will record that set, as "/"-separated string, in PlantTrial.design_factors column (see #502 for example value in current BIP DB), and record the three values (submitted under those three CSV headers in the uploaded file) in separate DesignFactor objects (one per PSU).

@teatree1212 What do you think?

teatree1212 commented 8 years ago

5 checkboxes sounds good as we have design_factors 1 to _5 and I don't suspect more than 5 elements come up.

I did some research into TrialDesigns and it is unfortunately not that easy to just provide checkboxes, as sometimes a rep can be equal to a plot in statistical terms (see the experimentalDesign fields in issue #502), and the design_type determines the statistical_factors, and hence the design_factors within an experiment.

The reason for me being hesitant is the future plan of applying analytics to these datasets. It is important to map the same elements to the corresponding fields in the database, in case the database at some point is supposed to do these automated statistical analyses. I have further contacted Graham King to explain the current setup of the database to me, as all elements are in there, but I don't understand how an automated analysis could be achieved with this setup.

In other experiments the Polytunnel may be a site for a field trial. I am trying to find out what is what, how many different possibilities there are and what their overlap is. I think ultimately the user needs to be able to map their experimental design to the corresponding fields in the database, as the column header itself can be misleading.

I think in general your approach is pretty clever and maybe we just go with it if GK's explanation of the fields does not match our aims. This would mean that statistics would need to be done manually in the future (what I mean is, for example, manually selecting blocks or reps or polytunnels/fields for ANOVA). This means that raw data cannot automatically be turned into analysed data in the database, and analysed data is therefore not available for download by default unless these analyses have been performed on it.

teatree1212 commented 8 years ago

So, after hearing back from the original database creator, Graham King (GK), I think we should try to stick to the schema and to what we insert into the fields as much as possible. The reason is that they are currently developing R-based analysis for raw data, and they were happy to share the code with us once it is done.

I see that in the design_factor_1 to _5 naming convention they keep the header a "design factor" and add the actual experimental design factor information into the fields beneath the headers, together with the respective identifier from the dataset: e.g. block_1, row102, ... Looking into the database, this has been more or less consistently followed.
So if we do this, maybe we can follow your tickbox approach, as this is easier for the user, but parse it to the database to store it in the format, or rather naming convention, that is already used ("<designFactorName>_<designFactorNumber/Identifier>" = block_1).

E.g. (text visible to the user is in double quotes):

"Please select your appropriate design factors of your trial design"

- the user ticks elements from an experimental design, sorted to be decreasing in size

"[ ] field [ ] polytunnel [ ] rep [ ] block [ ] pot [ ] plot [ ] occasion [ ] rep=block [ ] ..."

... and all the other elements that are available already

Once they are selected, they are used as headers in the 2.5 spreadsheet, and the user can add their identifying numbers/identifiers ( 1 , 102, ... ) to the respective columns and upload the sheet again.

Then, this needs to be parsed in a way that the designFactorNames (at this stage) headers end up in each of the fields in the columns, in combination with the user's submitted numbers/identifiers ( 1, 102,..). And the design_factor _1 to _5 being the new headers.

Design_factor _1 to _n of the experiment will then be stored in the Design_Factors

Back to the selected designFactorNames, they should be stored together separated by / in the Plant_Trial.statistical_factors column.
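
A minimal Python sketch of the parsing step described above, under stated assumptions: the function names and the dictionary layout are hypothetical, and the underscore joining the name and the identifier is inferred from the block_1 example. It is not BIP code.

```python
MAX_FACTORS = 5  # design_factor_1 to design_factor_5


def encode_design_factors(factor_names, row):
    """Turn one uploaded row into design_factor_N values such as 'block_1'.

    factor_names: selected designFactorNames, largest first.
    row: mapping from a factor name (the CSV header) to the user's identifier.
    """
    if len(factor_names) > MAX_FACTORS:
        raise ValueError(f"at most {MAX_FACTORS} design factors are supported")
    encoded = {}
    for position, name in enumerate(factor_names, start=1):
        encoded[f"design_factor_{position}"] = f"{name}_{row[name]}"
    return encoded


def statistical_factors(factor_names):
    """Join the selected names with '/' for the trial-level column."""
    return "/".join(factor_names)
```

For a row with polytunnel "A", block "1" and pot "102", this would give `polytunnel_A`, `block_1` and `pot_102` in design_factor_1 to _3, and the trial-level string `polytunnel/block/pot`.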

I see one problem creeping up as terms get added: ideally, in the Plant_Trial.statistical_factors column, the statistical factors should also be sorted to be decreasing in size. With the submission of new designFactorNames, they would by default be added to the bottom, I suppose. In the case of the tick-box approach, maybe it is possible to drag and drop the new designFactorName between the old ones, which make room for it (I don't know how hard this is to develop). In the case of the 5-field-dropdown approach, we would have to ask the user to select the designFactorNames in decreasing size. This is a natural thing to do for them, as this is also how they think in terms of their statistical units and how the spreadsheets are set up.

What do you think?

Nuanda commented 8 years ago

A 15-item checkbox block already looks a bit intimidating. I think the selector-based approach may suit the user better here. So, we can have it like that:

  1. Explain to the user in detail what s/he is about to describe, and why the decreasing size of factors is important. Also mention the "maximum 5" limit.
  2. Start with a single select-free-text-combo: you can see an example of such a combination in the "Trait category" field in the new Trait Description form in the 2nd step of the submission.
    • here the user would either select the first design factor name from the list to the left, or supply her/his own one in the field to the right
    • I would personally try to limit the number of characters in the free input section - we don't want the user to put too many characters in there; also the "/" character will be forbidden here (because of the statistical_factors "notation")
  3. After selecting this first factor, the user will be able to add the 2nd, the 3rd...
    • we might give warnings when the user selects a larger factor after a smaller factor (but obviously we can't do that for the free-text input: we'll need to trust the user here).

The rest stays exactly as you described. Is what I propose sensible?
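
The checks listed above could be sketched in Python like this. The length limit, the ranking in `KNOWN_FACTORS` and the function name are placeholders chosen for illustration; the thread does not fix any of them.

```python
MAX_FACTORS = 5
MAX_NAME_LENGTH = 30  # assumed limit; no number is given in the thread
# Illustrative ranking, largest first; free-text names are not ranked.
KNOWN_FACTORS = ["field", "polytunnel", "rep", "block", "pot"]


def check_factors(names):
    """Return (errors, warnings) for an ordered list of design factor names."""
    errors, warnings = [], []
    if len(names) > MAX_FACTORS:
        errors.append(f"at most {MAX_FACTORS} design factors can be given")
    for name in names:
        if "/" in name:
            errors.append(f"'/' is not allowed in factor name {name!r}")
        if len(name) > MAX_NAME_LENGTH:
            errors.append(f"factor name {name!r} is too long")
    # Only known factors can be compared; free text is trusted as entered.
    ranks = [KNOWN_FACTORS.index(name) for name in names if name in KNOWN_FACTORS]
    if any(later < earlier for earlier, later in zip(ranks, ranks[1:])):
        warnings.append("a larger factor is listed after a smaller one")
    return errors, warnings
```

Ordering problems are reported as warnings rather than errors, in line with the idea of warning the user while trusting free-text input.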

teatree1212 commented 8 years ago

sounds good.

  1. will take care of that
  2. I like the select-free-text combo. And I will construct a sentence about not to use / in the words ( which is probably unlikely anyways but needs to be made clear)
  3. Sounds good, too. A warning may be good. There is a pattern already visible in the examples we have, with pot being the lowest and occasion/treatment the highest.

I spotted a little issue here: some of the statistical factors are called "replicate" whereas others are called "rep". But rep = replicate. I suppose that to list these fields as examples, you did a SELECT statistical_factors FROM Plant_Trials;
Can you correct the "replicate" entries to "rep"? It is a normal term everyone knows and I can also define rep = replicate somewhere in an explanatory text.
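
A one-off clean-up of that kind might look like the Python sketch below. The function name is hypothetical, and it assumes the stored values are "/"-separated strings whose whitespace around each name should be left untouched.

```python
import re


def replicate_to_rep(statistical_factors):
    """Rename the factor 'replicate' to 'rep' in a '/'-separated string."""
    return "/".join(
        # Only a part that is exactly 'replicate' (plus whitespace) is renamed.
        re.sub(r"^(\s*)replicate(\s*)$", r"\1rep\2", part)
        for part in statistical_factors.split("/")
    )
```

Matching whole parts only means that other factor names which merely contain the word "replicate" are not changed.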

Nuanda commented 8 years ago

@kammerer Tomasz, what do you think about my proposition two comments before? Anything missing or not possible?

kammerer commented 8 years ago

@Nuanda I don't see any obvious obstacles. An idea - in each subsequent select field we may limit the user to a smaller factor than the one selected before (we may show unavailable options in a disabled state).

Have you thought about some way of moderation of submitted factors to make sure that the order makes sense?

Nuanda commented 8 years ago

Your idea to disable (grey out) factors bigger than the one just selected is very good. This should prevent the user from mixing up the sizes when using the "select" part of the widget.
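
As a sketch of the greying-out logic (in Python for illustration, although the real widget would run in the browser), assuming the known factors are kept in a list ordered from largest to smallest; the function name is hypothetical:

```python
def selectable_options(known_factors, previous_choice):
    """Map each known factor to True (enabled) or False (greyed out).

    known_factors must be ordered from largest to smallest. Only factors
    strictly smaller than the previous choice stay enabled.
    """
    if previous_choice not in known_factors:
        # Free-text or missing previous choice: nothing can be ruled out.
        return {name: True for name in known_factors}
    cutoff = known_factors.index(previous_choice)
    return {name: position > cutoff for position, name in enumerate(known_factors)}
```

When the previous factor was entered as free text, every option stays enabled, since its size relative to the known factors is not known.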

No additional validations for the free text part are planned. We simply rely on the user reading the instruction text. I'll add the "implementation" task for that soon.

Nuanda commented 8 years ago

@teatree1212 Have you perhaps been able to write down that explanation for the BIP user which I mentioned in bullet 1) several comments ago? In case you've posted it somewhere already, can you remind me where to look for it?

teatree1212 commented 8 years ago

No, I haven't, but here it is:

Please select the statistical design_factors appropriate for your experimental setup from the (drop-down?) fields. Start by selecting the highest design_factor first and proceed with the next ones in decreasing size. If your factors are not present in the list provided, please add them manually. When creating new factors, please make sure you comply with the rule to submit them in decreasing size, from the highest missing factor (e.g. treatment) first to the lowest missing factor (e.g. plot).

Nuanda commented 8 years ago

@teatree1212 Do you think we can close this thread? Are there any outstanding things here, not listed as separate GH issues, which we should address?