Closed Nuanda closed 8 years ago
In a discussion today, I identified minimum requirements for the plant trial submission which should be made compulsory and which do not include plant accessions at all (#488). I asked them explicitly about the accessions: these plant accessions are used differently between projects and scientists, may have no meaning for others, and are hence a useless bunch of numbers and letters in a column. For internal use, however, they may be very valuable. Therefore, it is most likely important to keep them "invisible(?)" for the user's own personal use. This is a practice that the SRA uses for submission: it grants the user an additional column for lab-internal accession_ids. In the case of the RIPR project there are many accessions; I think this is because so many universities are collaborating, each with its own identifier.
I will send you a file containing RIPR datasets (data not valid). The structure differs somewhat between them, depending on whether it is raw data or final data. Also, note that one of the similarities is the use of the Plot/pot and replicate columns, which people identified as important information when submitting raw data (#488).
For the record - the key points of the outcome of the discussions with @wjurkowski concerning the Excel dumps of RIPR trials were as follows:
TraitScore objects, each of which is bound to a specific PSU.
DesignFactor object assigned, containing information regarding the placement of the pot (pot #, line # etc.) - there are five generic fields in DesignFactor which can be used for this purpose.
PlantAccession object assigned. This is the only way in which we can relate the pot to a specific genus/species/subspecies (or variety) - otherwise we would have to extend the data model, which would be undesirable for extraneous reasons. As such, the plant_accession_id in PSU does not appear optional - it must be provided for each record. My plan was to attempt to find the corresponding PlantAccession object for each pot and create one if no matching object exists. I am happy to use a different strategy, but at the end of the day we need to be able to relate each PSU to a specific TaxonomyTerm (which is why the PSU<->PA relation exists).
Having reviewed the discussions between @Nuanda and @teatree1212 - I hereby propose extending the PlantScoringUnit with a direct link (foreign key) to the PlantLine table. This will enable us to unambiguously assign a plant line to each plant scoring unit, without having to go through PlantAccessions (which - as remarked above - are optional). Yes, it's a circular reference, with all the attendant data integrity problems, but the alternative is (IMHO) far worse - we would need to insert artificial PlantAccession objects into the DB in order to store the genus/species/variety information for each pot.
If you agree with this approach, I can extend the data model accordingly, then for each sample in the RIPR data I can (1) attempt to find the referenced PlantLine; (2) create a new PlantLine if no suitable object can be found.
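The find-or-create step proposed above could look roughly like the sketch below. It is only an illustration: `PlantLine` and `PlantScoringUnit` here are plain in-memory stand-ins rather than the real BIP models, and `PlantLineStore`, `link_samples`, and the `:psu`/`:line` sample keys are hypothetical names introduced for the example.

```ruby
# Hypothetical in-memory stand-ins; the real BIP models live in the database.
PlantLine = Struct.new(:name)
PlantScoringUnit = Struct.new(:name, :plant_line)

# Assumed helper: keeps one PlantLine per (whitespace-trimmed) name.
class PlantLineStore
  attr_reader :lines

  def initialize
    @lines = {}
  end

  # Step (1): look the line up by name; step (2): create it when missing.
  def find_or_create(name)
    key = name.to_s.strip
    raise ArgumentError, "blank plant line name" if key.empty?

    @lines[key] ||= PlantLine.new(key)
  end
end

# Builds one PSU per sample, each linked directly to its PlantLine.
def link_samples(samples, store)
  samples.map do |sample|
    line = store.find_or_create(sample.fetch(:line))
    PlantScoringUnit.new(sample.fetch(:psu), line)
  end
end
```

With this shape, two samples that name the same line end up sharing a single `PlantLine` object, so no artificial accession is needed just to reach the line.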
I think the direct link between PSU and PL (and I suggest maybe even PlantVariety) is a good idea @nowakowski. Please also have a look at #488.
@teatree1212 In #407 you list accession as a mandatory field in the 3rd step of the Trial submission (tabular data upload). Shouldn't we consider it rather optional, in the light of what you've learnt from the users?
Also, we need to consider the scenario in which a plant accession name uploaded by a user is not found in BIP. In this case we might create a new record in the PlantAccessions table. However, the question is which further PA columns we should ask the user to upload. Below is the current PA DB schema:
t.text "plant_accession"
t.text "plant_accession_derivation"
t.text "accession_originator"
t.text "originating_organisation"
t.text "year_produced"
t.date "date_harvested"
t.text "female_parent_plant_id"
t.text "male_parent_plant_id"
t.text "data_provenance"
t.text "data_owned_by"
t.text "confirmed_by_whom"
t.integer "plant_line_id"
Ok, the solution (also after reading #488) seems to be as follows:
Now, when parsing the scoring table uploaded by the user in the third step:
@teatree1212 I have the following questions:
@teatree1212 Do you have an example of real project scoring sheet, like the one you have for RIPR, which uses PlantLine names instead of varieties/cultivars? It would be useful to test this.
Regarding the DesignFactor - please clarify one more thing. When I check the RIPR wax sheet, I have the following for each plant:
plant_sample_id  sample  polytunnel  rep  sub_block  pot_number  line_number
p_0000223        40      1           1    2          4           1
This is my current understanding:
Is the above correct?
The line_number, as Lenka writes, seems to be correlated with accessions/varieties. Should we interpret that column in any way when parsing (e.g. in order to detect the PlantLine inside BIP that we should link the new PSU with)?
The problem with the Line_number in RIPR or PlantLine in BIP is that they are probably not actual "PlantLine Names" in the sense in which I assume the BIP uses the term. The definition of PlantLine is ambiguous, which makes things very confusing. I feel I have tried to explain it somewhere else before, but maybe I just wanted to.
So here is the story about "Lines", starting at the end of the story: the trait scores. Trait scores are the ultimate information you want to collect from any plant types you have. They are used to assess the diversity within the genetic material, i.e. the plants. This genetic material can come from different sources:
1) through evolution and then collection of plant material from all over the world by researchers (Diversity Foundation A);
2) through crossing two plants which are found/suspected to have strongly differing trait(s), and hence genetic material, in order to assess where the genetic origin for a specific trait may lie ("mapping population");
3) through "induced evolution" (experts will probably roll their eyes at this expression) by mutagenesis, where seeds from one plant are exposed to e.g. chemicals or radioactivity, which mutates them, so that the subsequent generation of that plant has slightly different genetic material than the original plant;
4) by using a diversity set that contains registered varieties to better assess their phenotypic (trait) diversity (Diversity Foundation B).
To my understanding, this is where the term LINE comes in. I think it is mostly an experimental setup term, but in the case of 2) and 3) it is considered a standalone term to identify each of the plants, as they all have different genetic material from each other and from the parent(s) and are hence totally new. It is important for (pre-)breeders and scientists to tell them apart because, owing to the differing genetic material, the trait scores will be slightly different in some of those lines. In these cases, lines are unique identifiers not just within the experiment but also for everyone outside the experiment. Some of these lines will only be used for locating a certain trait on the genome. Other lines may be used for breeding future lines, as they may carry a good trait combination which people want to combine with other lines that carry good trait combinations. Still other lines may already be so good that they can be turned into cultivars, which means that their genetic material becomes fixed and commercially available.
The reason why the RIPR data carries this term is more for internal experimental reasons (and, as Lenka says, because some analysis tools prefer numbers to letters). There, a line is associated with a cultivar, and as cultivars are externally, universally recognised identifiers for their corresponding genetic material, they are the important bits for external recognition in the dataset. This means that the line name for a cultivar can differ from project to project(!!), as it is just used as an internal identifier, and I think this is not useful for the BIP.
In my humble opinion, the separation of line and cultivar into two different tables in CropStore was probably only done because the intention of CropStore was partly to facilitate what has been done in 2) and 3): the creation of maps (which is what all the QTLs, markers and linkage maps in BIP are about). The way I understand things, the aim of the BIP has shifted a bit away from the CropStore aim. The BIP aims to simply store phenotypic data for now. Once this is properly set up and things are standardised, it can be expanded and the phenotypic data integrated with other data types.
All in all, both line name and cultivar name are supposed to act as unique identifiers for some genetic material, in order to relate back to the traits that can be associated with this genetic material. In experimental setups where the cultivar name is used, like in RIPR, the line number in a separate column doesn't have the same informational value as in trait scoring trials with material derived from methods in 2) and 3). The RIPR project is more of a mix between 1) and 4).
@Nuanda and @nowakowski sent email with files for Line submission example and mixed submission
@Nuanda re. Design Factor the above is correct
@teatree1212 Thanks a lot for this explanation :).
The final decision wrt PlantAccessions is that we require them in every plant trial submission, one accession (not necessarily unique) per plant scoring unit row in the uploaded scoring file. We require two values, for plant_accession and for originating_organisation.
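As a rough illustration of that rule, the check below flags rows of the uploaded scoring file that lack either required value. It is a sketch under assumptions: rows are taken to be already parsed into hashes keyed by column name, and `REQUIRED_ACCESSION_FIELDS` and `rows_missing_accession` are hypothetical names, not existing BIP code.

```ruby
# The two values required for every plant scoring unit row.
REQUIRED_ACCESSION_FIELDS = %w[plant_accession originating_organisation].freeze

# Assumes each row is a Hash keyed by column name (hypothetical parsed form).
# Returns [row_number, missing_fields] pairs for rows lacking a required value;
# row numbers are 1-based and count data rows only.
def rows_missing_accession(rows)
  rows.each_with_index.filter_map do |row, index|
    missing = REQUIRED_ACCESSION_FIELDS.select { |field| row[field].to_s.strip.empty? }
    [index + 1, missing] unless missing.empty?
  end
end
```

Nothing here enforces uniqueness, so several rows may carry the same accession, in line with the decision above.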
BTW @teatree1212 copied your lengthy explanation of plant lines to the wiki section, so it is not lost in time, like tears in rain, when I eventually close this task ;).
https://github.com/eSpectrum-IT/brassica/wiki/Annemarie's-explanation-on-PlantLines
Very poetic, thank you @Nuanda.
Wiktor's initial idea was to use RIPR data to test API submission (not the manual submission process) - hence we plan to develop a script to do that, see #394. During a teleconference discussion, between me, @wjurkowski and @nowakowski, we decided that a certain extension to CS PlantAccession model is needed - see #392 (I will ask Piotr to comment on that in more detail below).
@teatree1212 Annemarie suggests, and I think she is right, that a similar possibility should be inbuilt in the manual Trial submission - i.e. to be able to influence how PlantAccessions are being created in the process. Let's use this issue discussion thread to decide what needs to be done to enable that.
Now, the current BIP data model assumes that:
As you know, PLs are submitted in the other submission process (Population submission). I asked whether PAs should be submitted there as well, or should they be submitted in the Trial submission process - this stays undecided, I think, but I understand we lean towards the second solution.
Also, currently, the user only supplies the PSU name (the first column of the file uploaded in the 3rd step of the Trial submission process) - we should probably support more PSU columns. For convenience I paste the current PSU database model below (without relations):
So, submission of values for (some of) these columns might be supported as further columns in the file uploaded in the 3rd step of the submission. But it's up to you if these columns make sense.
Going back to PAs. From what I understood during the mentioned teleconference about RIPR data, a single Plant Trial may involve a lot of PAs, so it is probably not feasible to ask the user to manually create those PA records by means of a web form (like it is done for e.g. new Trait Descriptors, in the 2nd step). If this is correct (?), we should probably extend the 3rd step's file definition even further, by introducing more columns which would describe a PA that a particular PSU (remember - we have 1 PSU per file row) is related to.
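To make the "PA columns in the 3rd step file" idea concrete, here is a minimal sketch that reads such a file and groups PSUs by the accession they refer to, so each distinct PA needs to be looked up or created only once. The tab-separated layout, the `psu_name` header, and the function name are assumptions for illustration; only `plant_accession` and `originating_organisation` come from the PA schema.

```ruby
# Assumed layout: a header row, then one PSU per row, tab-separated.
# Returns { [plant_accession, originating_organisation] => [psu names] }.
def group_psus_by_accession(text)
  lines = text.lines.reject { |line| line.strip.empty? }
  header, *rows = lines.map { |line| line.chomp.split("\t") }

  rows.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |cells, groups|
    row = header.zip(cells).to_h
    key = [row["plant_accession"], row["originating_organisation"]]
    groups[key] << row["psu_name"]
  end
end
```

The number of keys in the result is the number of PA records the upload actually involves, which may be far smaller than the number of rows.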
Then, we have at least 2 issues to solve:
@nowakowski Piotr, could you explain what new columns, related to accession identifiers, we were planning to add to the PA table, and give a sample of values from the RIPR data? I think this is valuable for this discussion.