Extend Trial submission to consider Plant Accessions

Nuanda commented 8 years ago

Wiktor's initial idea was to use RIPR data to test API submission (not the manual submission process) - hence we plan to develop a script to do that, see #394. During a teleconference discussion, between me, @wjurkowski and @nowakowski, we decided that a certain extension to CS PlantAccession model is needed - see #392 (I will ask Piotr to comment on that in more detail below).

@teatree1212 Annemarie suggests, and I think she is right, that a similar possibility should be inbuilt in the manual Trial submission - i.e. to be able to influence how PlantAccessions are being created in the process. Let's use this issue discussion thread to decide what needs to be done to enable that.

Now, the current BIP data model assumes that:

each TraitScore is related to exactly one PlantScoringUnit (PSUs can have multiple TSes)
each PSU may be related to one PlantAccession (PAs can have multiple PSUs)
each PA may be related to a PlantLine.
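
The three relations above can be sketched with plain Ruby structs. This is only an illustration, not the actual BIP ActiveRecord models: the struct definitions, the `trait_scores` helper and the sample names `acc_1` and `line_A` are assumptions made for the example.

```ruby
# Plain-Ruby stand-ins for the BIP models (assumed shapes, not the real classes).
PlantLine = Struct.new(:name)
PlantAccession = Struct.new(:name, :plant_line) # plant_line may be nil

PlantScoringUnit = Struct.new(:name, :plant_accession) do # plant_accession may be nil
  # A PSU can have many trait scores.
  def trait_scores
    @trait_scores ||= []
  end
end

TraitScore = Struct.new(:value, :plant_scoring_unit) do
  # Each TraitScore belongs to exactly one PSU, so a missing PSU is an error.
  def initialize(value, plant_scoring_unit)
    raise ArgumentError, "a TraitScore needs exactly one PSU" if plant_scoring_unit.nil?
    super
    plant_scoring_unit.trait_scores << self
  end
end

line = PlantLine.new("line_A")           # hypothetical name
accession = PlantAccession.new("acc_1", line) # hypothetical name
psu = PlantScoringUnit.new("p_0000223", accession)
TraitScore.new("12.5", psu)
TraitScore.new("7", psu)

puts psu.trait_scores.size # → 2
puts psu.plant_accession.plant_line.name
```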

As you know, PLs are submitted in the other submission process (Population submission). I asked whether PAs should be submitted there as well, or should they be submitted in the Trial submission process - this stays undecided, I think, but I understand we lean towards the second solution.

Also, currently, the user only supplies the PSU name (the first column of the file uploaded in the 3rd step of the Trial submission process) - we should probably support more PSU columns. For convenience I paste the current PSU database model below (without relations):

    t.text     "scoring_unit_name" # This is already set by the user
    t.text     "number_units_scored"
    t.text     "scoring_unit_sample_size"
    t.text     "scoring_unit_frame_size"
    t.date     "date_planted"
    t.text     "described_by_whom"
    t.text     "comments"
    t.text     "entered_by_whom" # This is already set, to the user's name, by BIP
    t.date     "date_entered" # This is already set, to the current date, by BIP
    t.text     "data_provenance"
    t.text     "data_owned_by"
    t.text     "confirmed_by_whom"

So, values for (some of) these columns could be supported as further columns in the file uploaded in the 3rd step of the submission. Whether these columns make sense is up to you.

Going back to PAs. From what I understood during the mentioned teleconference about RIPR data, a single Plant Trial may involve a lot of PAs, so it is probably not feasible to ask the user to manually create those PA records by the means of a web form (like it is done for e.g. new Trait Descriptors, in the 2nd step). If this is correct (?), we should probably extend the 3rd step's file definition even further, by introducing more columns which would describe a PA that a particular PSU (remember - we have 1 PSU per file row) is related to.

Then, we have at least 2 issues to solve:

how users identify PAs that are already existing in BIP? (probably -> by name)
what even further columns a user should supply in the file, in case one wants a new PA record to be created?
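
As a rough sketch of the first point (identifying existing PAs by name), the uploaded file could be split into rows that point at known accessions and rows that would need a new PA. The `plant_accession` column name, the sample values and the in-memory set of known names are all assumptions for illustration; the real template and lookup would live in BIP.

```ruby
require "csv"
require "set"

# Hypothetical upload: first column is the PSU name, as in the current template;
# the plant_accession column is an assumed extension.
SAMPLE_UPLOAD = <<~TEXT
  scoring_unit_name,plant_accession
  psu_1,acc_known
  psu_2,acc_new
  psu_3,acc_known
TEXT

# Splits rows into [rows with an existing PA, rows needing a new PA].
# `known` stands in for the accession names already stored in BIP.
def partition_by_accession(csv_text, known)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  rows.partition { |row| known.include?(row["plant_accession"]) }
end

existing, to_create = partition_by_accession(SAMPLE_UPLOAD, Set["acc_known"])
puts "existing PA: #{existing.map { |r| r['scoring_unit_name'] }.join(', ')}"
puts "new PA needed: #{to_create.map { |r| r['plant_accession'] }.uniq.join(', ')}"
```

The second point, which further columns a new PA needs, is left open here on purpose: the sketch only shows where such rows would be detected.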

@nowakowski Piotr, could you explain what new columns, related to accession identifiers, we were planning to add to the PA table, and give a sample of values from the RIPR data? I think this is valuable for this discussion.

teatree1212 commented 8 years ago

In a discussion today, I identified minimum requirements for the plant trial submission which should be made compulsory and which do not include plant accessions at all (#488). I asked them explicitly about the accessions and, in fact, plant accessions are used differently between projects and scientists and may have no meaning for others; to outsiders they are just a useless bunch of numbers and letters in a column. For internal use, however, they may be very valuable. Therefore, it is most likely important to keep them "invisible(?)", for the user's own personal use. This is the practice the SRA uses for submission: it grants the user an additional column for lab-internal accession_ids. In the case of the RIPR project there are many accessions; I think this is because so many universities are collaborating and each of them has its own identifier.

I will send you a file containing RIPR datasets (data not valid); the structure differs somewhat between them, depending on whether it is raw data or final data. Also, note that one of the similarities is the use of a Plot/pot and replicate column, which people identified as important information when submitting raw data (#488).

nowakowski commented 8 years ago

For the record - the key points of the outcome of the discussions with @wjurkowski concerning the Excel dumps of RIPR trials were as follows:

A PSU will be defined for each plant (i.e. for each pot) participating in the trial. This is due to the fact that different trait scores are available for each pot and these need to be stored in separate TraitScore objects, each of which is bound to a specific PSU.
Each PSU will have a DesignFactor object assigned, containing information regarding the placement of the pot (pot #, line # etc.) - there are five generic fields in DesignFactor which can be used for this purpose.
A PSU must have a PlantAccession object assigned. This is the only way in which we can relate the pot to a specific genus/species/subspecies (or variety) - otherwise we would have to extend the data model, which would be undesirable for extraneous reasons. As such, the plant_accession_id in PSU does not appear optional - it must be provided for each record. My plan was to attempt to find the corresponding PlantAccession object for each pot and create one if no matching object exists. I am happy to use a different strategy, but at the end of the day we need to be able to relate each PSU to a specific TaxonomyTerm (which is why the PSU<->PA relation exists).
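
The "find the corresponding PlantAccession, create one if none matches" plan could look roughly like the sketch below. It uses an in-memory registry instead of the real PlantAccession table; the class, its methods, and the sample pots and taxon value are assumptions for illustration only.

```ruby
# In-memory stand-in for the PlantAccession table (not the real BIP model).
class AccessionRegistry
  Accession = Struct.new(:name, :taxonomy_term)

  def initialize
    @by_name = {}
  end

  # Returns the accession already stored under this name, or creates it.
  def find_or_create(name, taxonomy_term)
    @by_name[name] ||= Accession.new(name, taxonomy_term)
  end

  def size
    @by_name.size
  end
end

registry = AccessionRegistry.new

# One PSU per pot; several pots may share an accession (sample values are made up).
pots = [
  { psu: "pot_1", accession: "acc_A", taxon: "Brassica napus" },
  { psu: "pot_2", accession: "acc_A", taxon: "Brassica napus" },
  { psu: "pot_3", accession: "acc_B", taxon: "Brassica napus" }
]

psus = pots.map do |pot|
  pa = registry.find_or_create(pot[:accession], pot[:taxon])
  { name: pot[:psu], plant_accession: pa }
end

puts registry.size                                # → 2
puts psus.map { |psu| psu[:plant_accession].name }.join(", ")
```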

nowakowski commented 8 years ago

Having reviewed the discussions between @Nuanda and @teatree1212 - I hereby propose extending the PlantScoringUnit with a direct link (foreign key) to the PlantLine table. This will enable us to unambiguously assign a plant line to each plant scoring unit, without having to go through PlantAccessions (which - as remarked above - are optional). Yes, it's a circular reference, with all the attendant data integrity problems, but the alternative is (IMHO) far worse - we would need to insert artificial PlantAccession objects into the DB in order to score the genus/species/variety information for each pot.

If you agree with this approach, I can extend the data model accordingly, then for each sample in the RIPR data I can (1) attempt to find the referenced PlantLine; (2) create a new PlantLine if no suitable object can be found.

teatree1212 commented 8 years ago

I think the direct link between PSU and PL (and I suggest maybe even PlantVariety) is a good idea @nowakowski. Please also have a look at #488.

Nuanda commented 8 years ago

@teatree1212 In #407 you list accession as a mandatory field in the 3rd step of the Trial submission (tabular data upload). Shouldn't we consider it rather optional, in the light of what you've learnt from the users?

Also, we need to consider the scenario that a plant accession name, uploaded by a user, is not found in BIP. In this case we might create a new record in the PlantAccessions table. However, the question is what further PA columns we should ask the user to upload? Below the current PA DB schema:

    t.text     "plant_accession"
    t.text     "plant_accession_derivation"
    t.text     "accession_originator"
    t.text     "originating_organisation"
    t.text     "year_produced"
    t.date     "date_harvested"
    t.text     "female_parent_plant_id"
    t.text     "male_parent_plant_id"
    t.text     "data_provenance"
    t.text     "data_owned_by"
    t.text     "confirmed_by_whom"
    t.integer  "plant_line_id"

Nuanda commented 8 years ago

Ok, the solution (also after reading #488) seems to be as follows:

add links to PlantLine and PlantVariety to PSU table, keep the link to PlantAccession as it is
make links to PL and PV mutually exclusive (no PSU should be related to both of these at once), but required (i.e. every PSU is related to either a PL or a PV)
keep the PSU's relation to PA optional.
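
The three rules above amount to an "exactly one of PL or PV" check, with PA left unchecked. A minimal sketch in plain Ruby follows; in BIP this would presumably be a model validation, and the `PsuDraft` struct and its `errors` method are assumptions made for the example.

```ruby
# Plain-Ruby stand-in for a PSU record being built during parsing (assumed shape).
PsuDraft = Struct.new(:name, :plant_line, :plant_variety, :plant_accession,
                      keyword_init: true) do
  # Collects rule violations; plant_accession is optional and never checked.
  def errors
    errs = []
    if plant_line && plant_variety
      errs << "PSU cannot be linked to both a PlantLine and a PlantVariety"
    elsif plant_line.nil? && plant_variety.nil?
      errs << "PSU must be linked to a PlantLine or a PlantVariety"
    end
    errs
  end

  def valid?
    errors.empty?
  end
end

puts PsuDraft.new(name: "psu_1", plant_variety: "Fortis").valid?                  # → true
puts PsuDraft.new(name: "psu_2").valid?                                           # → false
puts PsuDraft.new(name: "psu_3", plant_line: "L1", plant_variety: "Fortis").valid? # → false
```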

Now, when parsing the scoring table uploaded by the user in the third step:

if a given PA name was not found in the BIP, create a new PA record with that name and link the new PSU to that record.

@teatree1212 I have the following questions:

Should we do the same for PlantVarieties? For instance, in the RIPR data, assume 'Fortis' is not recognized as a BIP PV name. Should the system create a new PV record called 'Fortis' and link that new record to the submitted PSU?
How do we deal with cases when a PSU relates to a PL rather than directly to a PV? One possibility would be, for instance, to compare each value from the CSV column against existing PlantLines first; if not found, against existing PlantVarieties; and if still not found - see point 1. above. Is that heuristic correct? We might call that CSV column "line/variety/cultivar name" in the CSV template generated for the user.

teatree1212 commented 8 years ago

The RIPR data will, as I said, need to link to PlantVarieties rather than Lines. Therefore, a new PV record (e.g. Fortis) should be created. All RIPR cultivar names should be in the cultivar names_final Google doc. Maybe you can import them into the database beforehand, or use this to test the creation and import of new variety names. In the Google doc, some RIPR cultivar names have been slightly changed (an added ' for example), so they need recognising and renaming once they are submitted. In #447 I also added other cultivar repositories where spelling can be checked, and two ways in which you could take advantage of existing spell-checking code.
I think that strategy makes sense. However, PlantLines will always be part of a PlantPopulation, as all lines from one Population have the same parents. Therefore, if the PlantLine is not found, it does not necessarily mean that the name is a variety/cultivar name; it may be that the PlantPopulation information has not been submitted. So, should the name not come up in PL or PV, it may be helpful to return a message asking the user to clarify whether these are PVs or PLs and, in case of the latter, to submit the population first.
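
The lookup order discussed here (PlantLines first, then PlantVarieties, otherwise ask the user) can be sketched as a small resolver. The function name, the result hash and the message wording are assumptions; `known_lines` and `known_varieties` stand in for the BIP lookups.

```ruby
# Resolves one "line/variety/cultivar name" value from the uploaded file.
# known_lines / known_varieties are stand-ins for queries against BIP.
def resolve_name(name, known_lines:, known_varieties:)
  if known_lines.include?(name)
    { kind: :plant_line, name: name }
  elsif known_varieties.include?(name)
    { kind: :plant_variety, name: name }
  else
    { kind: :unknown, name: name,
      message: "'#{name}' matches no PlantLine or PlantVariety. " \
               "If it is a line, submit its PlantPopulation first; " \
               "if it is a variety, confirm that a new record should be created." }
  end
end

result = resolve_name("Fortis", known_lines: ["line_A"], known_varieties: [])
puts result[:kind]    # → unknown
puts result[:message]
```

Returning a message instead of silently creating a record keeps the user in the loop for the ambiguous case, which is the point raised about missing PlantPopulation submissions.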

Nuanda commented 8 years ago

@teatree1212 Do you have an example of real project scoring sheet, like the one you have for RIPR, which uses PlantLine names instead of varieties/cultivars? It would be useful to test this.

Regarding the DesignFactor - please clarify one more thing. When I check the RIPR wax sheet, I have the following for each plant:

plant_sample_id  sample  polytunnel  rep  sub_block  pot_number  line_number
p_0000223         40           1     1          2          4              1

This is my current understanding:

polytunnel == 1 means the "small tunnel" in this experiment case (I recognize this only through rep number)
rep == 1 means the upper half of the "small tunnel"
sub_block == 2 means the second, light-green/dark-green rectangle in the "small tunnel"
pot_number == 4 means the exact F-6 cell in the "small tunnel" sheet

Is the above correct?

The line_number, as Lenka writes, seems to be correlated with accessions/varieties. Should we interpret that column in any way when parsing (e.g. in order to detect the PlantLine inside BIP that we should link the new PSU with)?
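
For reference, the placement columns of the wax-sheet row quoted above could be pulled into DesignFactor-style values roughly as follows. This is only a sketch: the constant and function names are assumptions, and `line_number` is deliberately left out because its interpretation is still an open question here.

```ruby
# Column order as in the RIPR wax sheet quoted above.
SHEET_HEADER = %w[plant_sample_id sample polytunnel rep sub_block pot_number line_number].freeze

# Columns treated as placement information (assumed mapping to DesignFactor fields).
DESIGN_FACTOR_COLUMNS = %w[polytunnel rep sub_block pot_number].freeze

# Turns one whitespace-separated sheet row into a hash of integer design factors.
def design_factors(line)
  row = SHEET_HEADER.zip(line.split).to_h
  row.slice(*DESIGN_FACTOR_COLUMNS).transform_values { |value| Integer(value) }
end

factors = design_factors("p_0000223         40           1     1          2          4              1")
factors.each { |column, value| puts "#{column}: #{value}" }
```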

teatree1212 commented 8 years ago

The problem with the Line_number in RIPR or PlantLine in BIP is that they are probably not actual "PlantLine Names" as I assume the definition for PlantLine is used in the BIP. The definition of PlantLine is ambiguous, which makes things very confusing. I feel I have tried to explain it somewhere else before, but maybe I just wanted to.

So here is the story about "Lines", starting at the end of the story, the trait scores... Trait Scores are the ultimate information you want to collect from any plant types you have. Trait Scores are used to assess the diversity within the genetic material (= the plants). This genetic material can come from different sources: 1) through evolution and then collection of plant material from all over the world by researchers (Diversity Foundation A); 2) through crossing two plants which are found/suspected to have strongly differing trait(s), and hence genetic material, in order to assess where the genetic origin for a specific trait may lie ("mapping population"); 3) through "induced evolution" (experts will probably roll their eyes at this expression...) by mutagenesis, where seeds from one plant are exposed to e.g. chemicals or radioactivity, which makes them mutate, so that the subsequent generation of that plant has slightly different genetic material than the original plant; 4) by using a diversity set that contains registered varieties to better assess their phenotypic (trait) diversity (Diversity Foundation B).

To my understanding, this is where the term LINE comes in. I think it is mostly an experimental-setup term, but in the case of 2) and 3) it is considered a standalone term to identify each of the plants, as they all have different genetic material from each other and from the parent(s) and are hence totally new. It is important for (pre-)breeders and scientists to tell them apart, because, with differing genetic material, the trait scores will be slightly different in some of those lines. In these cases, lines are unique identifiers not just within the experiment but also for everyone outside the experiment. Some of these lines will only be used for locating a certain trait on the genome. Other lines may be used for breeding future lines, as they may carry a good trait combination which people want to combine with other lines that carry good trait combinations. Other lines may already be so good that they can be turned into cultivars, which means that their genetic material becomes fixed and commercially available.

The reason why the RIPR data carries this term is more for internal experimental reasons (and, as Lenka says, because some analysis tools prefer numbers to letters). There, a line is associated with a cultivar, and as cultivars are externally, universally recognised identifiers for their corresponding genetic material, they are the important bits for external recognition in the dataset. This means that the line name for a cultivar can differ from project to project(!!), as it is just used as an internal identifier, and I think this is not useful for the BIP.

In my humble opinion, the separation of line and cultivar into two different tables in CropStore was probably only done because the intention of CropStore was partly to facilitate what is done in 2) and 3): the creation of maps (which is what all the QTLs, markers and linkage maps in BIP are about). The way I understand things, the aim of the BIP has shifted a bit away from the CropStore aim. The BIP aims simply to store phenotypic data for now. Once this is properly set up and things are standardised, it can be expanded and the phenotypic data integrated with other data types.

All in all, both line name and cultivar name are supposed to act as a unique identifier for some genetic material, in order to relate back to the traits that can be associated with this genetic material. In experimental setups where the cultivar name is used, like in RIPR, the line number in a separate column doesn't have the same informational value as in trait scoring trials with material derived from the methods in 2) and 3). The RIPR project is more of a mix between 1) and 4).

teatree1212 commented 8 years ago

@Nuanda and @nowakowski sent email with files for Line submission example and mixed submission

teatree1212 commented 8 years ago

@Nuanda re. Design Factor the above is correct

Nuanda commented 8 years ago

@teatree1212 Thanks a lot for this explanation :).

Nuanda commented 8 years ago

The final decision wrt PlantAccessions is that we require them in every plant trial submission, one accession (not necessarily unique) per plant scoring unit row in the uploaded scoring file. We require two values, for plant_accession and for originating_organisation.
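
A minimal sketch of how that requirement could be checked on the parsed scoring file, assuming each row has already been turned into a hash keyed by column name. The function name and the return shape are assumptions, not the actual BIP parser.

```ruby
# The two values required for every plant scoring unit row.
REQUIRED_ACCESSION_COLUMNS = %w[plant_accession originating_organisation].freeze

# Returns [row_number, missing_columns] pairs for rows that break the rule.
# Row numbers are 1-based and count data rows only (no header).
def rows_missing_accession(rows)
  rows.each_with_index.filter_map do |row, index|
    missing = REQUIRED_ACCESSION_COLUMNS.select { |column| row[column].to_s.strip.empty? }
    [index + 1, missing] unless missing.empty?
  end
end

rows = [
  { "scoring_unit_name" => "psu_1", "plant_accession" => "acc_A", "originating_organisation" => "org_X" },
  { "scoring_unit_name" => "psu_2", "plant_accession" => "acc_A", "originating_organisation" => "" },
  { "scoring_unit_name" => "psu_3", "originating_organisation" => "org_X" }
]

rows_missing_accession(rows).each do |row_number, missing|
  puts "row #{row_number}: missing #{missing.join(', ')}"
end
```

Note that the same accession may appear on several rows (`acc_A` above), which matches the "not necessarily unique" part of the decision.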

Nuanda commented 8 years ago

BTW @teatree1212 copied your lengthy explanation of plant lines to the wiki section, so it is not lost in time, like tears in rain, when I eventually close this task ;).

https://github.com/eSpectrum-IT/brassica/wiki/Annemarie's-explanation-on-PlantLines

teatree1212 commented 8 years ago

very poetic, thank you @Nuanda.

TGAC / brassica

Extend Trial submission to consider Plant Accessions #481