Population Submission- content and compulsory fields

teatree1212 commented 8 years ago

With regard to Subject 6 in issue #488 (restated here: when submitting a Population, 100+ single Lines may need to be submitted. For now, in the wizard submission, this means that the user needs to create all these 100+ lines manually and add a taxonomy term to each of them. A few issues arise from this.)

Issue 1:

1: For a crossing population, a taxonomy term does not need to be made compulsory for each line submitted, as they are derived from their parents.
2: For a mutagenised population, a taxonomy term does not need to be made compulsory for each line submitted, as the mutagenised seeds are all from one parent.
3: For a diversity foundation set, the taxonomy is different, and it hence needs to be a compulsory field for each line or cultivar alike.
4: For a diversity fixed foundation set, the taxonomy is different, and it hence needs to be a compulsory field for each line or cultivar alike.

On that note, I am sharing the Dictionary, where I am working on a definition for "population" that includes all the different forms of genetic material that the BIP covers under that term:

Issue 2: A submission can contain more than a hundred lines, identified by different numbers (cases 1 and 2) or by cultivar names (cases 3 and 4). It would be better to submit a list in .csv or even just .txt format. Necessary columns would be line name, taxonomic term, genetic status and cultivar name, of which line name and taxonomic term should definitely be made compulsory.
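To make the proposal concrete, here is a minimal sketch of how such a list could be checked on upload. The column names (`line_name`, `taxonomic_term`, `genetic_status`, `cultivar_name`) and the function `validate_lines` are assumptions made for illustration; they are not an agreed template or existing BIP code.

```python
import csv
import io

# Hypothetical column names derived from the list above; not an agreed template.
REQUIRED = ("line_name", "taxonomic_term")
OPTIONAL = ("genetic_status", "cultivar_name")


def validate_lines(csv_text):
    """Return (rows, errors) for a plant line list given as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        return [], ["missing column(s): " + ", ".join(missing)]

    rows, errors = [], []
    # Data rows start on line 2 of the file; line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        empty = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if empty:
            errors.append(f"line {line_no}: empty {', '.join(empty)}")
        else:
            rows.append({k: (row.get(k) or "").strip() for k in REQUIRED + OPTIONAL})
    return rows, errors


# Invented sample data: the second line lacks its taxonomic term.
sample = (
    "line_name,taxonomic_term,genetic_status,cultivar_name\n"
    "Line_001,Brassica napus,,\n"
    "Line_002,,,\n"
)
rows, errors = validate_lines(sample)
```

Reporting errors per line, rather than rejecting the whole file at the first problem, would let a submitter fix a 100+ line sheet in one pass.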

Nuanda commented 8 years ago

@teatree1212 Annemarie, can we put this issue on hold until we sort out how to rebuild the Trial submission process, and then come back to this one? I mean - there are a lot of discussion points that remain open at the moment... But if you think this one is required before we deal with the trial submission redesign...

teatree1212 commented 8 years ago

of course.

Nuanda commented 8 years ago

Currently, we have the following population types in BIP: "F2", "DH segregating", "Integrated", "Genetic resource collection", "Diversity core collection", "Diversity foundation", "Substitution lines", "Experimental collection", nil, "Variety resource collection", "Recombinant inbred", "F3 pooled", "F3"

(nil means there are populations that never had any type assigned). Can you tell me how these values map to the four bigger "type groups" that you've listed? I understand that several terms from this list map to a single term from your list.

Commenting on the four alternatives that you've given:

1: So, we should not ask for Taxonomy term in step 2, and neither in "Add new Plant line" form in step 3, but both Parent lines in step 3 should be compulsory, and we should check if they have the same Taxonomy term?
2: Again, we don't ask for TT either in step 2 or in the new line form in step 3, but what about the single parent? Should we hide (and not assign) both male and female parent fields, but require "Previous line name" in the new line form? Is this where CS stores the single parent for mutagenised plants?
3: So we should require TT in the new line form in step 3 but we should not ask for TT in step 2?
4: Same as point 3 above?

teatree1212 commented 8 years ago

I have to ask people about this. Some of the population types sound dubious to me..

teatree1212 commented 8 years ago

After checking the population_types, these elements look OK, as most of them are also used as examples in cs_field_defs_9_1 (see picture). I am not sure what this tells the user though. I have to find that out myself.

About one of them, nil:

SELECT population_type, name, establishing_organisation, assigned_population_name FROM Plant_Populations WHERE population_type IS NULL;

I am assuming that all these populations are actually test populations. Not 100% sure about BraVCS3M_01. It was added in 2015 by Pierre Carion, a person who was in charge of the database.

I was trying to do some more digging, but the schema is not very helpful when it comes to seeing which foreign keys point to Plant_Populations. But possibly my database query skills are limiting here, too.

And this is me assuming that Plant_Populations.id is the key that Plant_Trials refers to, where it is called "plant_population_id" (???), but my search does not return anything:

SELECT plant_trial_name FROM Plant_Trials WHERE plant_population_id= '81';

Nuanda commented 8 years ago

You are assuming correctly. There are simply no plant_trials for this population. When you run:

SELECT DISTINCT(plant_population_id) FROM plant_trials;

you should get all PP ids for which there is at least one plant trial recorded.
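The two checks discussed here (no population_type, no recorded trial) can also be combined into a single query. The sketch below runs it against an in-memory SQLite database with a cut-down, invented schema and invented rows; only the table and column names `plant_populations`, `plant_trials`, `population_type`, `plant_trial_name` and `plant_population_id` come from this thread, and the real BIP schema has more columns.

```python
import sqlite3

# Toy stand-in for the BIP database; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plant_populations (
        id INTEGER PRIMARY KEY,
        name TEXT,
        population_type TEXT
    );
    CREATE TABLE plant_trials (
        id INTEGER PRIMARY KEY,
        plant_trial_name TEXT,
        plant_population_id INTEGER REFERENCES plant_populations(id)
    );
    INSERT INTO plant_populations VALUES (1, 'pop_with_trial', 'F2');
    INSERT INTO plant_populations VALUES (81, 'pop_without_trial', NULL);
    INSERT INTO plant_trials VALUES (1, 'trial_a', 1);
""")

# Populations that have no population_type and no recorded plant trial.
orphans = conn.execute("""
    SELECT pp.id, pp.name
    FROM plant_populations pp
    LEFT JOIN plant_trials pt ON pt.plant_population_id = pp.id
    WHERE pp.population_type IS NULL AND pt.id IS NULL
    ORDER BY pp.id
""").fetchall()
```

The LEFT JOIN keeps every population and leaves `pt.id` as NULL where no trial refers to it, which is the same information as comparing against the `SELECT DISTINCT(plant_population_id)` list by hand.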

teatree1212 commented 8 years ago

In conjunction with this: SELECT DISTINCT(plant_population_id) FROM Plant_Population_Lists;

A little conclusion: most nil populations are not valid populations. BraVCS3M_01 is, but it doesn't seem to be connected (relevant?) to anything in the database.

teatree1212 commented 8 years ago

I have to think a bit more about the Population submission. I think what I have talked about are overarching population "Categories", not "types" (whatever this means in the database).

Nuanda commented 8 years ago

Perhaps I was careless to... 'disturb' this discussion in its slumber ;). Yes, let it sleep for some time, while we focus on finishing the trial submission.

teatree1212 commented 8 years ago

Note: for population submission, establishing_organisation needs to be made compulsory so that the metadata meet the DOI assignment criteria.

teatree1212 commented 8 years ago

Can you check whether there is a rule that there is only one species allowed to be associated with one plant population? This seems to be the case in the wizard trial submission, where you specify the species before adding the lines, which inhibits people from assigning different species to different lines. If this is just the case in the wizard submission, which needs to be altered anyways, that’s fine. If it is however a funny rule, then let’s change it. It is not important for this population submission but important for the next one, as it contains many different species.

Nuanda commented 8 years ago

I understand (see my comment posted on Jun 10 in this thread) that the Taxonomy Term (in step 2) only makes sense for some types of Plant Population submission, and does not make sense for other types of submission.

When posting new Plant Lines you are able to pick a Taxonomy Term per each line - so you are able to assign different species to different lines.

Will making the 2nd step Taxonomy Term field nonobligatory solve the most immediate issue you have with the incoming submission?

teatree1212 commented 8 years ago

This issue will not arise with this submission, as all samples/measurements are of the same species, but it is a general issue and it would be good if we reflected that logic already in our population submission as it stands now. Having it non-obligatory solves this issue for now.

teatree1212 commented 8 years ago

In preparation for any future population submission developments according to the next contract, this is a draft template: population_submission_template.xlsx. The current Ruby client includes all these elements except the year_produced field, which I will add today.

Nuanda commented 8 years ago

So, the idea is to let the user submit plant lines (together with related plant accessions and plant varieties) in the form of a CSV sheet, in the 3rd step of the Plant Population submission wizard? And there will be a choice to either use the manual forms to define new plant lines or to submit via a file?

Nuanda commented 7 years ago

@teatree1212 I guess this is the best place to discuss it further. Seeing the population_submission_template which you propose, there is only one field (apart from relations, i.e. the foreign keys) pertaining to the PlantLine record:

Plant Line - which, I assume, maps to plant_lines.plant_line_name in the DB

The current manual (form-based) new Plant Line wizard in the Plant Population submission step 3 allows for a bit more data to be given for a new Plant Line: Common name, Previous line name, Genetic status, Sequence identifier. All of them optional. Should the CSV template also include these optional columns to be provided by the user at her discretion?

teatree1212 commented 7 years ago

Thanks for double checking. Yes to these three optional columns. But yes, they need to remain optional.

Nuanda commented 7 years ago

@teatree1212 I also understand that:

a user may supply an existing PV name in the Variety name column in the template, and the newly created PL will be related to this existing PV
a user may supply a new Variety name, and then, along with the specified Crop type, it will be used to create a new PV record, to be later related to the new PL record
a user may ignore this, since according to the current DB model, PL -> PV relation is not mandatory.

Is the above correct?

teatree1212 commented 7 years ago

yes, correct.
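For reference, the three confirmed cases for the Variety name column can be sketched as a single decision function. The function name `resolve_variety`, the dictionary standing in for the plant varieties table and the field names in the returned payload are all hypothetical; this is an illustration of the rule, not the wizard's actual code.

```python
def resolve_variety(variety_name, crop_type, existing_varieties):
    """Decide what to do with the Variety name of one template row.

    existing_varieties maps variety name -> record id (a stand-in for the DB).
    Returns (action, payload): ("none", None), ("link", id) or ("create", {...}).
    """
    name = (variety_name or "").strip()
    if not name:
        # The PL -> PV relation is not mandatory, so a blank cell is accepted.
        return ("none", None)
    if name in existing_varieties:
        # Existing PV: the new PL is related to it.
        return ("link", existing_varieties[name])
    # New PV: created from the name plus the Crop type, if one was given.
    crop = (crop_type or "").strip() or None
    return ("create", {"plant_variety_name": name, "crop_type": crop})


# Invented example data.
known = {"Variety A": 7}
linked = resolve_variety("Variety A", "", known)
created = resolve_variety("Variety B", "Oilseed rape", known)
skipped = resolve_variety("  ", "Oilseed rape", known)
```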

Nuanda commented 7 years ago

Is there a consensus re. what to do with PAs?

1) Should we create a new PA for each PL?
2) Should we require PA information for every PL uploaded to the BIP?
3) What about the case where one uses existing PA data (the plant_accession-originating_organisation pair)? I propose to reject such PLs from the submission.

teatree1212 commented 7 years ago

1) No. In the submission template, the PA will be optional. There can be multiple PAs per PL. PAs will then be submitted through the trial submission; if they are not connected to existing PVs, they will always have to refer to existing PLs.

2) No, we shouldn’t require such information in the Population submission (see above).

3) Scenario: someone submits a PL and PA data. The PA data is already present in the database.
a) This information must either already be associated with the PL that is being submitted (the PL already exists in the database).
b) Or the user is trying to submit the PA data with a wrong PL, and a) is actually the correct PL-PA combination.
c) Or the PA data was previously submitted with a PV during trial submission. (Check for that?)
d) Or the user has indeed conducted yet another experiment with some leftovers of the same seed batch (PA data), which carry exactly the information that is already submitted but were used in a different experiment.

I think it is not as straightforward as simply rejecting these PAs. But testing for all of scenarios a-d is not straightforward either, and it is time-consuming, as it means that the user will be asked to double-check information as well.

BUT

Thinking about what Thomas wants and how I think he is looking at this experimental population: what he sees is the full record, the complete information an experimentalist keeps. The database can create this full record once both population and trial information are submitted. Double-submitting the “green” information (the PA-related fields with green headers in the ppt) during both population and trial submission can cause mistakes. I think we should only use the yellow and the grey fields, with only plant variety and plant line made compulsory in the spreadsheet. Once the population is submitted, the information connecting the Population and Plant Trial submissions will be the population name and the PLs or PVs. As the user submits the PP name, the database only has to search the list of these PLs or PAs during the trial submission to establish connections, verify their spelling and check availability.

Even though this is quite different from the previously described approach, this might be the best solution. I think that what Thomas really wants is that the user-accessible PP information displayed via the database's “interface” contains these three additional columns. What do you think of this solution?

As a reminder: slide 5 of the ppt presentation I recently sent says “Gray fields: We assume that these fields are absolutely necessary for submitting sufficient information on an experimental plant population.” Thomas Alcock commented that Crop type shouldn’t be necessary. So only the fields species, plant variety and plant line should be made compulsory.

Nuanda commented 7 years ago

OK. I'll add optional PA creation during CSV-submission of new PLs for a population. Existing PLs could still be added by the input form. PT submission will stay as it is, so all the options you mention should be available to the user. For now, we will block re-use of existing PAs for new PLs.

Crop type is optional. So the only things required are PL name and Species (== taxonomy term).
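As a rough illustration of these rules for one CSV row, the sketch below treats PL name and Species as required and blocks re-use of an existing PA for a new PL. The field names (`plant_line`, `species`, `plant_accession`, `originating_organisation`) and the function `check_row` are hypothetical, and the requirement that both halves of the PA pair are given together is my assumption, not something decided in this thread.

```python
def check_row(row, existing_accessions):
    """Return a list of problems for one CSV row (an empty list means acceptable).

    existing_accessions is a set of (plant_accession, originating_organisation)
    pairs already in the database (a stand-in for a real lookup).
    """
    problems = []
    # Only PL name and Species (== taxonomy term) are required.
    for field in ("plant_line", "species"):
        if not (row.get(field) or "").strip():
            problems.append(f"{field} is required")

    accession = (row.get("plant_accession") or "").strip()
    organisation = (row.get("originating_organisation") or "").strip()
    if accession or organisation:
        if not (accession and organisation):
            # Assumption: a PA is identified by the pair, so both are needed.
            problems.append("plant_accession and originating_organisation must be given together")
        elif (accession, organisation) in existing_accessions:
            problems.append("plant accession already exists; re-use is blocked for new plant lines")
    return problems


# Invented example data.
existing = {("ACC-1", "Org A")}
ok = check_row({"plant_line": "L1", "species": "Brassica napus"}, existing)
reused = check_row(
    {"plant_line": "L2", "species": "Brassica napus",
     "plant_accession": "ACC-1", "originating_organisation": "Org A"},
    existing,
)
```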

Regarding two of the additional optional fields, Common name and Previous line name: would you be able to provide a one-sentence explanation of the semantics of each, which I can use in the manual of the submission wizard?

Or maybe the definitions for these attributes given in the API docs https://bip.earlham.ac.uk/api_documentation#plant-line are correct and sufficient?

Nuanda commented 7 years ago

@teatree1212 Most of the work described here is done. Regarding the population_type diversification - see also my comment:

https://github.com/TGAC/brassica/issues/494#issuecomment-225167754

Now, all is possible since the user may decide to select a TT for the submitted PP, select both parents for the PP, and submit selected TT (called "Species" in the template) for every new PL associated with the PP. If you want to have the PP submission wizard more "specialised" depending on the population_type selected in step01, please tell me what else we should implement re. the taxonomy.

teatree1212 commented 7 years ago

let me summarise all this for myself and talk to an experimentalist. sorry for the wait

Nuanda commented 7 years ago

Sent by @wjurkowski: https://docs.google.com/document/d/1nNh79SEr7qQaRlCkxelML7vsSyWwVBVfdnvQoYampyA/edit
