cultivar names_curated

teatree1212 commented 8 years ago

I have a list of curated cultivar names which are already in the BIP or will soon be inserted. Many have not come up in the registered repositories in #447, so this may be a general problem when checking the spelling of unavailable cultivar names in the future.

I am still waiting for feedback about some names, but will add the list of names here in early April.

teatree1212 commented 8 years ago

Sent the file to Tomasz. The tab "names curated_updated", with the columns cultivar_name, curated name, genetic status, and genetic status 2, is of interest to you.

Nuanda commented 8 years ago

Thank you for the file. I understand this is essentially an inter-project nomenclature reconciliation; now we need to consider how to apply it to BIP. Do you propose to take the value of column B (in the first tab of the Google doc), search for it (with some "heuristics") in PlantVariety.plant_variety_name, and, if found, replace the name, saving the older name in a new column (called, for instance, 'original_cs_variety_name')?

Regarding the genetic status, in the CS data model, this is a PlantLine's property. For instance, taking cultivar/PlantVariety name 'Jet neuf' I find three PlantLines of that variety (in BIP):

JET NEUF DH1, DH
NLD037_CGN07227
DEU271_CR_03041

None of them is set as S1 (one is DH, the other two have no genetic status listed). Should we do anything about that?

teatree1212 commented 8 years ago

The list in column B contains not only Crop Store names but also names from various projects, some already inside BIP and some not yet. Therefore 'original_cs_variety_name' could instead be called 'Synonym', hosting all synonyms that occur.

Your example seems odd to me. From my understanding of the definitions of Cultivar and Line, there can only be one Line associated with a Variety. I will ask someone about this.

Nuanda commented 8 years ago

@nowakowski Piotr, while we wait for @teatree1212 to consult on the case raised above, could you please look at the cultivar names curated (from various projects) by Annemarie and see if you can upload them (with a single rake task and/or migration) to BIP, while saving old BIP names (and the ones provided by Annemarie as synonyms), if present, in a new column called synonym? In the case of multiple alternative names, a comma-separated list would be sufficient, I think.

Tell us if you find any difficulties with that.

nowakowski commented 8 years ago

@Nuanda Just to clarify - you want them registered as new PlantVariety objects?

Nuanda commented 8 years ago

Yes, but only when not present. When present with a synonym, update the name and save the synonym. When present under the name listed by Annemarie as the valid one, do nothing.
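The three rules above can be sketched in plain Ruby as follows. This is only an illustration under stated assumptions: `Variety`, `apply_curated_row`, and the hash-based `registry` are hypothetical stand-ins for the real PlantVariety model, and name matching is assumed to be exact.

```ruby
# Hypothetical in-memory stand-in for a PlantVariety record; the real model
# is a database-backed class with many more attributes.
Variety = Struct.new(:name, :synonyms)

# Apply one curated row (valid name plus its listed synonyms) to `registry`,
# a Hash that maps the current variety name to its record.
def apply_curated_row(registry, valid_name, synonyms)
  # Already present under the valid name: do nothing.
  return :unchanged if registry.key?(valid_name)

  # Present under a synonym: update the name and save the synonyms.
  old_name = synonyms.find { |s| registry.key?(s) }
  if old_name
    variety = registry.delete(old_name)
    variety.name = valid_name
    variety.synonyms = (variety.synonyms + synonyms).uniq
    registry[valid_name] = variety
    return :renamed
  end

  # Not present at all: register a new variety.
  registry[valid_name] = Variety.new(valid_name, synonyms.dup)
  :created
end
```

If the synonyms end up in a single text column, as suggested earlier in the thread, `variety.synonyms.join(', ')` would produce the comma-separated value.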

nowakowski commented 8 years ago

@teatree1212 I see "brocoli" is consistently spelled with a single 'c' - should I fix the misspelling or register the varieties as listed?

teatree1212 commented 8 years ago

Yes, please correct the misspelling. Thanks!

nowakowski commented 8 years ago

I have prepared a provisional data curation task in my branch (see lib/curate_plant_varieties.rake); however, the results are less than satisfactory. The task performs the following operations:

1. Scan PlantLine objects and attempt to identify those for which the plant_line_name attribute matches either column in the cultivars file provided by @teatree1212. Whenever a match is found, exchange the PlantVariety object linked to the given PlantLine for a new, "proper" PlantVariety object, which is spawned as needed.
2. Scan all remaining PlantVariety objects and attempt to identify those whose plant_variety_name matches the "synonym" value provided by @teatree1212. For each match, rewrite the corresponding PlantVariety record, replacing the synonym with a proper name and keeping the old name as a synonym.
3. Expunge all "orphaned" PlantVariety records, i.e. all records for which no more PlantLine objects exist (this is a clean-up task necessitated by the procedure described in step (1)).
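As an illustration of the first operation only, here is a minimal plain-Ruby sketch. It is not the code from lib/curate_plant_varieties.rake: the hash keys, the `relink_plant_lines` name, and the case-insensitive exact match are all assumptions made for the example.

```ruby
# plant_lines:  array of hashes with a :plant_line_name key (hypothetical shape)
# curated_rows: array of hashes with :name and :synonym keys, one per file row
# Returns the list of newly spawned varieties and the number of matched lines.
def relink_plant_lines(plant_lines, curated_rows)
  # Map every lower-cased name or synonym to the curated ("proper") name.
  lookup = {}
  curated_rows.each do |row|
    [row[:name], row[:synonym]].compact.each { |n| lookup[n.downcase] = row[:name] }
  end

  varieties = {}
  matched = 0
  plant_lines.each do |line|
    proper = lookup[line[:plant_line_name].to_s.downcase]
    next unless proper

    # Spawn the proper variety only when first needed, then link the line to it.
    varieties[proper] ||= { name: proper }
    line[:variety] = varieties[proper]
    matched += 1
  end
  [varieties.values, matched]
end
```

Lines that match nothing in the file are left untouched, which is why the match count matters when judging how much of the data such a pass can curate.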

There are several reasons why this does not accomplish the intended task of curating PlantVarieties:

First and foremost, the vast majority of PlantLine objects in the CropStore DB are not assigned to any PlantVariety. My task is only able to identify the correct PlantVariety for a minuscule fraction of these objects (39 of 23435).
Of the 3261 PlantVariety objects originally present in the DB, most fail the CSV lookup phase. Only approximately 450 are found to correspond to one of the rows in @teatree1212's Excel file, hence we end up with 3738 PlantVariety objects, of which approximately 85% are old (uncurated) records. Note that this may simply be a consequence of the Excel file omitting PlantVarieties for which "nothing is to be done"; however, this hypothesis is implausible given that, in many cases, the name listed in the Excel file is identical to the synonym.

I guess I will reflag this as a question: if we decide that this is the most we can hope for, I'll run my task on the production DB; otherwise I'm open to fresh ideas.

Nuanda commented 8 years ago

@teatree1212 We would like to hear your opinion regarding the algorithm proposed by @nowakowski.

teatree1212 commented 8 years ago

I had a look at the Variety names in the BIP and think some serious curation is needed. This looks a bit like Gülzower Ölquell, but I have no idea why it is in there in so many forms. (Screenshot: 2016-06-15 at 13:52:52)

These seem to be accessions, but where from? (Screenshot: 2016-06-15 at 13:53:04)

These look more like lines from a crossing/mutagenised population, or maybe also accessions following a standard I don't know. (Screenshot: 2016-06-15 at 13:53:50)

And this makes me sad. (Screenshot: 2016-06-15 at 13:54:05)

teatree1212 commented 8 years ago

So I don't think you can do anything else right now, Piotr. I will record it as a major curation issue. The only thing you could realistically have picked up is the first issue: the Ölquell varieties, at least, can be picked up, depending on whether you align entire words or only fragments of them with the corrected Variety.

teatree1212 commented 8 years ago

What you can do is double-check against the cultivar name repositories which I mentioned earlier in #447.

Actually, what could be possible for now is to check for spelling errors against these repositories and my list, and ignore the older legacy data.
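One way such a spelling check could look is sketched below, using edit distance against a list of reference names. The function names and the `max_distance` threshold are assumptions for illustration; the thread does not say which matching method was used, and the reference list would have to come from the repositories in #447 and the curated list.

```ruby
# Levenshtein (edit) distance between two strings, computed row by row.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # Cheapest of: deletion, insertion, substitution (or match).
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Return the reference name closest to `name` (ignoring case), or nil when
# nothing is within `max_distance` edits.
def suggest_spelling(name, reference_names, max_distance: 2)
  key = name.downcase
  best = reference_names.min_by { |ref| edit_distance(key, ref.downcase) }
  return nil if best.nil?

  edit_distance(key, best.downcase) <= max_distance ? best : nil
end
```

For example, "brocoli" is one edit away from "broccoli", so a reference list containing the correct spelling would flag it. A small threshold keeps unrelated names from being "corrected" into each other, but any suggestion should still be reviewed by a curator.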

nowakowski commented 8 years ago

Hello @teatree1212

I've handled the misspelled Gülzower Ölquell varieties as a special case (there are only three of them). The problems you identified are associated with string encoding.

Regarding the rest of the data, I have a suggestion: perhaps we could delete all existing PlantVarieties, then create new PlantVarieties based solely on the list you provided, and link existing PlantLines to those new objects (whenever a matching PlantVariety can be found by analyzing the PlantLine.plant_variety_name field)? There is a risk that some information could be lost in the process (out of 25690 PlantLines currently in the DB, 2255 are related to an existing PlantVariety, and for some objects this relation would disappear); however, the advantages (i.e. getting rid of all the garbage) may outweigh the costs. Note that new PlantVariety objects could be added at a later date, as appropriate.

teatree1212 commented 8 years ago

I just saw this and will make a note of it. I will get back to you with some thoughts asap, @nowakowski. From a quick read, it makes sense to me. I read somewhere that the old CS was used as a curation database, so things would have been entered just for the sake of having them recorded. However, if there is no further information associated with a record, such as a useful relation between a line and a variety that has value for analysis, we might as well get rid of it.

Nuanda commented 8 years ago

@teatree1212 Annemarie, could it be that solving this issue will also help you with the varieties you were uploading to BIP in #595? Are these varieties in the list that @nowakowski is trying to upload to BIP DB?

teatree1212 commented 8 years ago

When submitting single variety names, I got the response that they were already in the database if they were. I was hoping that if I submitted all the varieties used in that trial, some would be returned as already present and the others would be inserted. I now understand that I have to loop through my variety submissions.

If these varieties are not in the database, maybe it is best to start "clean" and get rid of all the unrelated varieties. People who want to submit data are also interested in submitting the related varieties, so the table will then be populated with the (hopefully correct) varieties, always connected to lines.

@nowakowski Can you not identify the varieties that are related to PlantLines and exempt them from the deletion? And with that list and the list I made we have a good starting point for the variety table.

nowakowski commented 8 years ago

This could work. I'll do a dry run and report the results.

nowakowski commented 8 years ago

Okay, seems things are looking up:

All done. 1775 plant varieties now present in DB (1248 existing and 527 imported). 2302 plant lines (of 25690) have an assigned plant variety.

In place of the 3900+ uncurated plant varieties, we now have 527 "new" varieties (from the export file) along with 1248 "old" varieties which I left alone because they each have at least one plant line assigned and do not match any of the "new" records (i.e. their names do not correspond to the names or synonyms parsed from the export file).
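The selection rule described above (keep an "old" variety only if it has at least one plant line and matches no imported name or synonym) can be sketched as follows. This is a plain-Ruby illustration with hypothetical data shapes, not the actual task; it also leaves out relinking plant lines from a dropped "old" variety to its imported replacement.

```ruby
require 'set'

# old_varieties: array of hashes { id:, name: }                  (hypothetical)
# plant_lines:   array of hashes { plant_variety_id: }, nil when unassigned
# curated:       array of hashes { name:, synonyms: [...] } from the export file
def rebuild_varieties(old_varieties, plant_lines, curated)
  # IDs of varieties that have at least one plant line assigned.
  linked_ids = plant_lines.map { |l| l[:plant_variety_id] }.compact.to_set

  # Every imported name and synonym, lower-cased for comparison.
  curated_names = curated.flat_map { |c| [c[:name]] + c[:synonyms] }
                         .map(&:downcase).to_set

  kept = old_varieties.select do |v|
    linked_ids.include?(v[:id]) && !curated_names.include?(v[:name].downcase)
  end
  imported = curated.map { |c| { name: c[:name], synonyms: c[:synonyms].join(', ') } }

  { kept: kept, imported: imported }
end
```

With the figures reported in this comment, `kept` would correspond to the 1248 "old" varieties and `imported` to the 527 "new" ones.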

While this is certainly an improvement, our task is not yet done. Note that a plant_variety record is not merely a name and a collection of synonyms. In order for the "new" (imported) varieties to become usable we will need a more comprehensive set of attributes for each of these records. As a minimum, I feel the following should be provided:

country of origin (can be multiple countries)
country registered (can be multiple countries)
male parent (currently a string value - perhaps this should become a PV->PV self-reference)
female parent (see above)

This is merely a small subset of PlantVariety attributes. I can produce a full list (alternatively, you can connect to the DB from the psql console and run \d plant_varieties). @teatree1212, do you think it would be possible to assemble a more detailed spreadsheet?

Nuanda commented 8 years ago

@teatree1212 Please tell @nowakowski if you are able to provide more data for the PVs or if he should go ahead and register just names and synonyms.

teatree1212 commented 8 years ago

I may be able to provide you with more information, @nowakowski. I haven't forgotten about this, but it is not very high on my to-do list at the moment.

Nuanda commented 6 years ago

https://docs.google.com/spreadsheets/d/1miqSXYnP3TycsGxviGy6jghWFS1gxAGg8DKuACsgBSA/edit?usp=sharing

TGAC / brassica

cultivar names_curated #448