Open teatree1212 opened 8 years ago
Sent the file to Tomasz. The tab named curated_updated, with the columns cultivar_name, curated name, genetic status and genetic status 2, is of interest to you.
Thank you for the file. I understand this is rather an inter-project nomenclature reconciliation; now we need to consider how to apply it to BIP. Do you propose to take each value from column B (in the first tab of the Google doc), search for it (with some "heuristics") in PlantVariety.plant_variety_name and, if found, replace the name, saving the older name in a new column (called, for instance, 'original_cs_variety_name')?
Regarding the genetic status, in the CS data model, this is a PlantLine's property. For instance, taking cultivar/PlantVariety name 'Jet neuf' I find three PlantLines of that variety (in BIP):
None of them is set as S1 (there is one DH; the other two have no genetic status listed). Should we do anything about that?
The list in column B contains not only Crop Store names but also names from various projects, some inside and some not yet inside BIP. Therefore the 'original_cs_variety_name' column could be called 'Synonym', hosting all synonyms that occur.
What your example states is odd to me. From my understanding of the definition of Cultivar and Line, there can only be one Line associated with a Variety. I will ask someone about this.
@nowakowski Piotr, while we wait for @teatree1212 to consult on the case brought up above, could you please look at the cultivar names curated (from various projects) by Annemarie and see if you can upload them (with a single rake task and/or migration) to BIP, while saving old BIP names (and the ones provided by Annemarie as synonyms), if present, in a new column called synonym? In case of multiple alternative names, a comma-separated list would be sufficient, I think.
Tell us if you find any difficulties with that.
@Nuanda Just to clarify - you want them registered as new PlantVariety objects?
Yes, but only when not present. When present with a synonym, update the name and save the synonym. When present under the name listed by Annemarie as the valid one, do nothing.
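The three cases above (already present under the valid name, present under a synonym, not present at all) can be sketched in plain Ruby. This is only an illustration under stated assumptions: the Variety struct, the apply_curated_row helper and the in-memory array are hypothetical stand-ins for the plant_varieties table, not the actual BIP code, and the variety names are placeholders.

```ruby
# Hypothetical in-memory stand-in for a plant_varieties row.
Variety = Struct.new(:name, :synonyms)

# Apply one curated row (the valid name plus its known synonyms) to the store.
# Returns :unchanged, :renamed or :created so a caller can tally the outcome.
def apply_curated_row(store, valid_name, synonyms)
  # Case 1: already present under the valid name, so do nothing.
  return :unchanged if store.any? { |v| v.name == valid_name }

  existing = store.find { |v| synonyms.include?(v.name) }
  if existing
    # Case 2: present under a synonym. Rename it and keep the old name
    # (and the other curated alternatives) as synonyms.
    existing.synonyms = (existing.synonyms + [existing.name] + synonyms).uniq - [valid_name]
    existing.name = valid_name
    :renamed
  else
    # Case 3: not present at all, so register a new variety.
    store << Variety.new(valid_name, synonyms.uniq - [valid_name])
    :created
  end
end

store = [Variety.new("Old name", [])]
apply_curated_row(store, "Valid name", ["Old name", "Other name"])
# The synonym column could then be written as a comma-separated list:
puts store.first.synonyms.join(", ")
```

Returning a symbol per row makes it easy to report how many varieties were renamed, created or left alone in a single run.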
@teatree1212 I see "brocoli" is consistently spelled with a single 'c' - should I fix the misspelling or register the varieties as listed?
Yes, you should correct the misspelling. Thanks!
I have prepared a provisional data curation task in my branch (see lib/curate_plant_varieties.rake); however, the results are less than satisfactory. The task performs the following operations:
1. Iterate over all PlantLine objects and attempt to identify those for which the plant_line_name attribute matches either column in the cultivars file provided by @teatree1212. Whenever a match is found, exchange the PlantVariety object linked to the given PlantLine for a new, "proper" PlantVariety object, which is spawned as needed.
2. Iterate over all PlantVariety objects and attempt to identify those whose plant_variety_name matches the "synonym" value provided by @teatree1212. For each match, rewrite the corresponding PlantVariety record, replacing the synonym with a proper name and keeping the old name as a synonym.
3. Delete orphaned PlantVariety records, i.e. all records for which no more PlantLine objects exist (this is a clean-up task necessitated by the procedure described in step (1)).
There are several reasons why this does not accomplish the intended task of curating PlantVarieties:
1. Most PlantLine objects in the CropStore DB are not assigned to any PlantVariety. My task is only able to identify the correct PlantVariety for a minuscule fraction of these objects (39 of 23435).
2. Of the PlantVariety objects originally present in the DB, most fail the CSV lookup phase. Only approximately 450 are found to correspond to one of the rows in @teatree1212's Excel file - hence we end up with 3738 PlantVariety objects, of which approximately 85% are old (uncurated) records. Note that this may simply be a consequence of the fact that the Excel file omits PlantVarieties for which "nothing is to be done" -- however, this hypothesis is implausible given the fact that, in many cases, the name listed in the Excel file is identical to the synonym.
I guess I will reflag this as a question: if we decide that this is the most we can hope for, I'll run my task on the production DB; otherwise I'm open to fresh ideas.
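The relinking and clean-up operations described above can be sketched roughly as follows. This is not the code from the rake task: Variety, Line, relink_lines and drop_orphans are hypothetical names, the records are plain structs rather than ActiveRecord models, and the curated lookup is assumed to be a hash mapping every name found in either column of the cultivars file to its valid name.

```ruby
# Hypothetical stand-ins for PlantVariety and PlantLine records.
Variety = Struct.new(:name, :synonyms)
Line = Struct.new(:name, :variety)

# Relink each plant line whose name appears in the curated lookup to the
# "proper" variety, creating that variety when it does not exist yet.
def relink_lines(lines, varieties, curated)
  lines.each do |line|
    valid = curated[line.name]
    next unless valid # no match in the cultivars file: leave the line alone

    target = varieties.find { |v| v.name == valid }
    unless target
      target = Variety.new(valid, [])
      varieties << target
    end
    line.variety = target
  end
end

# Clean-up: return only the varieties still referenced by at least one line.
# Identity (equal?) is used because Struct#== compares by value.
def drop_orphans(varieties, lines)
  used = lines.map(&:variety).compact
  varieties.select { |v| used.any? { |u| u.equal?(v) } }
end
```

Because drop_orphans returns a new array instead of deleting in place, the same code can be used for a dry run before anything is removed from the database.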
@teatree1212 We would like to hear your opinion regarding the algorithm proposed by @nowakowski.
I had a look at the Variety names in the BIP and think there is some serious curation needed. This looks a bit like Gülzower Ölquell, but why it is in there in so many ways, no idea.
These seem to be accessions, but where from?
These look more like lines from a crossing/mutagenised population, or maybe also accessions following a standard I don't know.
And this makes me sad
So I don't think you can do anything else right now, Piotr. I will record it as a major curation issue. The only issue you could realistically have picked up is the first one: the Ölquell varieties can at least be caught, depending on whether you align entire words or fractions of them with the corrected variety name.
What you can do is double-check against the cultivar name repositories which I mentioned earlier in #447.
Actually, what could be possible for now is to check for spelling errors against these repositories and my list, and to ignore the older legacy data.
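One way such a spelling check could work is to compare each name against the reference list using edit distance. The sketch below is an assumption about how this might be done, not something the thread specifies: edit_distance and suggest are hypothetical helpers, the threshold of 2 is arbitrary, and the reference list would in practice come from the curated list or the repositories.

```ruby
# Levenshtein edit distance between two strings, computed row by row.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Return the closest reference name if it is within max_distance edits,
# otherwise nil. Names are NFC-normalised and downcased before comparison.
def suggest(name, reference_names, max_distance: 2)
  normalise = ->(s) { s.unicode_normalize(:nfc).downcase }
  key = normalise.call(name)
  best = reference_names.min_by { |ref| edit_distance(key, normalise.call(ref)) }
  return nil if best.nil?

  edit_distance(key, normalise.call(best)) <= max_distance ? best : nil
end

puts suggest("brocoli", ["broccoli", "cabbage"])
puts suggest("Gulzower Olquell", ["Gülzower Ölquell"])
```

A small threshold keeps false positives down, but accession-like codes that differ by a single digit would still be flagged as near matches, so any suggestion should be reviewed by hand rather than applied automatically.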
Hello @teatree1212
I've handled the misspelled Gülzower Ölquell varieties as a special case (there are only three of them). The problems you identified are associated with string encoding.
Regarding the rest of the data - I have a suggestion: perhaps we could delete all existing PlantVarieties, then create new PlantVarieties based solely on the list you provided, and link existing PlantLines to those new objects (whenever a matching PlantVariety can be found by analyzing the PlantLine.plant_variety_name field)? There is a risk that some information could be lost in the process (out of 25690 PlantLines currently in the DB, 2255 are related to an existing PlantVariety - and for some objects this relation would disappear); however, the advantages (i.e. getting rid of all the garbage) may outweigh the costs. Note that new PlantVariety objects could be added at a later date, as appropriate.
I just saw this and will make a note of it. Will get back to you with some thoughts asap, @nowakowski. Quickly reading over it, it makes sense to me. I read somewhere that the old CS was used as a curation database, so things would have been entered just for the sake of having them recorded. However, if there is no further information associated with it, such as a useful interaction between a line and a variety, which has value when doing analysis, we might as well get rid of it.
@teatree1212 Annemarie, could it be that solving this issue will also help you with the varieties you were uploading to BIP in #595? Are these varieties in the list that @nowakowski is trying to upload to BIP DB?
When submitting single variety names, I got the response that they were already in the database if they were. I was hoping that if I submitted all the varieties used in that trial, some would get returned as already present, but the others would get inserted. I now understand that I have to loop through my variety submissions.
If these Varieties are not in the database, maybe it is best to start "clean" and get rid of all the unrelated Varieties. People who want to submit data are also interested in submitting the related varieties, so the table will then be populated with the -hopefully correct- varieties, always connected to lines.
@nowakowski Can you not identify the varieties that are related to PlantLines and exempt them from the deletion? And with that list and the list I made we have a good starting point for the variety table.
This could work. I'll do a dry run and report the results.
Okay, seems things are looking up:
All done. 1775 plant varieties now present in DB (1248 existing and 527 imported). 2302 plant lines (of 25690) have an assigned plant variety.
In place of the 3900+ uncurated plant varieties, we now have 527 "new" varieties (from the export file) along with 1248 "old" varieties which I left alone because they each have at least one plant line assigned and do not match any of the "new" records (i.e. their names do not correspond to the names or synonyms parsed from the export file).
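The selection rule described here can be expressed as a dry-run plan that only reports what would be kept, deleted and imported. This is a sketch under assumptions, not the actual task: plan_cleanup is a hypothetical helper, varieties are represented by their names only, and the curated data is assumed to be a hash from each valid name to its list of synonyms.

```ruby
require "set"

# Build a dry-run plan without touching any data.
#   existing_names   - names of all varieties currently in the DB
#   names_with_lines - names of varieties that have at least one plant line
#   curated          - hash of valid name => array of synonyms (from the export file)
def plan_cleanup(existing_names, names_with_lines, curated)
  curated_keys = curated.flat_map { |valid, syns| [valid, *syns] }.to_set
  with_lines = names_with_lines.to_set

  # An "old" variety survives only if it has lines and does not match
  # any curated name or synonym.
  kept = existing_names.select { |n| with_lines.include?(n) && !curated_keys.include?(n) }

  { kept: kept, deleted: existing_names - kept, imported: curated.keys }
end

plan = plan_cleanup(
  ["A", "B", "C"],       # placeholder variety names
  ["A", "C"],
  { "Cee" => ["C"] }
)
puts "#{plan[:kept].size + plan[:imported].size} plant varieties would be present " \
     "(#{plan[:kept].size} existing and #{plan[:imported].size} imported)."
```

With the figures reported above, the same arithmetic gives 1248 existing plus 527 imported, i.e. 1775 varieties in total.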
While this is certainly an improvement, our task is not yet done. Note that a plant_variety record is not merely a name and a collection of synonyms. In order for the "new" (imported) varieties to become usable we will need a more comprehensive set of attributes for each of these records. As a minimum, I feel the following should be provided:
- country of origin (can be multiple countries)
- country registered (can be multiple countries)
- male parent (currently a string value - perhaps this should become a PV->PV self-reference)
- female parent (see above)
This is merely a small subset of PlantVariety attributes - I can produce a full list (alternatively, you can connect to the DB from the psql console and run \d plant_varieties). @teatree1212 - do you think it would be possible to assemble a more detailed spreadsheet?
@teatree1212 Please tell @nowakowski if you are able to provide more data for the PVs or if he should go ahead and register just names and synonyms.
I may be able to provide you with more information, @nowakowski. I haven't forgotten about this, but it is not very far up on my to-do list atm.
I have a list of curated cultivar names which are already in the BIP or are soon to be inserted. Many have not come up in the registered repositories in #447. So this may generally be a problem when checking the spelling of unavailable cultivar names in the future.
I am still waiting for feedback about some names, but will add the list of names here in early April.