Closed · cymon closed this 4 weeks ago
@melinalou
In general I applaud the strategy towards meaningless identifiers, decoupling them from the actual expectations and meaning, which can easily be found one level deeper in whatever corresponding field or column. It typically also helps avoid errors.
However, 'uniqueness' is not the only concern here. We need to also cover 'persistence'.
Persistence is about this: "Once the key is generated it would indeed, but not only, need to be used throughout other EMO BON procedures; additionally it should never be replaced (by an easily generated new uuid) or lost". Any "natural key" should always lead to the same uuid, so that it is an assured, long-living testament representing what it should be representing. I think there is very little guarantee inside Google Sheets for this? Which would require us to go look for such a guarantee in an identifiable "natural key" in the data, bringing us essentially back to where we are?
So to conclude:
I think we cannot
* trust that our flexible Google Sheets will not regenerate certain UUIDs, making them no longer persistent (or making us forget the earlier ones)?
* ignore or prevent people potentially making errors in data entry (abusing that same spreadsheet flexibility) -- we can only have automated QC warn about it and exclude the affected rows from further processing
I think we can
* keep having natural keys as the basis for detecting if anyone goofed up
* also introduce meaningless identifiers to be used further down the chain, i.e. keep them outside the Google spreadsheet, but governed by a script that can't be fooled about previously generated UUIDs
Sigh, we might be facing https://en.wikipedia.org/wiki/The_law_of_conservation_of_misery
However, 'uniqueness' is not the only concern here. We need to also cover 'persistence'.
Cymon: Persistence is about this: "Once the key is generated it would indeed, but not only, need to be used throughout other EMO BON procedures;
Marc: additionally it should never be replaced (by an easily generated new uuid) or lost". Agreed, were this to happen then the system crumbles - but the same applies, and to a greater extent, to a natural key like source_mat_id where the fields making up the key can change, thereby changing the key.
Marc: Any "natural key" should always lead to the same uuid, so that it is an assured, long-living testament representing what it should be representing. I think there is very little guarantee inside Google Sheets for this? Which would require us to go look for such a guarantee in an identifiable "natural key" in the data, bringing us essentially back to where we are?
I guess my point is that if we have UUIDs we don't need and should not use a natural key: they are superfluous and create confusion if used in place of the UUID when the natural key may not be unique.
So to conclude:
I think we cannot
* trust that our flexible Google Sheets will not regenerate certain UUIDs, making them no longer persistent (or making us forget the earlier ones)?
If that is the case, then yes, the UUID would have all the same problems as the natural key (source_mat_id) and be completely pointless in a Google Spreadsheet. That people do use them in GS suggests to me that isn't the case: but we need to check.
* ignore or prevent people potentially making errors in data entry (abusing that same spreadsheet flexibility) -- we can only have automated QC warn about it and exclude the affected rows from further processing
The use of UUIDs isn't intended to prevent errors; no identifier can.
I think we can
* keep having natural keys as the basis for detecting if anyone goofed up
So you want to use a potentially non-unique key to check that the unique key is unique? (I kid, just not understanding your point)
* also introduce meaningless identifiers to be used further down the chain, i.e. keep them outside the Google spreadsheet, but governed by a script that can't be fooled about previously generated UUIDs
Now this is a possibility: we can just layer on the UUIDs during QC/validation and use them downstream, and let the EMO BON recorders and data curators deal with the problems caused by potentially non-unique source_mat_ids.
I'd be +1 to go this route and not have to deal with the source_mat_ids in all the automated downstream analysis: we generate a UUID per event - if two events have the same source_mat_id, then it is not a problem for downstream.
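To make the "layer on the UUIDs during QC/validation" idea concrete, here is a minimal sketch (Python/pandas, with made-up file and column names rather than the real emo-bon-data-validation code) of minting one uuid4 per sampling-event row at harvest time:

```python
# Minimal sketch (hypothetical file/column names, not the actual validation code):
# mint one random UUID (v4) per sampling-event row at QC/validation time, so that
# downstream steps never depend on source_mat_id being unique.
import uuid
import pandas as pd

df = pd.read_csv("AAOT_water_column_combined_validated.csv")  # example logsheet export

# one uuid4 per row/event; duplicated source_mat_ids still get distinct UUIDs
df["event_uuid"] = [str(uuid.uuid4()) for _ in range(len(df))]

df.to_csv("AAOT_water_column_with_uuids.csv", index=False)

# Caveat (the persistence concern raised in this thread): re-running the harvest
# would mint new UUIDs, so the mapping would need to be stored and reused rather
# than regenerated.
```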
I think I would agree with Marc on this one.
I think we should have both the natural keys (so, the source_mat_id) and the UUID (provided that we are 100% sure that they will not be repeated e.g. across sheets).
The thing is that I would like the source_mat_id to appear as the sample alias in ENA. Since there is a logic behind it, everyone would be able to understand quite easily which sample_name (e.g. EMOBON003781) corresponds to which actual physical sample. If everything else is lost, this source_mat_id will still be informative enough for all the subsequent users, whereas a UUID will not.
As we have seen already in the previous operation of EMO BON, this has caused issues with duplicate "source_mat_id"s and malformed, or mis-translated, ids from one sheet to another - some 90 or so keys were missing from the run-information-batch-XXX.csv sheets.
@cymon which run-information-batch-XXX.csv sheets are you referring to that are missing ids?
This was my original idea all along... keep the source mat IDs, and yes, if we find duplicates we fix them, but with UUIDs that we create on our side (not in the googlesheets) this will not mess up the harvest-qc-transformation, and it is a better way to "tag" in the ttl files because these will never change. The question was: would we use the uuids in ENA or not? I think Christina answered that question. Not a problem - we just have to make the ENA workflow fully dependent on 100% correctly transformed logsheets. And on that, @marc-portier, should we consider having the ENA workflow run from the TTLs, rather than the CSVs? i.e. all exports run from the ttl, no matter what they are being exported to. ...tho that would require a change to the DwC code that we wrote for ARMs. Still, to be taken up in another discussion.
On Mon, 30 Sept 2024 at 23:00, Christina Pavloudi @.***> wrote:
I think I would agree with Marc on this one.
I think we should have both the natural keys (so, the source_mat_id) and the UUID (provided that we are 100% sure that they will not be repeated e.g. across sheets).
The thing is that I would like the source_mat_id to appear as the sample alias in ENA. Since there is a logic behind it, everyone would be able to understand quite easily which sample_name (e.g. EMOBON003781
So now we are talking about 3 different identifiers: a) UUID b) source_mat_id c) EMOBON ref_code (not sample_name)
The problem at present is that only the source_mat_id gets used in the Google Spreadsheets and that has the potential to be non-unique, which is why we should use UUIDs. The ref_code identifier is unique but is only added in the run_information_batch sheets and never appears in the raw Google Sheets. We have to use the source_mat_ids to locate the ref_code for the sequence data and downstream. I add the ref_code back to the validated combined sampling/measured observatory log sheets (e.g. AAOT water column https://github.com/emo-bon/emo-bon-data-validation/blob/main/validated-data/observatories/AAOT_water_column_combined_validated.csv) and the combined batch tables (ie. here https://github.com/emo-bon/emo-bon-data-validation/blob/main/validated-data/Batch1and2_combined_logsheets_2024-09-19.csv). Once the ref_code is assigned it should be used downstream in ENA/MetaGOflow etc...
The problem remains the source_mat_id in the GoogleSheets being potentially non-unique. If we want to keep the source_mat_ids, then we need to make sure they are unique. If this is done then adding UUIDs would be pointless and just adds unnecessary additional complexity.
corresponds to which actual physical sample. If everything else is lost,
this source_mat_id will still be informative enough for all the subsequent users, whereas a UUID will not.
Not if it is not unique.
As we have seen already in the previous operation of EMO BON, this has caused issues with duplicate "source_mat_id"s and malformed, or mis-translated, ids from one sheet to another - some 90 or so keys were missing from the run-information-batch-XXX.csv sheets.
@cymon https://github.com/cymon which run-information-batch-XXX.csv sheets are you referring to that are missing ids?
This was when I first ran the validators and it was discovered the source_mat_id formula was wrong on some sheets; Melanthia fixed it.
If you've decided that you want to keep the source_mat_ids then we need to find a way in the Google sheets to ensure they are unique (and not implement UUIDs).
This is getting to the level where it deserves a conversation, to then be summed up here.
Also, I think we need to establish some clear wording:
uuid is a broad term; when pointing to the technical spec it is worth understanding it comes in many variants (versions) suitable for different cases (e.g. v3 for reproducible hashes, v4 for randomly generated unique ids) - see the sketch below this list
having "meaningless identifiers" is considered good practice --> all associated information (including the parts consisting of the uniqueness, i.e natural key) should be easy to find - and do not reduce "clarity" imho
and I found another duplicate, in the first logsheet I looked at - https://docs.google.com/spreadsheets/d/1kSlTcNfXaCQv7fIKVbBnt5PmRW3_H7Trryh4FJHyLkE/edit?gid=753216632#gid=753216632 - water rows 443 and 448 end up the same, because their size_frac_low values are the same while their size_frac_up values are different
https://docs.google.com/spreadsheets/d/1-1VsUUbRtKselxu-y2BghRIrJdaf74SbmvK42ntvd20/edit?gid=1167865467#gid=1167865467 also has repeats, because the size_frac_up and size_frac_low are the same but the volumes are different
So ............ ?!?!?!?!
I checked 2 others and they were fine (once I corrected a mistake in a date). But clearly, we cannot rely on the material sample ids being unique as they are now, unless we actually start deleting data from the logsheets (for example, do we really need to sequence data from different volumes?).
I think we need a UUID approach for data management. If we want to keep human-readable mat samps IDs we either need to come up with a unique approach that includes very little info (e.g. event ID + an iterator) or all possible info. Bear in mind that we can add size frac and volumes and all sorts of other things in ENA in the sample template.
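For illustration, a minimal sketch of the kind of automated duplicate check being discussed here (Python/pandas; the file name and the columns printed are assumptions, not the real QC code):

```python
# Flag duplicated source_mat_ids in a harvested logsheet export (hypothetical name).
import pandas as pd

df = pd.read_csv("EMO_BON_Metadata_Water_Column_VB_IMEV_sampling.csv")  # example export

dupes = df[df.duplicated(subset="source_mat_id", keep=False)]
if not dupes.empty:
    print("Duplicate source_mat_ids found:")
    print(dupes[["source_mat_id", "size_frac_low", "size_frac_up"]])
```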
The metadata are wrong......... It's the providers' issue; they have added the information in the wrong places, they have filled in cells incorrectly. In EMO_BON_Metadata_Water_Column_VB_IMEV, lines 442-446 are:
BUT it's wrong, I will update my comment later on because it should be corrected.
Ok, so coming back to this... WaSOP2 involves sampling with a 20 um plankton net and the samples are later filtered through a 10 um filter. WaSOP3 involves sampling with a 200 um plankton net and the samples are later filtered through a 10 um filter. Which means that it is too confusing to add information to these cells... One could say for WaSOP2 that size_frac_low should be 20, since this is the mesh size of the plankton net, so organisms larger than 20 um are retained in the filter. However, later this is filtered at 10 um, so I would say that now size_frac_low becomes 10 and size_frac_up becomes 20, so size_frac should be 10-20. Similarly with WaSOP3.
The blanks (line 446 and line 451) should both have size_frac_low 10 and size_frac_up NA. They have nothing to do with actual plankton nets, they are both distilled water passed through a 10 um filter.
So, I see what you mean with the duplicates... Should we correct (once more) the equation and include the size_frac instead (column Q in this case)?
In EMO_BON_Metadata_Water_Column_UMF_UmU, I see an example in lines 6 and 17. I don't know, should we also add the volume (column O) in the equation?
So having the mat samp ID as we do now does provide a useful way to check on the validity of the columns :-) - tho of course we can do that in the QC (note to self: check that the QC has a check that size_frac_up > size_frac_low, and ask Christina where NAs are and are not allowed).
The only thing I am worried about is: what if another thing pops up that makes us have to add yet another thing to the sample ID? In a way that is why I like my idea of the sample id being eventID_[iterator] -> it will never be repeated (the only thing that will be hard to do will be to encode this in a googlesheet - it would have to be done on our end). Is there a reason not to do this?
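And a sketch of the size_frac sanity check mentioned in the note-to-self above (again with assumed file and column names, and simply skipping rows where either value is NA):

```python
# QC rule sketch: size_frac_up should exceed size_frac_low wherever both are numeric.
import pandas as pd

df = pd.read_csv("EMO_BON_Metadata_Water_Column_UMF_UmU_sampling.csv")  # example export

lo = pd.to_numeric(df["size_frac_low"], errors="coerce")
hi = pd.to_numeric(df["size_frac_up"], errors="coerce")

bad = df[lo.notna() & hi.notna() & (hi <= lo)]
if not bad.empty:
    print("Rows where size_frac_up <= size_frac_low:")
    print(bad[["source_mat_id", "size_frac_low", "size_frac_up"]])
```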
On Tue, 1 Oct 2024 at 23:44, Christina Pavloudi @.***> wrote:
Ok, so coming back to this... WaSOP2 involves sampling with a 20 um plankton net and the samples are later filtered through a 10 um filter. WaSOP3 involves sampling with a 200 um plankton net and the samples are later filtered through a 10 um filter. Which means that it is too confusing to add information to these cells... One could say for WaSOP2 that size_frac_low should be 20, since this is the mesh size of the plankton net, so organisms larger than 20 um are retained in the filter. However, later this is filtered at 10 um, so I would say that now size_frac_low becomes 10 and size_frac_up becomes 20, so size_frac should be 10-20. Similarly with WaSOP3.
The blanks (line 446 and line 451) should both have size_frac_low 10 and size_frac_up NA. They have nothing to do with actual plankton nets, they are both distilled water passed through a 10 um filter.
So, I see what you mean with the duplicates... Should we correct (once more) the equation and include the size_frac instead (column Q in this case)?
In EMO_BON_Metadata_Water_Column_UMF_UmU https://docs.google.com/spreadsheets/d/1-1VsUUbRtKselxu-y2BghRIrJdaf74SbmvK42ntvd20/edit?gid=1167865467#gid=1167865467, I see an example in lines 6 and 17. I don't know, should we also add the volume (column O) in the equation?
"You'll regret using natural keys" by Mark Seemann: https://blog.ploeh.dk/2024/06/03/youll-regret-using-natural-keys/
On Wed, 2 Oct 2024 at 07:56, Katrina Exter @.***> wrote:
So having the mat samp ID as we do now does provide a useful way to check on the validity of the columns :-) - tho of course we can do that in the QC (note to self: check that the QC has a check that size_frac_up > size_frac_low, and ask Christina where NAs are and are not allowed).
The only thing I am worried about is: what if another thing pops up that makes us have to add yet another thing to the sample ID? In a way that is why I like my idea of the sample id being eventID_[iterator] -> it will never be repeated (the only thing that will be hard to do will be to encode this in a googlesheet
A bigint series is typically how you define a PRIMARY KEY in a relational database table, and Google Sheets has series too:
Automatically create a series https://support.google.com/docs/answer/75509?hl=en&co=GENIE.Platform%3DDesktop
The problem with assigning a unique series to all EMOBON sampling events is that the sampling events are spread among 20+ separate sheets and a series is only unique to a sheet. This is why the unique identifier "ref_code" is assigned when the run_batch_information sheets are collated (this is great and we can use this ultimate unique identifier for all downstream). The ref_codes could be added to the GoogleSheets retrospectively, but this does not solve the problem of the source_mat_id (which is still the only identifier in the sheets) not being unique - if a ref_code is assigned to a specific source_mat_id, but there are two of them (duplicated) in the sheets, then which one is it?
There are two options:
a) ensure that the source_mat_id is unique - this is proving difficult for now obvious reasons - it would have to be constrained to be unique, e.g. here https://webapps.stackexchange.com/questions/19927/forcing-uniqueness-in-a-google-spreadsheets-column and here https://support.google.com/docs/thread/150859914/problem-using-data-validation-to-prevent-duplicate-entries-in-google-sheets?hl=en
b) define another unique identifier in the googlesheets, either a UUID, or some series like "eventID_[series iterator]" (I don't know what eventID here is)
eventID is Observatory_sampletype_date
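A toy illustration of the ambiguity described above (all values and column names below are made up): if a source_mat_id is duplicated in the logsheets, a ref_code assigned against that id cannot be attached to a single row.

```python
# Made-up data showing why a duplicated source_mat_id makes the ref_code mapping ambiguous.
import pandas as pd

logsheet = pd.DataFrame({
    "source_mat_id": ["EMOBON_VB_Wa_210621_10um_1", "EMOBON_VB_Wa_210621_10um_1"],
    "size_frac_up": [20, 200],
})
run_info = pd.DataFrame({
    "ref_code": ["EMOBON003781"],
    "source_mat_id": ["EMOBON_VB_Wa_210621_10um_1"],
})

merged = logsheet.merge(run_info, on="source_mat_id", how="left")
print(merged)   # the single ref_code now appears on two rows -- which one is it?
```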
Ah! I have an idea! What if, instead of the UUID that Cymon mentioned, we use the MD5 checksums? We have two checksums for each sample (since we do paired-end sequencing) but maybe it is worthwhile thinking about whether we could explore this somehow?
@cpavloud that is pretty much exactly the same as my suggestion of using uuid.v3 (which is based on md5 hashes, according to wikipedia)
it combines and obfuscates the parts of our natural key so people stop making assumptions about the key, still allows us to persistently generate the same key out of the parts, and keeps us able to detect that people are reusing these parts in places where they should not
it does not, as we already realise, prevent people from making that kind of mistake
but I would argue that if we go for generated ids (i.e. uuid.v4 as @cymon suggested) you will not only keep the existing possibility of errors, i.e. attaching the wrong natural key parts to an id (which you could of course still detect as colliding with earlier use), but also introduce the new possibility of forgetting (by lack of persistence) earlier generated keys that are already in use as references deeper in the process?
hm, this makes me think about the timing of all this in the flow - that might help us decide what to do here?
what if that later process usage is in fact so elaborate that fixing the earlier steps is easier / has lower impact?
the main benefit of a totally detached id (i.e. a generated random uuid.v4 or some sequence - anything not calculated / hashed out of the content) is that one can go in (late) and fix the natural key parts describing it without affecting the key --- so really the question we should ask is "how fast can we detect and correct these errors?" (only by duplicate use? -- that might (a) be late and (b) could in fact show that not the second but the first use was in error) -- can we ensure ironing these out before they get used as references further down in the project, and/or what should be the natural / accepted impact of a late detection?
still, I remain convinced that such detached keys introduce the possibility of forgetting (lacking persistence) and thus creating the opposite problem -- not mixing or colliding parts of the data that should be kept separate, but instead splitting/partitioning parts of the data-graph that should be connected -- so a fix for those situations would need to be assessed and foreseen
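One possible shape for the script-governed persistence being argued for above, sketched with made-up file and key names (an assumption for illustration, not an agreed design): keep the minted UUIDs in a registry outside the spreadsheet so a key is only ever minted once.

```python
# Minimal sketch of a script-governed key registry: the natural-key -> uuid mapping
# lives outside the spreadsheet (e.g. a JSON file under version control), so detached
# uuid4 keys are never forgotten or re-minted on later runs.
import json
import uuid
from pathlib import Path

REGISTRY = Path("event_uuid_registry.json")   # hypothetical persistent store
registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}

def uuid_for(natural_key: str) -> str:
    """Reuse the UUID minted earlier for this natural key; mint a new one only once."""
    if natural_key not in registry:
        registry[natural_key] = str(uuid.uuid4())
    return registry[natural_key]

print(uuid_for("EMOBON_NRMCB_Wa_210621_3um_1"))   # minted on first call
print(uuid_for("EMOBON_NRMCB_Wa_210621_3um_1"))   # same UUID on every later call
REGISTRY.write_text(json.dumps(registry, indent=2))
```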
We have 2 issues here: 1) the key as used in the logsheets - we need something there that looks recognisable to humans but is unique; 2) the key to actually use in the data management.
So I am in favour of using UUIDs generated on the GH side, rather than in the googlesheet, and a simpler key in the logsheets that we can use in ENA (we can also add the UUIDs in ENA as a free field) - my suggestion of eventId_iterator would mean that for this logsheet - https://docs.google.com/spreadsheets/d/1F4AWv_seI-DQJ_Gp_N2ziwvGyjnc4KC92GGNXrJRek4/edit?gid=1191455358#gid=1191455358 - the first 6 entries in the final column would be
EMOBON_NRMCB_Wa_210621_1
EMOBON_NRMCB_Wa_210621_2
EMOBON_NRMCB_Wa_210621_3
EMOBON_NRMCB_Wa_210621_4
EMOBON_NRMCB_Wa_210621_5
EMOBON_NRMCB_Wa_210621_6
and then when we have a new date, the iterator starts again at 1
The facts that the first 4 are 3 um samples for replicates 1-4 and that the second two are blanks for 3 um are contained in the replicate and size_frac_low columns
...had a bit of a think overnight. Realised that if the eventID+iterator is unique (which it would be in the absence of logsheet errors) we do not need the UUID. However, I am not sure that we can create this eventID in the way I suggest in the googlesheet, as I am not sure there is an equation that will add an iterator only within the same set of eventIDs... if you see what I mean. Question is: does this matter? We can create this after harvesting and then just keep the material sample id as it is now (but not use it as a key anywhere)
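A sketch of how the per-event iterator could be built after harvesting (Python/pandas; the input file and column names are assumptions about the harvested table, not the real logsheet headers):

```python
# Build eventID_[iterator] identifiers where the iterator restarts within each event
# (here assumed to be observatory + sample type + sampling date).
import pandas as pd

df = pd.read_csv("harvested_water_sampling.csv")  # example harvested logsheet

event_id = (
    "EMOBON_"
    + df["observatory"].astype(str) + "_"
    + df["sample_type"].astype(str) + "_"
    + df["sampling_date"].astype(str)
)
# cumcount() numbers rows within each event group starting at 0, so add 1
df["sample_id"] = event_id + "_" + (df.groupby(event_id).cumcount() + 1).astype(str)

# produces ids in the spirit of EMOBON_NRMCB_Wa_210621_1 ... _6,
# restarting at _1 for the next sampling date
```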
OK guys, with extreme urgency we need to decide what to do here. An update to what I suggested above, which comes from Cymon, is that the source_mat_ids are formulated this way:
EMOBON_NRMCB_Wa_210621_1
EMOBON_NRMCB_Wa_210621_2
EMOBON_NRMCB_Wa_210621_3
EMOBON_NRMCB_Wa_210621_4
5 6 7 8 9 10
i.e. the number keeps going up regardless of the event that sample belongs to. This is easy to do in the logsheets AND we can always decide that when we harvest the logsheets, on the GH side, we change that to my suggestion above, which is to iterate over each event only.
Do we need a meeting?
This could be simplified even more, and be something like
EMOBON_NRMCB_Wa_1
EMOBON_NRMCB_Wa_2
EMOBON_NRMCB_Wa_3
EMOBON_NRMCB_Wa_4
and so on and so forth so that it will never need to start re-iterating
So maybe simple is the way to go?
yes, I would agree with that one. @cymon and @melanthia ? It will require changing the equation in all the logsheets but that should be simple - Cymon is the spreadsheet master and he can advise :-D
On Wed, 16 Oct 2024 at 15:28, Katrina Exter @.***> wrote:
yes, I would agree with that one. @cymon https://github.com/cymon and @melanthia https://github.com/melanthia ? It will require changing the equation in all the logsheets but that should be simple - Cymon is the spreadsheet master and he can advise :-D
He's really not ;)
but this would guarantee uniqueness of the key regardless of what recorders put in the other columns, and this is what we need.
So I'm +1
C. --
Cymon J. Cox
Ok then! Simplicity is the way to go! Maybe @melinalou can do the changes in the equations?
The way to do the equation would be
Water sampling tab, final column =CONCATENATE(observatory!$A$2,"_",observatory!$R$2,"Wa","",????) (in the first row and then drag-drop down) The ??? is something I hope one of you knows: how to increment a number downwards (i.e. 1 then 2 then 3 etc as you go down the rows)
Water measured tab, first column: no change, just =sampling!AN2 (in the first row and then drag-drop down) BUT please check that this is what they all say as some of them are referring to OLD SAMPLING rather than SAMPLING
Sediment sampling tab, final column =CONCATENATE(observatory!$A$2,"_",observatory!$R$2,"So","",????) (for the first row and then drag drop down)
Sediment measured tab, first column: needs to say =sampling!AL2 (for the first row and then drag drop down)
I will do the changes by tomorrow!
all done!
you forgot the final _ before the final number. Can you add that please? Also, you need to update the definition just for source_mat_id: you can say "This identifier has to be unique to each row: it consists of the terms EMOBON_Observatory_Sample type_an iterator (1,2,3 etc), and this is created via a google equation. To add this identifier to a new row, simply drag the value from the cell above."
oooh, and also - you should do the ARMS ones (using the phrase "Ha") also.
you forgot the final _ before the final number. Can you add that please?
I will do it now but just for the stations from water batch 1 and 2 ("ROSKOGO", "BPNS", "RFormosa", "NRMCB", "IUIEilat", "HCMR-1", "AAOT", "OSD74" , "PiEGetxo", "EMT21", "VB" , "ESC68N" ,"MBAL4") and sediment batch 1 and 2 ( "ROSKOGO", "BPNS", "RFormosa","OOB" ,"NRMCB" ). There is no time to do the rest. @melinalou can you please check them?
grand, thanks - so as a reminder the value should look like EMOBON_BPNS_Wa_1. The only thing we have to protect against is this: indeed, if the _1 _2 _3 etc. is added just through the drag-drop action, then if someone did move a row around, its sample ID would change, and this is a no-no. Now, I cannot imagine why anyone would move a row... which means that someone at some point will. Is there a way to tackle this?
I also thought about this. I think that, after we drag the equation to fill in the cells, we can copy-paste-only-values in all the cells but the last one. This way we will keep the equation alive in the last cell, but freeze the rest of the cells.
eh?
Check this file for example https://docs.google.com/spreadsheets/d/1mEi4Bd2YR63WD0j54FQ6QkzcUw_As9Wilue9kaXO2DE/edit?gid=1482286037#gid=1482286037
I copied the column AN and pasted it in the column AN but as values.
ahhh... in fact, we don't need any of the logsheets to have this column as an equation! We just need to always create the first one and then all subsequent ones are dragged down -> if someone does move rows then this will not be changed (at least, not automatically). That works! However, perhaps we should again lock that column from editing and only allow HQ to do it - just so no-one changes these values?
you forgot the final _ before the final number. Can you add that please? Also, you need to update the definition just for source_mat_id: you can say "This identifier has to be unique to each row: it consists of the terms EMOBON_Observatory_Sample type_an iterator (1,2,3 etc), and this is created via a google equation. To add this identifier to a new row, simply drag the value from the cell above."
Yes I will, but the equation you wrote above is without the _ and that is why I haven't put it: "=CONCATENATE(observatory!$A$2,"",observatory!$R$2,"Wa","",????)"
woops :-} so sorry. But actually, before you do anything, what do you think of Christina's and my suggestion about how to create the ID not from an equation but via putting in the first value and then just dragging down, but locking the column also? @melanthia WDYT?
I agree with Katrina. I think we should minimise the risk of false or inaccurate IDs by all means in the future.
ok! So I will make the source_mat_id column just values with an increasing number.
yes, and then @cpavloud or @isanti can lock the column so that only Ioulia can change it in the future (it is annoying to have this manual step, but unfortunately necessary). Remember to change this for Wa, So, and ARMS, and also change the description - adding the info that "this column can only be filled by EMO BON HQ".
Done for Wa and So. For ARMS, what format do we want? For example: EMOBON_Koster_VH1_Ha_210608_220608_SF40_1 - do I keep it like this?
no, change it in the same way but use Ha instead of Wa or So
Done. you can check 1-2 to be sure that they are properly changed.
well, yes and no - it is correct; however, the fact is that most of these logsheets only have info from Koster in there - as an example - so you would not change just one column for those examples. However, @JustinePa will be removing those examples and putting in whatever information she has for those observatories that do have info. After that, we should tidy up the logsheets which only have the example entries in them. So @melinalou, yes, it is OK.
since we have not decided to go for UUIDs (which is the title of this issue), at least not in the logsheets, I think we can close this issue?
Ok!
The case for replacing source_mat_ids with UUIDs as the unique identifier of sampling events
Sampling events in EMO BON are a primary data point from which other data are generated and to which those data are attached. As such, sampling events must be uniquely identified so that data are not attached to the wrong sampling event. As originally conceived, EMO BON uses the "source_mat_id" as the unique identifier; it is automatically constructed from 4 different fields of the sample metadata. This is what is known as a natural key. As we have seen already in the previous operation of EMO BON, this has caused issues with duplicate "source_mat_id"s and malformed, or mis-translated, ids from one sheet to another - some 90 or so keys were missing from the run-information-batch-XXX.csv sheets (the shipping information which is used to link sampling events to DNA data). Fields can also be updated, causing the key to change and the chain of provenance for the sampling event to be lost.
These are really two separate issues: keys that are duplicated (not unique), and keys that change because the fields they are built from change.
There are two ways to go here:
we tighten the procedure for generating the "source_mat_id"s to ensure they are unique. This may not be possible because of the way Google Sheets works. For instance, if we can add a verification check that the "source_mat_id" is unique in its field, how does that affect the data used to format the key in the other fields? We may be asking too much of Google Sheets here, but it could be checked to see if it works.
an alternative would be to use UUIDs. UUIDs can be used "to identify something with near certainty that the identifier does not duplicate one that has already been used".
This solves the uniqueness problem, but could be seen as less portable if requiring human intervention. However, I would assume no one will ever be required to write a UUID or source_mat_id by hand and that it is always done by copy/paste if needed. So the actual risk seems no different.
UUIDs can be automatically added to Google Sheets when row data is entered.
Once the key is generated it would need to be used throughout other EMO BON procedures (they would replace the source_mat_ids and make them superfluous), which will be fine if the procedure is automated, and no worse than what is done now if requiring human intervention.