Closed cymon closed 2 months ago
@cymon, can you add the link here to where you got the IDs for the first column, and explain what is in the following columns and which files they come from? That makes it easier to diagnose the problem.
The first column of IDs are the emo-bon source_mat_ids (the primary identifier) from the Batch 1 and 2 run_information sheets: batch 1 batch 2
The next three are close matches found with the Python difflib.get_close_matches() function on the 4047 valid source_mat_ids from the logsheets.
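For reference, a minimal sketch of how such a lookup could be done with difflib.get_close_matches(). The IDs, the observatory code "XX", and the helper name closest_valid_ids are made up for illustration and are not taken from the actual sheets; the cutoff of 0.6 is simply the library default.

```python
from difflib import get_close_matches


def closest_valid_ids(candidate, valid_ids, n=3, cutoff=0.6):
    """Return up to n valid IDs that resemble candidate, best match first."""
    return get_close_matches(candidate, valid_ids, n=n, cutoff=cutoff)


# Hypothetical IDs, only to show the shape of the call.
valid_ids = [
    "EMOBON_XX_Wa_210601_0.2um_1",
    "EMOBON_XX_Wa_210601_3um_1",
    "EMOBON_YY_So_210601_micro_1",
]
run_info_id = "EMOBON_XX_Wa_210601_200um_1"

matches = closest_valid_ids(run_info_id, valid_ids)
print(run_info_id, matches)
```

An ID with no candidate above the cutoff gives an empty list, which is a quick way to spot the cases that need a manual look.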
'Total number of all_source_mat_ids_from_sheets: 4047' (all the source_mat_ids from all sheets, "sampling" and "measured", from all observatories that match the correct source_mat_id format, specifically, len(value.split("")) < 6).
A total of 90 source_mat_ids in the Batch 1 and 2 run information sheets are missing from the observatory sampling sheets (i.e. they are not among the 4047 valid source_mat_ids in the sheets).
Total combined_events is 138 + 90 = 228, which should equal the total number of refcodes assigned in the run information sheets, 227 (no idea yet why it does not).
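A rough sketch of the comparison described above, assuming both ID collections have already been read into plain Python lists. The function names and sample IDs are hypothetical. The duplicate check is only one thing that could be ruled out when counts disagree; it is not a known explanation for the 228 versus 227 difference.

```python
from collections import Counter


def find_missing(run_info_ids, sheet_ids):
    """IDs present in the run information sheets but absent from the logsheets."""
    return sorted(set(run_info_ids) - set(sheet_ids))


def find_duplicates(ids):
    """IDs that occur more than once in a list."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)


# Hypothetical data, only to show the behaviour.
run_info_ids = ["ID_A", "ID_B", "ID_B", "ID_C"]
sheet_ids = ["ID_A", "ID_C", "ID_D"]

missing = find_missing(run_info_ids, sheet_ids)
duplicates = find_duplicates(run_info_ids)
print("missing:", missing)
print("duplicates:", duplicates)
```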
From the logsheets in GH in the observatories (which are old), your newer ones in GH, or from the Google Drive itself?
All of the "sampling" and "measured" data were harvested directly from the observatories' Google Sheets; I used the links provided in
emo-bon/governance/logsheets.csv
to identify the sheets. So the "newer" ones I put on GH are derived directly from the observatory Google Sheets.
Can the emo-bon/observatory-{observatory_id}-crate/main/logsheets
be updated? They would be a better starting point than the raw Google Sheets.
That is for @marc-portier to solve - the pipeline that gets those does the QC and other stuff also and if it fails anywhere along the way, nothing gets harvested. Marc is looking to bypass this code until Bram can get back to fix it, so you have to ask him.
Only "Bergen" appears to be missing a "transformed" data sheet, but those that are transformed may not be very up to date, as you say.
indeed, all are old - very pre-summer-QC work
The source_mat_ids have changed in the logsheets, so these errors are no longer relevant.
Missing source_mat_ids - or how the source_mat_ids in the Batch run information sheets do not match the source_mat_ids in the Google logsheets
These are the emo-bon source_mat_ids (the primary identifier) from the Batch 1 and 2 run_information sheets that are missing from the Google logsheets. The leftmost code is the one found in the Batch 1 and 2 run_information sheets, which are presumably the correct ones as they are manually curated by Ioulia. The next three are close matches using the Python difflib.get_close_matches() function.
Most have obvious matches where 200um has been changed to 0.2um, but some are less clear.
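The obvious cases could be separated from the unclear ones with a simple substitution check. This is a sketch under the assumption that the only difference is the literal string 200um versus 0.2um; the helper name and sample IDs are hypothetical.

```python
def resolve_size_fraction(missing_ids, valid_ids, old="200um", new="0.2um"):
    """Split missing IDs into those fixed by swapping old for new and the rest."""
    valid = set(valid_ids)
    resolved = {}
    unresolved = []
    for missing_id in missing_ids:
        candidate = missing_id.replace(old, new)
        if candidate != missing_id and candidate in valid:
            resolved[missing_id] = candidate
        else:
            unresolved.append(missing_id)
    return resolved, unresolved


# Hypothetical IDs, only to show the behaviour.
missing_ids = ["A_200um_1", "B_200um_1", "C_3um_1"]
valid_ids = ["A_0.2um_1", "C_3um_2"]

resolved, unresolved = resolve_size_fraction(missing_ids, valid_ids)
print("resolved:", resolved)
print("needs a manual look:", unresolved)
```

Anything left in the unresolved list would still need the close-match output and a manual decision.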