Closed llivermore closed 3 years ago
Hi @benscott After the meeting today, our understanding is that this is no longer happening - the ICEDIG gold standard dataset, i.e. https://zenodo.org/record/3697797#.YS-FSY5KiUk (containing herbarium specimens from a number of institutions, including RBGE), is the dataset that is going to be used, and we do not need to upload specimens directly from RBGE.
I do understand that in the future @martinteklia may need more than the 1800 specimens in this dataset, in which case we have more well-transcribed specimens "easily" available and can upload these to this location.
The gold standard herbarium dataset is remarkably clean and free of the structural and other errors typical of DwC datasets: https://discourse.gbif.org/t/100-gbif-datasets-improved/3042 How do you see machine learning working with messy datasets? One of these?

1. ML will work with datasets only after HitL cleaning
2. ML will work with uncleaned datasets and problems will be referred to HitL
3. ML will abort working with uncleaned datasets and these will be set aside for future processing
@Mesibov our main focus is your second described approach:
(2) ML will work with uncleaned datasets and problems will be referred to HitL
There will be some instances where we do not refer to any subsequent HitL process and publish the data anyway, but make it explicit that some of the results come from an ML process and either do not meet a threshold for trustworthiness or may require more scrutiny before use. An example: if we are OCRing labels, even messy OCR output is useful for searching, both for internal/institutional users and for external users.
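The idea above - publish ML-derived data but flag its provenance and trustworthiness - could be sketched roughly as follows. This is a hypothetical illustration, not DiSSCo's actual pipeline: the field names (`ocrConfidence`, `meetsTrustThreshold`) and the 0.8 cutoff are invented for the example.

```python
# Hypothetical sketch: publish OCR-derived records regardless of quality,
# but attach provenance and a trustworthiness flag so downstream users can
# decide whether a record needs more scrutiny before use.

TRUST_THRESHOLD = 0.80  # assumed cutoff; a real value would be project-specific


def publish_record(label_text: str, ocr_confidence: float) -> dict:
    """Wrap an OCR result with explicit ML provenance and a trust flag."""
    return {
        "verbatimLabel": label_text,
        "ocrConfidence": ocr_confidence,
        "source": "ML",  # make the ML origin of the data explicit
        "meetsTrustThreshold": ocr_confidence >= TRUST_THRESHOLD,
    }


# A messy OCR result is still published and searchable, just flagged:
record = publish_record("Herbar1um of RBGE, co11. 1923", 0.65)
print(record["meetsTrustThreshold"])  # False: usable for search, needs scrutiny
```

The key design point is that low-confidence records are not discarded; they remain searchable, with the flag deferring the cleaning decision to a later HitL step or to the end user.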
@Cubey0 I am closing this issue as this is now a duplicate of the work being done in #7
@Cubey0 we have created a separate repo for datasets: https://github.com/DiSSCo/sdr-datasets
Can you upload your herbarium sheet dataset there?
Contact @benscott if you need any guidance :)