Open adjordan74 opened 3 years ago
Anders Illum and I have gone through the numbers and decided that the numbers that were taken by other projects (e.g. 282 for the antlion registration) will be reassigned to new numbers. The catalogue numbers between 222901 and 223290 will from now on run from 308334 to 308800, and we need to correct (or add) these in the spreadsheet, which is attached here.
New NHMD numbers have now been added to the Excel file for the individuals that had overlaps. flatten_class_butterflies_sorted_new_NHMD_numbers.xlsx
Dear Fedor and Anders,
Thanks for the list and the update. Two thoughts:
1. Wouldn't it be better for the burnet moths to keep their original numbers, now that the QR code is in the image?
2. Next, is there a batch-import function so we can import the list? It would be good to get the metadata in, after which the material can go into GBIF and the portal.
Best regards, Anders
Hi Anders,
The problem is that the other animals have already been entered in Specify, so it is a big task to give them new numbers. It is considerably easier to correct the numbers in Excel before import than to correct the already-entered animals individually in Specify. I have already removed the old NHMD numbers and given new numbers to the first 100 butterflies. We can also take new photos of the labels for the butterflies in question, possibly with the DaSSCo equipment when it arrives.
Best regards, Anders
Hi Anders, Thanks for the reply - I suspected there was more trouble behind this, but I had to ask. Yes, new photos can perhaps be taken when DaSSCo arrives. Great that you have given them new numbers. Best regards, Anders
What is the plan for cleaning up the spreadsheet? I can see that most entries have 3 transcriptions while others have 15. How do we turn these into one transcription per individual?
Hi Anders. I have a summary file in HTML format (one transcription per individual) which I will see if we can export to CSV. I'll look into what will make this exercise least painful.
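The step of going from multiple transcriptions down to one per individual could, as a rough sketch, be done with a per-field majority vote over rows sharing the same catalogue number. The column names and the `collapse_transcriptions` helper below are assumptions for illustration, not the actual summary-file export:

```python
from collections import Counter

def collapse_transcriptions(rows, key="nhmd_no"):
    """Collapse several transcriptions of the same specimen into one
    row per specimen, letting the most common non-empty value win."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)
    collapsed = []
    for votes in grouped.values():
        fields = {field for row in votes for field in row}
        merged = {}
        for field in fields:
            values = [row[field] for row in votes if row.get(field)]
            # majority vote; ties fall back to the first value seen
            merged[field] = Counter(values).most_common(1)[0][0] if values else ""
        collapsed.append(merged)
    return collapsed

rows = [
    {"nhmd_no": "NHMD-219401", "species": "Adscita statices", "locality": "Jylland"},
    {"nhmd_no": "NHMD-219401", "species": "Adscita statices", "locality": "Jyland"},
    {"nhmd_no": "NHMD-219401", "species": "Adscita statices", "locality": "Jylland"},
]
print(collapse_transcriptions(rows))
```

Majority voting only resolves disagreements where one spelling dominates; ties and systematic misreadings would still need a manual pass.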
I would very much like to use this material as a test case for bulk-importing images and manually linking them to records. I will first run some experiments in the sandbox database. I would start with the first batch without overlaps (219401-222900). I can see that these already exist in the database as largely empty records, so I am considering the following approach:
I need to get hold of the media. @adjordan74, have you put them on an I: drive, or can you make sure that happens? I am happy to help, since the file sizes probably make it a bit of a challenge.
Hi Fedor, that sounds good. The only challenge is streamlining the data sheet, and the question is whether that can be scripted or has to be done by hand.
Media files: All the media files are on the I: drive under Samling-Image-Repro (in the folder "Sertifer"). There they are named after their NHMD number (so most of them match).
Best regards, Anders
Hi Anders,
I'll figure something out.
Found the files! They appear to be thumbnails already, so I will probably write a script that copies the files into their respective folders.
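A copy script along those lines might derive each file's target folder from the NHMD number in its name. This is only a sketch; the `Sertifer_sorted` target root and the exact file-naming pattern are my assumptions, not the actual script:

```python
import re
import shutil  # used by the commented copy loop below
from pathlib import Path

NHMD_RE = re.compile(r"NHMD-?(\d{6})")

def destination_for(filename, root="Sertifer_sorted"):
    """Derive a per-specimen target path for a thumbnail from the NHMD
    catalogue number embedded in its file name; None if no number found."""
    match = NHMD_RE.search(filename)
    if match is None:
        return None  # no catalogue number; leave for manual sorting
    return Path(root) / f"NHMD-{match.group(1)}" / filename

# Copying would then be a loop over the source folder, e.g.:
#   for src in Path("Sertifer").iterdir():
#       dst = destination_for(src.name)
#       if dst is not None:
#           dst.parent.mkdir(parents=True, exist_ok=True)
#           shutil.copy2(src, dst)
```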
By the way: If everyone who has made determinations must be recorded in the database, I will need a way to link their often quirky "usernames" (see the attached spreadsheet) to real names and import these as agents first. If that is not necessary, I would rather skip it, as it complicates things quite a bit.
Hi Fedor,
Are you aware that Specify has quite a good Attachment tool for uploading large volumes of attached files? They have made a nice video describing the process here: https://vimeo.com/125299185 (batch upload starts around 14:30).
Is there any need to upload thumbnails? Specify generates thumbnails itself, and won't the portal do so as well?
Best regards, Anders
No, I wasn't! Yet another corner of Specify I hadn't explored... It looks promising, not least with regard to other tickets, namely #74, #8 and #54.
Thanks! I'll look into it!
Hi Anders,
I have now been through all the butterflies that needed new numbers. The last butterfly in the box was not in the spreadsheet, but it had been given the NHMD number NHMD-223233. It is an Adscita statices (green forester). Can you see whether photos have been taken of it? I have given it a new NHMD number, 308666.
Otherwise everything seems to add up :-)
Best regards, Anders I.
Here it comes. I am currently investigating how OpenRefine can prepare datasets. If you know that program, our data can probably be prepared and exported to GBIF most quickly.
Best regards, Anders
@Andersillum I won't get away without scripting entirely, I can see, in order to prepare the files for import via Specify. They need to be moved into the same folder and renamed so that they don't overwrite one another. But I'll figure it out! 😁
[@adjordan74] Just be aware that I will run a script that renames all the files in the "Sertifer" folder to include the catalogue number, to ease the import. That is, files will be renamed from "Image001.tif" to "NHMD-219401-Image001.tif", and so on.
That shouldn't break anything, but I assume we can always retrieve them again from the source for the time being?
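In sketch form, the rename could look like the following. The `prefixed_name` helper is hypothetical, and the guard makes a rerun a no-op so already-renamed files are not prefixed twice:

```python
from pathlib import Path

def prefixed_name(filename, catalogue_no):
    """Return '<catalogue_no>-<filename>', unless the file already
    carries that prefix (so rerunning the script changes nothing)."""
    if filename.startswith(f"{catalogue_no}-"):
        return filename
    return f"{catalogue_no}-{filename}"

def rename_all(folder, catalogue_no, dry_run=True):
    """Rename every file in `folder`; with dry_run=True, only report
    the planned (old, new) pairs without touching the file system."""
    renames = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            new_name = prefixed_name(path.name, catalogue_no)
            if new_name != path.name:
                renames.append((path.name, new_name))
                if not dry_run:
                    path.rename(path.with_name(new_name))
    return renames
```

A dry run first makes it easy to eyeball the planned renames before committing, given that the originals remain retrievable from the source.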
[@adjordan74] By the way: If everyone who has made determinations must be recorded in the database, I will need a way to link their often quirky "usernames" (see the attached spreadsheet) to real names and import these as agents first. If that is not necessary, I would rather skip it, as it complicates things quite a bit. Determiners.xlsx
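If the determiners do need to be imported as agents, the username-to-real-name link could be as simple as a lookup table built from the spreadsheet. This sketch uses made-up example data, and the two-column CSV layout is an assumption about the export format:

```python
import csv
import io

def load_agent_map(csv_text):
    """Parse 'username,full name' rows into a case-insensitive lookup
    table, skipping blank or malformed lines."""
    reader = csv.reader(io.StringIO(csv_text))
    return {
        row[0].strip().lower(): row[1].strip()
        for row in reader
        if len(row) >= 2 and row[0].strip()
    }

def resolve_determiner(username, agent_map):
    """Map a Zooniverse handle to a real agent name; None if unknown."""
    return agent_map.get(username.strip().lower())

# hypothetical handles, for illustration only
agent_map = load_agent_map("MothFan42,Jane Doe\nbug_hunter,John Smith\n")
```

Unresolved handles (`None`) could then be listed separately for manual follow-up before the agents are created in Specify.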
The images are now ready for import, but we still need to clean the spreadsheet with the metadata, as agreed with @adjordan74. I will assign @Sosannah to this task so she can start working on it as soon as she can.
Status report (TL;DR): At this point I'm done with several rounds of data cleaning and have tried a test upload to Specify.
Zooniverse_butterflies_200521.xlsx
There is still a lot of dirt in the Collectors and Localities columns, and cleaning them cannot be automated much, so it's a trade-off between working hours and data purity. I've spent a lot of time on it, but it's still far from perfect. OpenRefine worked best for me on the columns above, but it is still a tedious process.
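For a rough sense of what OpenRefine-style clustering does with such columns, near-duplicate spellings can be grouped by string similarity. This toy sketch uses Python's `difflib` with an arbitrary 0.85 threshold, shown for illustration only; it is far cruder than OpenRefine's key-collision and nearest-neighbour methods:

```python
from difflib import SequenceMatcher

def cluster_values(values, threshold=0.85):
    """Group near-duplicate spellings (e.g. locality strings) so that
    one canonical form can be picked per cluster by hand."""
    clusters = []
    for value in values:
        for cluster in clusters:
            # compare against the first (representative) member of each cluster
            similarity = SequenceMatcher(None, value.lower(), cluster[0].lower()).ratio()
            if similarity >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters

print(cluster_values(["Jylland", "Jyland", "Sjælland"]))
```

Picking the canonical spelling per cluster remains a human decision; the clustering only narrows down what has to be looked at.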
At a meeting today with @Andersillum, @TinaThuno and Jan Pedersen, we found further issues with the dataset: some of the entries have several dates on their labels (date of collecting and date of cataloguing), which led to misunderstandings and errors in the data entry.
Due to the issues presented above, we (Anders, Jan, Tina and myself) decided to scan thoroughly through the first 100 entries (also checking the images) to figure out how many errors we can find and how much time it takes to check and correct them.
However, since almost 3,500 objects are already in Specify (created by Martin Stein) with minimal information (mainly the species name), we have three options in my opinion:
1. Remove the old records with almost no information and upload the cleaned, but still dirty, dataset and the images now.
2. Clean the dataset thoroughly and hold the image upload until that is done.
3. Keep the minimal data and upload the images now; then, once we are done with the cleaning, replace the old dataset with the new one.
We can wait with the decision until after the check-up of the first 100 entries.
What is your opinion?
Best regards, Zsuzsanna
Dear Zsuzsanna and Anders
Thanks for this status and good work.
Would there be a fourth option (upload the images and the dataset, aware of the issues)? This would allow us to test the material while still populating the prototype with the external contractor.
I'm aware that we may still need to curate or adjust the data.
@fedor, please weigh in on whether this is a hassle or a way forward.
Thanks
Anders
Hi @adjordan74,
We already have more than enough mess in the entomological database that needs to be cleaned. We should not add more! If you want the external contractors to play with the data and the images, please upload them to a sandbox, not to the entomological database before it has been cleaned. If we upload to Specify now, our experience shows that it will never be cleaned.
If we know that the data is messy, even extremely faulty and therefore unusable, we shouldn't upload it before it has been cleaned.
I know it is a cool project with the help of citizen science, but that should not be a reason to import faulty and extremely messy data into Specify!
Best regards, Anders
I have seen the dataset and talked with Zsuzsanna and @Andersillum, and I generally agree: the data is very dirty, and I trust @Andersillum's assessment that it is too dirty for import.
Still, it would be nice with at least some of those cool moth pictures in the database. Option 3 suggested by @Sosannah (batch uploading the images to the existing minimal information records) could be considered.
We could also try to pick out the cleanest records from the dataset and upload those.
We could also stick our heads together and manually clean as many as possible in an hour-long session.
Dear all,
I agree with Anders. It makes no sense to upload data that is so deficient, despite Zsuzsanna's extensive work on the dataset. The localities in particular will be a very big problem. When a locality is listed as "Jens Har Lilkebong", it makes no sense to anyone, and there are unfortunately a lot of these "alternative" localities.
Perhaps the best alternative is to select a smaller number of specimens from the dataset, as Fedor suggests, but as it stands, these will still need work before publishing.
Hi @FedorSteeman @TinaThuno @Andersillum @Sosannah
I agree with the above, and I think option 1) taking out a section could provide a way forward.
Could we try to circle in on a smaller section and clean/evaluate the geo-locations and dates? I'm aware that some labels did not contain day/month/year (now listed as '00.00.9999').
With this at hand we can extrapolate and get a better feel for the remaining work.
Shall we have a short Zoom call to: 1) settle on a test number, and 2) discuss a consistent way to demarcate dates and geo-references (when they are either absent or not meaningful)?
Best Anders
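One consistent way to demarcate the absent dates mentioned above would be to parse the dd.mm.yyyy strings and treat '00' fields and the '00.00.9999' placeholder as explicitly unknown. A sketch only; the `parse_label_date` helper and its output shape are my assumptions, not an agreed convention:

```python
import re

def parse_label_date(raw):
    """Split a 'dd.mm.yyyy' label date into its parts, mapping '00'
    day/month fields and placeholder years to None (= unknown)."""
    match = re.fullmatch(r"(\d{2})\.(\d{2})\.(\d{4})", raw.strip())
    if match is None:
        return None  # not in dd.mm.yyyy form; needs manual review
    day, month, year = match.groups()
    return {
        "day": None if day == "00" else int(day),
        "month": None if month == "00" else int(month),
        "year": None if year in ("0000", "9999") else int(year),
    }
```

Keeping the parts separate (rather than forcing a fake full date) means a partially known date like "00.06.1934" still preserves the month and year.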
Hi @FedorSteeman @TinaThuno @Andersillum @adjordan74 ,
Yes, we can have a short Zoom meeting. @adjordan74, would you arrange one?
Anyway, we've already suggested taking the first 100 entries and checking them (as a pilot project). :)
Best, Zs.
Hi Zs.
I can do
1) 12:30 today
2) 09:00 (tuesday 25th)
3) 10:00 (wednesday, 26th).
Cheers
For me: 1) 12:30 today or 3) 10:00 (wednesday, 26th).
I can’t today. Wednesday is fine.
First 100 entries along with their images are on Specify.
GREAT!
thanks Anders
I will initiate an export update and subsequent ingestion on GBIF UAT to see if they come through.
Looks like they came out pretty quickly: https://www.gbif-uat.org/occurrence/gallery?q=statices&dataset_key=0879ca9b-0234-4441-84fb-645b8c7a448e
If all is done with this one, I suggest we move it to the backlog and revisit it later, once we have cleaned the rest of the metadata.
We can do that. The next batch is coming in the next few weeks: 50 specimens from each species in the database.
Oh, OK. If more batches of different species are still coming in, just keep it in progress.
The next batch (50 specimens from each species in the database) has been uploaded to Specify/GBIF/Open Collection. I am moving this case to the backlog; we can reopen it after the rest of the metadata is cleaned.
Hi Zsuzsanna
Great, thanks for the note. Best Anders
@TinaThuno has cleaned a huge part of the metadata - with the help of Jan and Anders. Zooniverse_butterflies_TT_JP_AI_111023_Finale.xlsx
After a meeting with @Andersillum and Jan, the following issues/tasks were revealed:
Metadata to be imported into Specify, including photos of the Danish burnet moths (approx. 3,500). All photos are named after their NHMD number. The photos are not attached, but can be sent in high res and low res. Best regards, Anders, 52 82 69 82
flatten_class_butterflies_sorted.xlsx reconciled_butterflies.xlsx