NHMDenmark / DanSpecify

Important files regarding the Danish instance of the Specify database system for collections digitisation and management, plus a placeholder for issue tracking. Guidelines, manuals and other kinds of documentation will be gathered on the wiki.

Butterflies import (Zooniverse) #51

Open adjordan74 opened 3 years ago

adjordan74 commented 3 years ago

Metadata to be imported into Specify, including photos of the Danish burnet moths (approx. 3500 specimens). All photos are named after their NHMD number. The photos are not attached but can be supplied in high and low resolution. Best regards, Anders (52 82 69 82)

flatten_class_butterflies_sorted.xlsx reconciled_butterflies.xlsx

FedorSteeman commented 3 years ago

Anders Illum and I have gone through the numbers and decided that the numbers that were taken by other projects (e.g. 282 for the antlion registration) will be reassigned to new numbers. The catalogue numbers between 222901 and 223290 will from now on run from 308334 to 308800, and we need to correct (or add) these in the spreadsheet, which will be attached here.
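A lookup-based renumbering pass over the spreadsheet could look like the following minimal pandas sketch. The file and column names are assumptions, and the old-to-new mapping must come from the curators (the two quoted ranges differ in length, so it is not a plain offset):

```python
# Minimal sketch of remapping clashing catalogue numbers in the spreadsheet.
# Assumptions: openpyxl is installed, the sheet has a column named "NHMD_number",
# and the curators supply the authoritative old -> new mapping as a CSV.
import pandas as pd

df = pd.read_excel("flatten_class_butterflies_sorted.xlsx")

# Hypothetical two-column mapping file: old_number,new_number
mapping = pd.read_csv("nhmd_renumbering.csv")
old_to_new = dict(zip(mapping["old_number"], mapping["new_number"]))

# Replace only the clashing numbers (222901-223290); everything else stays put.
df["NHMD_number"] = df["NHMD_number"].map(lambda n: old_to_new.get(n, n))

df.to_excel("flatten_class_butterflies_sorted_new_NHMD_numbers.xlsx", index=False)
```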

Andersillum commented 3 years ago

New NHMD numbers have now been added to the Excel file for the specimens that had overlaps. flatten_class_butterflies_sorted_new_NHMD_numbers.xlsx

adjordan74 commented 3 years ago

Dear Fedor and Anders

Thanks for the list and the update. Two thoughts:

1. Wouldn't it be better for the burnet moths to keep their original numbers, now that the QR code is embedded in the image?

2. Next, is there a batch-import function so we can import the list? It would be good to get the metadata in, after which the material can be included in GBIF and the portal.

Best regards, Anders

Andersillum commented 3 years ago

Hi Anders,

The problem is that the other animals are already entered in Specify, so it would be a big task to give those animals new numbers. It is considerably easier to correct the numbers in Excel before import than to correct the already entered animals individually in Specify. I have already removed the old NHMD numbers and assigned new numbers to the first 100 butterflies. We can also take new photos of the labels for the butterflies in question, possibly with the Dassco equipment once it arrives.

Best regards, Anders

adjordan74 commented 3 years ago

Hi Anders, thanks for the reply. I suspected there was more hassle behind it, but I just wanted to check. Yes, new photos can be taken once Dassco arrives. Great that you have given them new numbers. Best regards, Anders

Andersillum commented 3 years ago

What is the plan for cleaning up the spreadsheet? I can see that most specimens have 3 transcriptions and others have 15. How do we reduce them to one transcription per specimen?

adjordan74 commented 3 years ago

Hi Anders. I have a summary file in HTML format (one transcription per specimen) that I will see if we can export to CSV. I will look into what would make this exercise as painless as possible.
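If the HTML summary route fails, a majority-vote reduction is one possible fallback. A minimal pandas sketch, assuming the sheet has an NHMD-number column to group on; real Zooniverse reconciliation is more sophisticated than this:

```python
# Sketch of collapsing multiple Zooniverse transcriptions into one row per
# specimen by majority vote per field. The "NHMD_number" grouping key is an
# assumption about the spreadsheet layout.
import pandas as pd

df = pd.read_excel("flatten_class_butterflies_sorted.xlsx")

def modal_value(series: pd.Series):
    """Most frequent non-empty value; empty string if nothing was transcribed."""
    cleaned = series.dropna().astype(str).str.strip()
    cleaned = cleaned[cleaned != ""]
    return cleaned.mode().iloc[0] if not cleaned.empty else ""

reconciled = df.groupby("NHMD_number").agg(modal_value).reset_index()
reconciled.to_csv("reconciled_one_row_per_specimen.csv", index=False)
```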

FedorSteeman commented 3 years ago

I would very much like to use this material as a test case for bulk-importing images and manually linking them to records. I will run some experiments in the sandbox database first, starting with the first batch without overlaps (219401-222900). I can see that these already exist in the database as largely empty records, so I am considering the following approach:

  1. Prepare an import file from the spreadsheet
  2. Delete the pre-existing empty records so they do not block the import
  3. Import using the WorkBench
  4. Ensure that thumbnails exist or are produced for every single image
  5. Place all images and thumbnails on the media server in their respective folders
  6. Run a manual database update linking the image paths to the newly imported records (a preparatory sketch follows after this comment)
  7. Check in Specify whether the media now show up for the records in question

I need to get hold of the media. @adjordan74, have you placed them on an I: drive, or can you make sure this happens? I am happy to help, since the file sizes probably make this a bit challenging.
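For step 6, the pairing of catalogue numbers and image files can be prepared up front. A minimal sketch, assuming the files are named after their NHMD numbers (as noted further down in this thread); the media path is a placeholder, and the actual update against Specify's database schema would still need to be written and verified separately:

```python
# Sketch: build a catalogue-number -> file-name table to drive the manual
# database update in step 6. Assumes image files carry their NHMD number
# (e.g. "NHMD-219401.tif"); the folder path is a placeholder.
import csv
import re
from pathlib import Path

MEDIA_DIR = Path("/mnt/media/butterflies")   # placeholder path
CATNO = re.compile(r"NHMD-(\d+)")

rows = []
for path in sorted(MEDIA_DIR.glob("*.tif")):
    match = CATNO.search(path.name)
    if match:
        rows.append({"catalog_number": match.group(1), "file": path.name})

with open("image_links.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["catalog_number", "file"])
    writer.writeheader()
    writer.writerows(rows)
```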

adjordan74 commented 3 years ago

Hi Fedor, that sounds good. The only challenge is streamlining the data sheet, and the question is whether that can be scripted or has to be done by hand.

Media files: all media files are on the I: drive under Samling-Image-Repro (in the folder “Sertifer”). There they are named after their NHMD number (so most of them match).

Best regards, Anders

FedorSteeman commented 3 years ago

Hi Anders,

I will figure something out.

Found the files! They appear to be thumbnails already, so I will probably write a script that copies the files into their respective folders.

By the way: if everyone who made a determination is to be recorded in the database, I will need a way to link their often whimsical "usernames" (see the attached spreadsheet) to real names and import these as agents first. If that is not necessary, I would rather skip it, as it complicates things quite a bit.

Determiners.xlsx

Andersillum commented 3 years ago

Hi Fedor,

Are you aware that Specify has a perfectly good Attachment tool for uploading large quantities of files? They have made a nice video describing the process here: https://vimeo.com/125299185 (batch upload starts around 14:30).

Is there any need to upload thumbnails? Specify generates thumbnails itself, and won't the portal do so as well?

Best regards, Anders

FedorSteeman commented 3 years ago

No, I was not! Yet another corner of Specify I had not gotten around to... It looks promising, not least with regard to other tickets, namely #74, #8 and #54.

Thanks! I'll take a look!

Andersillum commented 3 years ago

Hi Anders,

I have now been through all the butterflies that needed new numbers. The last butterfly in the box was not in the spreadsheet, but it had been given an NHMD number, NHMD-223233. It is an Adscita statices (Grøn køllesværmer, the Forester). Can you check whether photos were taken of it? I have given it a new NHMD number, 308666.

Otherwise everything seems to add up :-)

Best regards, Anders I

adjordan74 commented 3 years ago

Here it comes. I am currently looking into how OpenRefine can be used to prepare datasets. If you know that program, our data can probably be prepared and exported to GBIF most quickly.

Best regards, Anders

FedorSteeman commented 3 years ago

@Andersillum I can see that I will not get away entirely without scripting in order to prepare the files for import via Specify. They have to be moved to the same folder and renamed so they do not overwrite one another. But I will figure it out! 😁

FedorSteeman commented 3 years ago

[@adjordan74] Please be aware that I will run a script renaming every file in the "Sertifer" folder to include the catalogue number, in order to ease the import. That is, files will be renamed from "Image001.tif" to "NHMD-219401-Image001.tif" and so on.

That should not break anything, but I assume we can always fetch them again from the source for now?

[@adjordan74] Also, repeating my earlier question: if everyone who made a determination is to be recorded in the database, I will need a way to link their often whimsical "usernames" (see the attached spreadsheet) to real names and import these as agents first. If that is not necessary, I would rather skip it, as it complicates things quite a bit. Determiners.xlsx
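A rename pass along those lines might look like the following minimal sketch. The folder path is a placeholder, and the CSV pairing file names with catalogue numbers is a hypothetical input, since the thread does not say how files are matched to numbers:

```python
# Sketch of the rename pass described above: prefix each file with its
# catalogue number so names stay unique in a single import folder.
# The mapping CSV (filename -> catalogue number) is a hypothetical input.
import csv
from pathlib import Path

SERTIFER = Path(r"I:\Samling-Image-Repro\Sertifer")  # placeholder path

with open("file_to_catno.csv", newline="") as fh:
    for row in csv.DictReader(fh):  # columns: filename, catalog_number
        src = SERTIFER / row["filename"]
        dst = SERTIFER / f"NHMD-{row['catalog_number']}-{row['filename']}"
        if src.exists() and not dst.exists():
            src.rename(dst)  # skips if the target already exists
```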

FedorSteeman commented 3 years ago

The images are now ready for import, but we still need to clean the spreadsheet with the metadata. As agreed with @adjordan74, I will assign @Sosannah to this task so she can start working on it as soon as she can.

Sosannah commented 3 years ago

Status report (TL;DR): at this point I have finished several rounds of data cleaning and tried a test upload to Specify.

Zooniverse_butterflies_200521.xlsx

There is still a lot of dirt in the Collectors and Localities columns, and their cleanup cannot be automated very far, so it is a trade-off between working hours and data purity. I have spent a lot of time on it, but it is still far from perfect. OpenRefine worked best for me on those columns, but it remains a tedious process.
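For reference, the semi-automatable part of such cleanup boils down to normalisation rules plus a manually curated corrections table (much like OpenRefine's clustering). A sketch in pandas, where the column names and example corrections are assumptions, not the real data:

```python
# Sketch of rule-based normalisation for the Collectors and Localities
# columns. Column names and the corrections table are illustrative only.
import pandas as pd

df = pd.read_excel("Zooniverse_butterflies_200521.xlsx")

# Hypothetical curated corrections, e.g. built from OpenRefine clusters.
corrections = {"Localities": {"Kobenhavn": "København"},
               "Collectors": {"j. petersen": "J. Petersen"}}

for col, fixes in corrections.items():
    df[col] = (df[col].astype(str)
                      .str.strip()
                      .str.replace(r"\s+", " ", regex=True)  # collapse whitespace
                      .replace(fixes))

df.to_excel("Zooniverse_butterflies_cleaned.xlsx", index=False)
```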

At a meeting today with @Andersillum, @TinaThuno and Jan Pedersen, we found further issues with the dataset: some of the entries have several dates on the labels (date of collecting and date of cataloguing), which has led to misunderstandings and errors in data entry.

(Attached images: NHMD-219405-Preview002, NHMD-219405-Preview004, incorrect_date)

Due to the issues presented above, we (Anders, Jan, Tina and myself) decided to go thoroughly through the first 100 entries (also checking the images) to figure out how many errors we can find and how much time it takes to check and correct them.

However, since almost 3500 objects are already in Specify (created by Martin Stein) with minimal information (mainly the species name), we have three options in my opinion:

  1. Remove the old records with almost no information, and upload the cleaned (but still dirty) dataset and the images now.

  2. Clean the dataset thoroughly and postpone the image upload until that is done.

  3. Keep the minimal data and upload the images now; then, once the cleaning is done, replace the old dataset with the new one.

We can postpone the decision until after the check of the first 100 entries.

What is your opinion?

Best regards, Zsuzsanna

adjordan74 commented 3 years ago

Dear Zsuzsanna and Anders

Thanks for this status update and the good work.

Would there be a fourth option (upload the images and the dataset, aware of its issues)? This would allow us to test the material while still populating the prototype with the external contractor.

I am aware that we may still need to curate or adjust the data.

@FedorSteeman, please weigh in on whether this is too much hassle or a way forward.

Thanks

Anders


Andersillum commented 3 years ago

Hi @adjordan74,

We already have more than enough mess in the entomological database that needs to be cleaned up. We should not add more! If you want the external contractors to play with the data and images, please upload them to a sandbox, not to the entomological database before the data has been cleaned. Experience has shown us that if we upload to Specify first, it will never be cleaned.

If we know that the data is messy, even extremely faulty and therefore unusable, we should not upload it before it has been cleaned.

I know it is a cool citizen-science project, but that should not be a reason to import faulty and extremely messy data into Specify!

Mvh. Anders

FedorSteeman commented 3 years ago

I have seen the dataset, talked with Zsuzsanna and @Andersillum, and generally agree: the data is very dirty, and I trust @Andersillum's assessment that it is too dirty for import.

Still, it would be nice to have at least some of those cool moth pictures in the database. Option 3 suggested by @Sosannah (batch-uploading the images to the existing minimal-information records) could be considered.

We could also try to pick the cleanest records from the dataset and upload those.

Or we could put our heads together and manually clean as many as possible in an hour-long session.

TinaThuno commented 3 years ago

Dear all,

I agree with Anders. It makes no sense to upload data that is this deficient, despite Zsuzsanna's extensive work on the dataset. The localities in particular will be a very big problem. When a locality is listed as "Jens Har Lilkebong", it means nothing to anyone, and there are unfortunately a lot of these "alternative" localities.

Perhaps the best alternative is to select a smaller number of specimens from the dataset as Fedor suggests, but as it is now, these will still need work before publishing.

adjordan74 commented 3 years ago

Hi @FedorSteeman @TinaThuno @Andersillum @Sosannah

Agreed with the above, and I think option 1), taking out a subset, could be a way forward.

Could we try to circle a smaller subset and clean/evaluate the geo-locations and dates? I am aware that some labels did not contain day/month/year (now listed as '00.00.9999').

With this at hand we can extrapolate and get a better feel for the remaining work.

Shall we have a short Zoom call to: 1) settle on a test number, and 2) discuss a consistent way to demarcate dates and geo-references when they are either absent or not meaningful? (One possible convention for the placeholder dates is sketched after this comment.)

Best Anders
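One possible convention for the placeholder dates, sketched here as a suggestion rather than an agreed rule: keep the label text verbatim, and only fill the interpreted date when it parses cleanly, so '00.00.9999' never masquerades as a real date. The DD.MM.YYYY layout is assumed from the placeholder format:

```python
# Sketch of one way to demarcate missing dates: keep the raw label text in a
# verbatim column and only fill the interpreted date when it parses cleanly.
from datetime import date

def interpret_label_date(raw: str):
    """Return (verbatim, ISO date or None) for a DD.MM.YYYY label date."""
    verbatim = raw.strip()
    try:
        day, month, year = (int(part) for part in verbatim.split("."))
        return verbatim, date(year, month, day).isoformat()
    except ValueError:  # placeholders like '00.00.9999' fail here
        return verbatim, None

print(interpret_label_date("12.06.1937"))   # ('12.06.1937', '1937-06-12')
print(interpret_label_date("00.00.9999"))   # ('00.00.9999', None)
```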

Sosannah commented 3 years ago

Hi @FedorSteeman @TinaThuno @Andersillum @adjordan74 ,

Yes, we can have a short Zoom meeting. @adjordan74, would you arrange one?

Anyway, we have already suggested taking the first 100 entries and checking them (as a pilot project). :)

Best, Zs.

adjordan74 commented 3 years ago

Hi Zs.

I can do:

1) 12:30 today

2) 09:00 (Tuesday, 25th)

3) 10:00 (Wednesday, 26th).

Cheers

Sosannah commented 3 years ago

For me: 1) 12:30 today or 3) 10:00 (Wednesday, 26th).

TinaThuno commented 3 years ago

I can’t today. Wednesday is fine.

Sosannah commented 3 years ago

The first 100 entries, along with their images, are now in Specify.

adjordan74 commented 3 years ago

GREAT!

thanks Anders

FedorSteeman commented 3 years ago

I will initiate an export update and subsequent ingestion on GBIF UAT to see if they come through.

FedorSteeman commented 3 years ago

Looks like they came out pretty quickly: https://www.gbif-uat.org/occurrence/gallery?q=statices&dataset_key=0879ca9b-0234-4441-84fb-645b8c7a448e

FedorSteeman commented 3 years ago

If all is done with this one, I suggest we move it to the backlog and revisit it later, once we have cleaned the rest of the metadata.

Sosannah commented 3 years ago

We can do that. The next batch is coming in the next few weeks: 50 specimens from each species in the database.

FedorSteeman commented 3 years ago

Oh, OK. If more batches of different species are still coming in, just keep it in progress.

Sosannah commented 3 years ago

The next batch (50 specimens from each species in the database) has been uploaded to Specify/GBIF/Open Collection. I am moving this case to the backlog; we can reopen it after the rest of the metadata has been cleaned.

adjordan74 commented 3 years ago

Hi Zsuzsanna

Great, thanks for the note. Best Anders

Sosannah commented 5 months ago

@TinaThuno has cleaned a huge part of the metadata - with the help of Jan and Anders. Zooniverse_butterflies_TT_JP_AI_111023_Finale.xlsx

Sosannah commented 5 months ago

After a meeting with @Andersillum and Jan, the following issues/tasks were revealed: