AMI-system / gbif-species-trainer-AMI-fork

Code for training a fine-grained species classification model using data from GBIF
MIT License

Download UK moths images using updated Rolnick codebase #9

Closed LevanBokeria closed 1 year ago

LevanBokeria commented 1 year ago

List provided in: AMI-trap/on_device_classifier#4

Using the updated code to download data with the DwCA files from GBIF.

The code provided by the Rolnick lab needed quite a few changes, which will be documented.

LevanBokeria commented 1 year ago

I have finished downloading up to 1000 images per species for the UKSI macro moth list provided by David in AMI-trap/on_device_classifier#4.

4 species come up as unavailable through the GBIF API, which is related to the problem Katriona faced in the same issue. This will be fixed in the next iteration of the download code.

So far, of the 999 macro moth species provided, 4 are unavailable through the GBIF API and 5 are synonyms of other species in the same list, leaving 990 species.

Of these 990, 456 have all 1000 images, while 287 have 0 images. The remaining 247 species are somewhere in between:

Image

Attached is a CSV file with the 990 species and their image counts: species_image_count.csv

Images are located in /bask/projects/v/vjgo8416-amber/data/gbif-species-trainer-AMI-fork/gbif_images/

Outstanding issues and topics:

LevanBokeria commented 1 year ago

@DavidRoy I am downloading images for all UK moths, not just the macro ones.

I have taken the list you provided originally here, and matched the names to the GBIF backbone.

I am sharing the resulting CSV file as a google sheet. I wanted to double check two things:

This google sheet is slightly different from what Aditya and Fagner produce with their code. I have additionally included columns for reference:

My code first searches for a match on GBIF using the "species_name_provided" column. If no match is found, it then combines the species name with the authority name, and searches using that combined string. That is why sometimes the column "search_species_name" has the authority and sometimes it does not.
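For anyone who wants to reproduce the two-step search described above, here is a minimal sketch. It is not the actual download code: the function names `resolve_search_name` and `gbif_lookup` are hypothetical, and the use of the public GBIF `species/match` endpoint (treating `matchType == "NONE"` as "no match") is an assumption about how the lookup could be done.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public GBIF backbone matching endpoint is used for lookups.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def gbif_lookup(name):
    """Query the GBIF backbone; return the match record, or None if GBIF reports no match."""
    url = f"{GBIF_MATCH_URL}?{urllib.parse.urlencode({'name': name})}"
    with urllib.request.urlopen(url, timeout=30) as response:
        record = json.load(response)
    return None if record.get("matchType") == "NONE" else record


def resolve_search_name(species_name, authority, lookup=gbif_lookup):
    """Return (search_species_name, match_record_or_None).

    First try the name as provided ("species_name_provided"). Only if that
    fails, retry with the authority appended, which is why the resulting
    "search_species_name" sometimes contains the authority.
    """
    match = lookup(species_name)
    if match is not None:
        return species_name, match
    combined = f"{species_name} {authority}".strip()
    return combined, lookup(combined)
```

The `lookup` argument is injectable so the fallback logic can be checked offline with a dictionary, e.g. `resolve_search_name("Aus bus", "Smith, 1900", lookup=my_dict.get)`.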

I will post similar google sheets for Singapore and Costa Rica.

LevanBokeria commented 1 year ago

The download for the UKSI checklist is complete. The file data_stats_uksi-moths-keys.csv in the Google Drive folder lists the species, with the last column showing how many images we have for each. I requested 1000 images per species, but GBIF does not have that many for a lot of them.

As a brief summary, out of 2690 species, 565 have 1000 images and 126 have 0 images. The rest are in between. The attached histogram (binned at 20) gives an idea of the distribution of image counts. Lots of species have between 0-20 images and between 980-1000 images, as shown by the height of those bars.

Image
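The summary numbers and the histogram can be recomputed from the per-species image counts with something like the sketch below. It assumes the counts have already been read from the last column of the CSV into a list of integers. The function name `summarise_counts` is hypothetical, and folding a count of exactly 1000 into the final 980-1000 bin is my assumption about how the plot was binned.

```python
from collections import Counter


def summarise_counts(counts, target=1000, bin_width=20):
    """Summarise per-species image counts.

    counts    -- list of image counts, one per species
    target    -- number of images requested per species
    bin_width -- histogram bin width
    """
    n_bins = target // bin_width
    # Clamp so that a count equal to `target` falls into the last bin
    # (980-1000) instead of opening a bin of its own.
    bins = Counter(min(c // bin_width, n_bins - 1) for c in counts)
    full = sum(1 for c in counts if c >= target)
    empty = sum(1 for c in counts if c == 0)
    return {
        "total": len(counts),
        "full": full,
        "empty": empty,
        "in_between": len(counts) - full - empty,
        "histogram": [bins.get(i, 0) for i in range(n_bins)],
    }
```

With the UKSI counts, `total`, `full` and `empty` should correspond to the 2690, 565 and 126 reported above, and the first and last entries of `histogram` to the two tall bars.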