Download code handover with Katriona

Broad summary:

The codebase is an adaptation of Rolnick Lab's Species Trainer repo.

Brief summary of changes:

Clarified instructions on obtaining dwca files.
Improved species list matching to the GBIF backbone, reducing the number of times species are not found.
Adapted and developed two ways of downloading the GBIF images: (1) using the whole dwca file and (2) preprocessing and splitting the dwca occurrence file. The second approach sidesteps the high RAM usage of loading the whole file, which makes code development and debugging manageable.
Improved metadata logging for image downloads, allowing faster re-download of images and incorporation of other variables such as corrupted or thumbnail images. The download code uses this metadata to skip already downloaded entries, broken URLs, and corrupted images, which makes repeated runs faster.
Improved the creation of data_statistics.csv files.
Downloaded images for UK and Singapore.
Discovered and documented potential taxonomic issue with Costa Rica species.
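
The metadata-based skipping mentioned in the list above could look roughly like the sketch below. It is an illustration only, not the repository's code: the metadata layout (a JSON object keyed by occurrence ID, each entry carrying a "status" field), the status values, and the names load_metadata and needs_download are all assumptions.

```python
import json
from pathlib import Path

# Hypothetical status values; the real metadata fields may differ.
SKIP_STATUSES = {"downloaded", "broken_url", "corrupted"}


def load_metadata(path):
    """Return the per-occurrence metadata dict, or an empty dict if no file exists yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def needs_download(occurrence_id, metadata):
    """True unless the metadata already records a final outcome for this occurrence."""
    entry = metadata.get(str(occurrence_id))
    return entry is None or entry.get("status") not in SKIP_STATUSES
```

With a check like this, the download loop only issues a request when needs_download returns True, so a re-run touches only occurrences that have no recorded outcome.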

Known issues:

Costa Rica list:

Some species might have unclear taxonomy on GBIF. For example, "Nepheloleuca politia" and "Nepheloleuca illiturata" appear to be two different species, with different "scientificName" values. However, in the occurrence dataframes, the "species" field for both is "Nepheloleuca politia".

Importantly, our download code (based on Rolnick Lab's) uses the "species" field to save images in the appropriate folder hierarchy. Because of this, images from these two distinct species end up in the same folder, and the meta_data.json file no longer reflects reality.
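
To see how widespread this is, one option is to scan the occurrence records for "species" values that map to more than one "scientificName". The sketch below assumes the records are available as dictionaries with those two keys (for example, rows read from the occurrence file); the function name find_species_collisions is hypothetical. Anything it flags is only a candidate for manual review, since differing names may have a legitimate taxonomic explanation.

```python
from collections import defaultdict


def find_species_collisions(records):
    """Map each "species" value to its scientificName values when there is more than one."""
    names_by_species = defaultdict(set)
    for row in records:
        names_by_species[row["species"]].add(row["scientificName"])
    # Keep only species whose images would be merged into one folder.
    return {sp: sorted(names) for sp, names in names_by_species.items() if len(names) > 1}
```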

Solutions:

Perhaps the GBIF data is wrong and these should be the same species, in which case their "scientificName" values should match. Otherwise they are different species, but then it's not clear why their "species" columns are the same.
Our download code should change. Images should be saved in folders created based on the "scientificName" variable and not the "species" variable. Alternatively, save everything in folders named after the unique taxonomic key of the species, the "acceptedTaxonKey" variable in our scripts.
Perhaps the whole approach of saving images under family/genus/species is wrong. Each species should have its own folder named after its taxonomic ID, and all these folders should sit together in one parent folder. Metadata files should then specify, for each image in each folder, which family and genus it belongs to.
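
The second and third options above could be sketched as follows: a flat layout with one folder per "acceptedTaxonKey", and the taxonomy recorded in the metadata instead of the folder names. The function names, the ".jpg" default, and the metadata keys are assumptions made for illustration, not the current code.

```python
from pathlib import Path


def image_path(root, accepted_taxon_key, occurrence_id, ext=".jpg"):
    """Flat layout: <root>/<acceptedTaxonKey>/<occurrence_id><ext>."""
    return Path(root) / str(accepted_taxon_key) / f"{occurrence_id}{ext}"


def metadata_entry(family, genus, scientific_name):
    """Taxonomy is stored per image in the metadata rather than encoded in folder names."""
    return {"family": family, "genus": genus, "scientificName": scientific_name}
```

Because the folder name is a unique key, two names that share a "species" value can no longer end up in the same folder, and a later taxonomic change only requires a metadata update, not moving files.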

GBIF updates:

I discovered that for the Sesiidae species Pyropteron chrysidiformis, my old dwca file from August 2023 contains working URL links to images, but the updated dwca file from October 2023 has no working URLs. So was the data amended or deleted? How will this impact our database? How should metadata.json be updated to reflect this, or should it not reflect this at all?
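
One way to start answering these questions is to compare the image URLs listed in the two snapshots. The sketch below assumes the URLs have already been extracted from each dwca file into plain lists; the function name url_changes is hypothetical. It only compares which URLs are listed, not whether they still resolve.

```python
def url_changes(old_urls, new_urls):
    """Summarise which image URLs were removed, added, or kept between two snapshots."""
    old, new = set(old_urls), set(new_urls)
    return {
        "removed": sorted(old - new),  # present in the older file only
        "added": sorted(new - old),    # present in the newer file only
        "kept": sorted(old & new),     # present in both
    }
```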

Improvements:

pre-commits: flake8 has some exceptions that Kat might want to change. Black is not installed.
For fetch_taxon_keys.py, save any errors and exceptions in a log file.
For fetch_images_split_dwca.py and fetch_images_whole_dwca.py:
- Many of the subfunctions are the same. They could be moved into a shared module and imported. The function setup_logger is also duplicated in other scripts.
- When creating metadata for an image, also save its URL.
- Catch non-ASCII characters in image URLs #11
- Download multiple media files for each occurrence on GBIF #8
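
Several of the items above (a shared setup_logger, logging errors to a file, and handling non-ASCII characters in URLs) could live in one small helper module that the fetch scripts import. The following is a minimal sketch under assumptions: the signature of setup_logger here is not necessarily the one used in the existing scripts, and ascii_safe_url is a hypothetical helper.

```python
import logging
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding; "%" is kept so that
# already-encoded sequences are not encoded a second time.
_SAFE = "/:@!$&'()*+,;=%?"


def setup_logger(name, log_file, level=logging.INFO):
    """Return a logger that appends to log_file; safe to call more than once."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def ascii_safe_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL.

    The host name is left unchanged; internationalised host names would need
    separate handling.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

In fetch_taxon_keys.py, wrapping the lookup in a try/except and calling logger.exception(...) in the handler would write the error and its traceback to the log file while letting the loop move on to the next species.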

AMI-system / gbif_download_standalone