AMI-system / gbif-species-trainer-AMI-fork

Code for training a fine-grained species classification model using data from GBIF
MIT License

Access the Darwin Core Archive file for UKSI #1

Closed. LevanBokeria closed this issue 1 year ago

LevanBokeria commented 1 year ago

The Rolnick lab recently updated the codebase, as reflected in the most recent commits to their repo, which is also forked here.

In an attempt to replicate their model, I was trying to download the data following their instructions, and step 2 requires a Darwin Core Archive (DwC-A) file exported from GBIF.

In another issue (#4) I found the website for UKSI, but if I click "Download" and then choose the Darwin Core Archive file, the website tells me I don't have permission to access that database.

UKSI website:

[screenshot: the UKSI dataset page on GBIF]

Forbidden access:

[screenshot: the access-denied message]

@DavidRoy is this something you might have access to?

If the DwC-A file is just not available for UKSI, then we can always revert to the old way of downloading GBIF data, which Kat has already worked on. But it would be good to have the DwC-A file so we can replicate the full Rolnick lab pipeline.

DavidRoy commented 1 year ago

@lbokeria UKSI is the checklist, i.e. the list of species we want to include in the model. The GBIF Darwin Core archive is the download of the images for that list of species. There is no geographic filter for the images, as the images should be useful wherever they come from. Does that make sense?

LevanBokeria commented 1 year ago

Thanks David! Hmm, just to clarify a few things:

The current instructions for downloading data from the Rolnick lab state that the download scripts need a "species checklist" file and, separately, a DwC-A file exported from GBIF. See the screenshot below:

[screenshot: the Rolnick lab data-download instructions]

It's clear how to get the species checklist: pass the uksi-macro-moths.csv file to the 01-fetch_taxon_keys.py script, which then produces an appropriately formatted species checklist with the necessary info.
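
(For anyone else following along, my rough mental model of that step, sketched below, is a plain GBIF backbone lookup per species name via pygbif. This is just an illustration, not the actual script: the CSV column name and output filename are made up, and 01-fetch_taxon_keys.py remains the source of truth.)

```python
import pandas as pd
from pygbif import species  # pip install pygbif

# Hypothetical column and file names; the real script defines its own arguments.
checklist = pd.read_csv("uksi-macro-moths.csv")

records = []
for name in checklist["taxon_name"]:  # adjust to the actual column in the CSV
    match = species.name_backbone(name=name, rank="species")
    records.append({
        "search_name": name,
        "accepted_taxon_key": match.get("usageKey"),
        "gbif_name": match.get("scientificName"),
        "match_type": match.get("matchType"),
    })

pd.DataFrame(records).to_csv("species-checklist-with-keys.csv", index=False)
```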

Regarding the DwC-A file, from your reply I gather that it contains all of the image data as well as the associated core and extension files (as far as I understand how such archives are organized). The Rolnick lab code only needs the multimedia.txt and occurrence.txt files from this archive, not the full archive. See the screenshot from the code below:

[screenshot: the relevant part of the download code]
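
(Side note: once a full archive is in hand, those two files can be pulled out without unpacking everything else, e.g. with Python's standard zipfile module as sketched below; the archive filename and output directory are hypothetical.)

```python
import zipfile

# Hypothetical paths; point these at the downloaded DwC-A and a scratch folder.
DWCA_PATH = "gbif-lepidoptera-dwca.zip"
OUT_DIR = "dwca_members"

with zipfile.ZipFile(DWCA_PATH) as archive:
    print(archive.namelist())  # the archive also holds meta.xml, verbatim.txt, etc.
    # Extract only the two members the download script actually reads.
    for member in ("occurrence.txt", "multimedia.txt"):
        archive.extract(member, path=OUT_DIR)
```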

I thought I would get those files by going to the UKSI webpage and trying to download the DwC-A file, but, as I mentioned above, I don't have access.

I hope I've clarified where I'm stuck. Sorry if it's still not clear, or if I'm misunderstanding something about these files; I'm still trying to learn how data is organized and distributed in this field! :)

Thanks a lot!

DavidRoy commented 1 year ago

@lbokeria My understanding is that you run this code in step 2, giving it the uksi-macro-moths.csv file, which defines the species of interest. Then you specify the path and filename for the result files, i.e. the code generates these:

    python 02-fetch_gbif_moth_data.py \
        --write_directory / \
        --dwca_file /[filename].zip \
        --species_checklist /uksi-macro-moths.csv \
        --max_images_per_species 500 \
        --resume_session True

DavidRoy commented 1 year ago

@albags might be able to help

LevanBokeria commented 1 year ago

Unfortunately, the dwca_file must already exist; it's a required input for the code, which will not create one. Happy to chat more during the meeting today.

DavidRoy commented 1 year ago

I've had a quick look at the code and it's not clear to me where the dwca_file comes from. Perhaps a manual download from GBIF somehow. Let's ask later unless @albags knows.

DavidRoy commented 1 year ago

@LevanBokeria I understand this better now. We need to download the Lepidoptera occurrences ourselves via GBIF (or use the download from Aditya). The approach is:

The file is ~30 GB! Do you want me to download it and share it with you?
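
(For reference, the same DwC-A can also be requested programmatically through GBIF's occurrence download API rather than through the portal. The sketch below is an untested outline: it needs a free GBIF account, and the Lepidoptera taxon key (797 in the GBIF backbone), the placeholder credentials and the e-mail address should all be double-checked before running it.)

```python
import requests

GBIF_USER = "your-gbif-username"      # placeholder
GBIF_PASSWORD = "your-gbif-password"  # placeholder

payload = {
    "creator": GBIF_USER,
    "notificationAddresses": ["you@example.org"],  # placeholder address
    "sendNotification": True,
    "format": "DWCA",
    # All occurrences under Lepidoptera (backbone key 797); verify before use.
    "predicate": {"type": "equals", "key": "TAXON_KEY", "value": "797"},
}

response = requests.post(
    "https://api.gbif.org/v1/occurrence/download/request",
    json=payload,
    auth=(GBIF_USER, GBIF_PASSWORD),
    timeout=60,
)
response.raise_for_status()
download_key = response.text.strip()
print(f"Download queued: https://www.gbif.org/occurrence/download/{download_key}")
```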

LevanBokeria commented 1 year ago

Thanks David! Downloading this now. I will post if I run into any issues.

LevanBokeria commented 1 year ago

I have finally downloaded the full dataset, so will close this issue.

However, since the DwC-A file for Lepidoptera is huge (~30 GB zipped), the scripts take a really long time to open it. I wonder if it's possible to make this process more efficient. The 02-fetch_gbif_moth_data.py script only seems to need the multimedia.txt and occurrence.txt files from the archive, which are presumably much smaller. This might be an issue worth opening on the Rolnick lab repo to discuss with them.
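
(A possible workaround until this is raised upstream would be to stream occurrence.txt straight out of the zip in chunks rather than loading the whole table at once. A rough, untested sketch with pandas; the archive filename, column subset and chunk size below are illustrative.)

```python
import zipfile
import pandas as pd

DWCA_PATH = "gbif-lepidoptera-dwca.zip"          # hypothetical filename
WANTED_COLS = ["gbifID", "species", "taxonKey"]  # illustrative column subset

with zipfile.ZipFile(DWCA_PATH) as archive:
    with archive.open("occurrence.txt") as handle:
        # Read the tab-separated occurrence table in chunks rather than all at once.
        for chunk in pd.read_csv(handle, sep="\t", usecols=WANTED_COLS,
                                 chunksize=500_000, on_bad_lines="skip"):
            ...  # filter each chunk against the species checklist here
```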

DavidRoy commented 1 year ago

It would be good to understand how the occurrence data is used in model training.