To document progress made on this front over the last months:
With the big lepidoptera.zip file, the process used to get killed because the dwca reader package tried to unzip the archive contents into a temporary directory that was capped at 100GB, while the extracted contents exceeded that limit.
My initial solution was to pass a custom temporary folder into which the dwca file contents were extracted. But this took a lot of time and wrote 200GB+ worth of files to disk on every run.
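A minimal sketch of that first workaround, assuming the `tmp_dir` argument of python-dwca-reader's `DwCAReader` (check the signature of your installed version) and a placeholder scratch path:

```python
from dwca.read import DwCAReader

# Sketch of the initial workaround: give the reader a scratch directory with
# enough free space instead of the default (size-limited) temp location.
# "/scratch/dwca_tmp" is a placeholder path.
with DwCAReader("lepidoptera.zip", tmp_dir="/scratch/dwca_tmp") as dwca:
    for row in dwca:      # iterate occurrence (core) rows
        pass              # ...download/processing logic goes here
```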
I later discovered that the dwca reader package can also read already-extracted dwca archives. So the solution is to manually extract the lepidoptera.zip file once into a folder in our project directory (make sure you have 200GB+ of free space), and then point the dwca reader package at that directory. This avoids having to re-extract the .zip file every time you use it to download images.
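A sketch of that approach, with placeholder paths (extract once, then hand the extracted folder to the reader on every subsequent run):

```python
import zipfile
from dwca.read import DwCAReader

EXTRACTED_DIR = "data/lepidoptera_dwca"   # placeholder project path

# One-off step (needs 200GB+ free space); skip it after the first run.
with zipfile.ZipFile("lepidoptera.zip") as zf:
    zf.extractall(EXTRACTED_DIR)

# python-dwca-reader also accepts a path to an already-extracted archive,
# so later runs avoid the unzip step entirely.
with DwCAReader(EXTRACTED_DIR) as dwca:
    for row in dwca:      # iterate occurrence (core) rows
        pass              # ...image download logic goes here
```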
Other facets of the solution to the memory issue (documented in the README of the gbif_download_standalone repo) have involved:
This issue was raised on the parent repo here - https://github.com/RolnickLab/gbif-species-trainer/issues/2
Documenting it here for the AMI fork repository, and expanding with the latest updates:
Problem description:
The GBIF dwca file for Lepidoptera is huge, about 30GB zipped. The script to download GBIF images using dwca files, 02-fetch_gbif_moth_data.py, runs out of memory and is automatically killed by slurm. I tried running a smaller download for just one family of moths, Erebidae, which has a zipped dwca file of 1.5GB, but the process still exploded in memory use and was killed.
The cause of the problem:
I investigated the 02-fetch_gbif_moth_data.py code, and the likely reason for such high RAM use is the parallelisation of processes that happens on lines 187-188:
This happens after the dwca file is loaded and the occurrence data is read, which creates a very large dataframe in global memory. Parallelisation appears to duplicate the global variables, so each sub-process gets its own independent instance of Python. This means the occurrence dataframe is duplicated for each taxon_key, which likely results in the astronomical memory usage.
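A generic illustration of the pattern (not the actual 02-fetch_gbif_moth_data.py code; the dataframe, column name, and worker function are made up):

```python
import multiprocessing as mp
import pandas as pd

# Large table loaded at module level: with the "spawn" start method every
# worker re-imports this module and rebuilds the dataframe; with "fork",
# copy-on-write pages get duplicated as soon as workers start touching it.
occurrence_df = pd.read_csv("occurrence.txt", sep="\t", on_bad_lines="skip")

def download_for_taxon(taxon_key):
    # Each worker effectively operates on its own copy of occurrence_df.
    subset = occurrence_df[occurrence_df["taxonKey"] == taxon_key]
    return len(subset)  # placeholder for the actual image-download logic

if __name__ == "__main__":
    taxon_keys = occurrence_df["taxonKey"].unique()
    with mp.Pool(processes=8) as pool:
        pool.map(download_for_taxon, taxon_keys)
```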
Additionally, the dwca files that are read into memory using the dwca-reader package take up a very large amount of space. The main culprit is the occurrence.txt file: when read into memory, these dataframes consume an unexpectedly large amount of RAM. For example, a zipped dwca file of ~200MB contains a ~1GB occurrence.txt file, which takes ~3.5GB of RAM once loaded.
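A quick, hypothetical way to confirm the blow-up on any of the occurrence files (pandas object/string columns routinely inflate a text file several times over):

```python
import pandas as pd

occurrence_df = pd.read_csv("occurrence.txt", sep="\t", on_bad_lines="skip")
# deep=True counts the actual Python string objects, not just pointer sizes
print(occurrence_df.memory_usage(deep=True).sum() / 1e9, "GB in RAM")
```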
Proposed solution:
I have turned off parallelisation in the download code, so the images for each species are downloaded serially. This takes a lot more time. I have downloaded separate dwca files for the 16 moth families provided by David Roy in the initial moths checklist, and I have written a wrapper Python file which takes the family-specific dwca file as an argument and calls the functions from the 02-fetch_gbif_moth_data.py file. I will make a new branch and open a PR to document these changes.
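A rough sketch of what such a wrapper can look like (the entry-point function called at the end is hypothetical; substitute whatever 02-fetch_gbif_moth_data.py actually exposes):

```python
import argparse
import importlib.util

def load_fetch_module(path="02-fetch_gbif_moth_data.py"):
    # The script's filename is not a valid module name, so load it by path.
    spec = importlib.util.spec_from_file_location("fetch_gbif_moth_data", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dwca-file", required=True,
                        help="family-specific dwca file, e.g. erebidae.zip")
    args = parser.parse_args()

    fetch = load_fetch_module()
    # Hypothetical entry point: download images for this family serially.
    fetch.fetch_data(args.dwca_file)
```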