To document progress made on this front over the last months:
With the big lepidoptera.zip file, the process used to get killed because the dwca reader package tried to unzip the archive contents into a temporary directory that was capped at 100GB, while the extracted contents exceeded that limit.
My initial solution was to pass a custom temporary folder into which the dwca file contents were extracted. But this took a lot of time and wrote 200GB+ worth of files to disk on every run.
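A minimal sketch of that first workaround, assuming the `tmp_dir` argument of python-dwca-reader's `DwCAReader` (check the signature of your installed version) and a placeholder scratch path:

```python
from dwca.read import DwCAReader

# Sketch of the initial workaround: give the reader a scratch directory with
# enough free space instead of the default (size-limited) temp location.
# "/scratch/dwca_tmp" is a placeholder path.
with DwCAReader("lepidoptera.zip", tmp_dir="/scratch/dwca_tmp") as dwca:
    for row in dwca:      # iterate occurrence (core) rows
        pass              # ...download/processing logic goes here
```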
I later discovered that the dwca reader package can also read already-extracted dwca archives. So the solution is to manually extract the lepidoptera.zip file once into a folder in our project directory (make sure you have 200GB+ of free space), and then point the dwca reader package at that directory. This avoids having to re-extract the .zip file every time you use it to download images.
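A sketch of that approach, with placeholder paths (extract once, then hand the extracted folder to the reader on every subsequent run):

```python
import zipfile
from dwca.read import DwCAReader

EXTRACTED_DIR = "data/lepidoptera_dwca"   # placeholder project path

# One-off step (needs 200GB+ free space); skip it after the first run.
with zipfile.ZipFile("lepidoptera.zip") as zf:
    zf.extractall(EXTRACTED_DIR)

# python-dwca-reader also accepts a path to an already-extracted archive,
# so later runs avoid the unzip step entirely.
with DwCAReader(EXTRACTED_DIR) as dwca:
    for row in dwca:      # iterate occurrence (core) rows
        pass              # ...image download logic goes here
```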
Other facets of the solution to the memory issue (documented in the README of the gbif_download_standalone repo) have involved:
This issue was raised on the parent repo here - https://github.com/RolnickLab/gbif-species-trainer/issues/2
Documenting it here for the AMI fork repository, and expanding with the latest updates:
Problem description:
The GBIF dwca file for Lepidoptera is huge, about 30GB zipped. The script to download GBIF images using dwca files, 02-fetch_gbif_moth_data.py, runs out of memory and is automatically killed by slurm. I tried running a smaller download for just one family of moths, Erebidae, which has a zipped dwca file of 1.5GB, but the process still exploded in memory use and was killed.
The cause of the problem:
I investigated the 02-fetch_gbif_moth_data.py code, and the likely reason for such high RAM use is the parallelisation of processes that happens on lines 187-188:
This happens after the dwca file is loaded and the occurrence data is read, which creates a very large dataframe in global memory. Parallelisation appears to duplicate the global variables, so each sub-process gets its own independent instance of Python. This means the occurrence dataframe is duplicated for each taxon_key, which likely results in the astronomical memory usage.
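A generic illustration of the pattern (not the actual 02-fetch_gbif_moth_data.py code; the dataframe, column name, and worker function are made up):

```python
import multiprocessing as mp
import pandas as pd

# Large table loaded at module level: with the "spawn" start method every
# worker re-imports this module and rebuilds the dataframe; with "fork",
# copy-on-write pages get duplicated as soon as workers start touching it.
occurrence_df = pd.read_csv("occurrence.txt", sep="\t", on_bad_lines="skip")

def download_for_taxon(taxon_key):
    # Each worker effectively operates on its own copy of occurrence_df.
    subset = occurrence_df[occurrence_df["taxonKey"] == taxon_key]
    return len(subset)  # placeholder for the actual image-download logic

if __name__ == "__main__":
    taxon_keys = occurrence_df["taxonKey"].unique()
    with mp.Pool(processes=8) as pool:
        pool.map(download_for_taxon, taxon_keys)
```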
Additionally, the dwca files that are read into memory using the dwca-reader package take up a very large amount of space. The main culprit is the occurrence.txt file: when read into memory, these dataframes consume an unexpectedly large amount of RAM. For example, a zipped dwca file of ~200MB contains a ~1GB occurrence.txt file, which takes ~3.5GB of RAM once loaded.
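A quick, hypothetical way to confirm the blow-up on any of the occurrence files (pandas object/string columns routinely inflate a text file several times over):

```python
import pandas as pd

occurrence_df = pd.read_csv("occurrence.txt", sep="\t", on_bad_lines="skip")
# deep=True counts the actual Python string objects, not just pointer sizes
print(occurrence_df.memory_usage(deep=True).sum() / 1e9, "GB in RAM")
```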
Proposed solution:
I have turned off parallelisation in the download code, so the images for each species are downloaded serially. This takes a lot more time. I have downloaded separate dwca files for the 16 moth families provided by David Roy in the initial moths checklist, and I have written a wrapper Python file which takes the family-specific dwca file as an argument and calls the functions from the 02-fetch_gbif_moth_data.py file. I will make a new branch and open a PR to document these changes.
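A rough sketch of what such a wrapper can look like (the entry-point function called at the end is hypothetical; substitute whatever 02-fetch_gbif_moth_data.py actually exposes):

```python
import argparse
import importlib.util

def load_fetch_module(path="02-fetch_gbif_moth_data.py"):
    # The script's filename is not a valid module name, so load it by path.
    spec = importlib.util.spec_from_file_location("fetch_gbif_moth_data", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dwca-file", required=True,
                        help="family-specific dwca file, e.g. erebidae.zip")
    args = parser.parse_args()

    fetch = load_fetch_module()
    # Hypothetical entry point: download images for this family serially.
    fetch.fetch_data(args.dwca_file)
```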