AMI-system / gbif_download_standalone

A standalone repo to download images from the GBIF database according to a species list.
MIT License

compare old and new Rolnick codebase for GBIF image downloading #1

LevanBokeria closed this issue 1 year ago

LevanBokeria commented 1 year ago

The old codebase imports the occurrences module from pygbif and iteratively queries the GBIF online database for images. The new codebase downloads the DwC-A files that contain the occurrence dataframes, looks through the occurrence dataframe, gets image URLs from the corresponding entries in the media dataframe, and then downloads those images.
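For concreteness, here is a minimal sketch of the DwC-A flow just described. It assumes a standard GBIF Darwin Core Archive already extracted to disk; the file names (occurrence.txt, multimedia.txt) and column names (gbifID, species, identifier) follow GBIF's standard export, not necessarily this repo's actual script:

```python
import os

import pandas as pd
import requests


def download_images_for_species(dwca_dir, species_name, out_dir, limit=200):
    # The occurrence and multimedia tables of a GBIF DwC-A are
    # tab-separated and share the gbifID key.
    occ = pd.read_csv(os.path.join(dwca_dir, "occurrence.txt"),
                      sep="\t", on_bad_lines="skip", low_memory=False)
    media = pd.read_csv(os.path.join(dwca_dir, "multimedia.txt"),
                        sep="\t", on_bad_lines="skip", low_memory=False)

    # Occurrence rows for the species of interest
    ids = occ.loc[occ["species"] == species_name, "gbifID"]

    # Matching media rows; the "identifier" column holds the image URL
    urls = media.loc[media["gbifID"].isin(ids), "identifier"].dropna()

    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(urls.head(limit)):
        resp = requests.get(url, timeout=30)
        if resp.ok:
            with open(os.path.join(out_dir, f"{i}.jpg"), "wb") as f:
                f.write(resp.content)
```

For large archives, pandas' usecols and chunksize arguments keep memory use in check, which is relevant to the RAM caveat Fagner raises below.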

My suspicion is that the new way is much faster. I will compare the two approaches.

LevanBokeria commented 1 year ago

Compared the speed of downloading. The old way, which pings the API for each species' metadata, turned out to be faster than the new way, which uses pre-downloaded DwC-A files. I also tried multiple species, expecting the new way to prevail there, but it did not:

| Scenario | Old way (pygbif) | New way (DwC-A) |
| --- | --- | --- |
| 1 species, 200 images | 107 s | 176 s |
| 16 species, 20 images each | 182 s | 309 s |

Bottom line: the old way is faster in my testing.

However, I asked Aditya and Fagner for their opinion. Fagner suggested using the DwC-A method, so I will build the pipeline around that method. Email correspondence below:

Hi Levan,

Both approaches work well and have their pros and cons. I am going to highlight them from my point of view.

Using pygbif:

Pros:

- It is easier to fetch data for selected species, as the API supports searching by species name or species ID.
- If you already have a pipeline for fetching data, it is easy to add.

Cons:

- If your dataset is large, it may take days to fetch just the metadata.
- You may miss the data for some species due to server-side errors, timeouts, and pagination errors. These errors are not rare.
- The data can change while you are fetching it.

Using DwC-A:

Pros:

- You can download all the metadata at once. It takes GBIF only a few minutes to generate the archive, even for datasets with millions of occurrences.
- You have a frozen version of the database (GBIF also generates a DOI for each DwC-A that can be used later for citation purposes).

Cons:

- Depending on the size of the DwC-A, you will need more RAM to load the dataset.
- You may have to adapt your code to the DwC-A format, but the data can easily be converted to a pandas dataframe.

I prefer using DwC-A because it is much faster and more reliable for getting metadata.

Best, Fagner

To add a point to Fagner's comments: for most species, the number of images available through pygbif was far lower than what is actually available on GBIF. Using DwC-A, by contrast, gives an exact copy of the current database.

Regards, Aditya
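To make the pygbif caveats above concrete, here is a rough sketch of that route (illustrative, not this repo's actual code). occurrences.search pages through records with limit and offset, and the naive retry loop stands in for handling the timeouts and server-side errors Fagner mentions:

```python
import time

from pygbif import occurrences


def fetch_occurrence_metadata(species_name, n_records=200, page_size=100):
    results, offset = [], 0
    while len(results) < n_records:
        try:
            page = occurrences.search(scientificName=species_name,
                                      limit=page_size, offset=offset)
        except Exception:
            time.sleep(5)  # naive back-off on server-side errors
            continue
        results.extend(page["results"])
        if page.get("endOfRecords"):
            break
        offset += page_size
    # Image URLs sit in each record's "media" list
    return results[:n_records]
```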

LevanBokeria commented 1 year ago

Fagner has further clarified that although the DwC-A method may be slower than the pygbif method in a one-to-one comparison, it allows processes to be parallelized:

Hi Levan,

Regarding the download speed of the DwC-A approach, it depends on the number of CPUs and nodes you allocate. If you are using a shared network filesystem, you can run multiple jobs in parallel against the same target location (--dataset_path); the script will skip the images that are already downloaded. You can also run multiple processes within each job, one worker per CPU (see the --num_workers option). I usually run 4 jobs with 32 workers each.

Best, Fagner
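A minimal sketch of the skip-and-parallelize pattern Fagner describes, with hypothetical helper names rather than the repo's actual API: workers share one target directory, and files that already exist are skipped, so several jobs can safely point at the same --dataset_path:

```python
import os
from multiprocessing import Pool

import requests


def download_one(args):
    url, dataset_path = args
    fname = os.path.join(dataset_path, os.path.basename(url))
    if os.path.exists(fname):  # already fetched by this or another job
        return
    resp = requests.get(url, timeout=30)
    if resp.ok:
        with open(fname, "wb") as f:
            f.write(resp.content)


def download_all(urls, dataset_path, num_workers=32):
    os.makedirs(dataset_path, exist_ok=True)
    # One worker per CPU, mirroring the --num_workers option
    with Pool(num_workers) as pool:
        pool.map(download_one, [(u, dataset_path) for u in urls])
```

The existence check is what makes concurrent jobs idempotent; a stricter version would write to a temporary file and rename it atomically, so a crashed partial download is not mistaken for a finished one.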