bio-guoda / idigbio-spark

processing engine for biodiversity archives

Generates taxonomic checklists and occurrence collections from biodiversity data sources like GBIF and iDigBio. Converts Darwin Core Archives (DwCA) tracked by Preston into Parquet and sequence files to enable parallel processing in a compute cluster.

This library relies on Apache Spark and Mesos/HDFS clusters to:

  1. generate checklists
  2. generate occurrence collections
  3. import Darwin Core Archives into the Apache Parquet data format
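
To illustrate what the first two steps produce, here is a minimal sketch in plain Python. It is not this library's API: the real jobs run on Spark, and the function names, the sample records, the prefix-based taxon match, and the bounding-box filter are all assumptions made for illustration. Only the field names (`scientificName`, `decimalLatitude`, `decimalLongitude`) are standard Darwin Core terms.

```python
from collections import Counter

# Hypothetical occurrence records, keyed by Darwin Core terms.
OCCURRENCES = [
    {"scientificName": "Ursus americanus", "decimalLatitude": 38.5, "decimalLongitude": -120.0},
    {"scientificName": "Ursus americanus", "decimalLatitude": 40.1, "decimalLongitude": -122.5},
    {"scientificName": "Ursus arctos", "decimalLatitude": 61.2, "decimalLongitude": -149.9},
    {"scientificName": "Puma concolor", "decimalLatitude": 37.8, "decimalLongitude": -122.3},
]


def in_bounding_box(record, south, west, north, east):
    """Return True if the record's coordinates fall inside the box."""
    lat = record["decimalLatitude"]
    lng = record["decimalLongitude"]
    return south <= lat <= north and west <= lng <= east


def occurrence_collection(records, taxon_prefix, bbox):
    """Select the records that match a taxon name prefix and a bounding box.

    Assumption: taxa are matched by a simple prefix on scientificName.
    """
    return [
        r for r in records
        if r["scientificName"].startswith(taxon_prefix + " ")
        and in_bounding_box(r, *bbox)
    ]


def checklist(records, taxon_prefix, bbox):
    """Summarize the matching records as (name, record count) pairs.

    Pairs are sorted by descending count, then by name.
    """
    counts = Counter(
        r["scientificName"]
        for r in occurrence_collection(records, taxon_prefix, bbox)
    )
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Bounding box given as (south, west, north, east) in decimal degrees.
BBOX = (32.0, -125.0, 42.0, -114.0)
print(checklist(OCCURRENCES, "Ursus", BBOX))
```

In this sketch an occurrence collection is the filtered list of records and a checklist is the distinct taxa found in it. On a cluster, the same filter and group-by-count steps would presumably run in parallel over the Parquet data produced by step 3.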

At time of writing (June 2017), this library is used by effechecka and freshdata. Note that the effechecka and freshdata projects are no longer active.


This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.