bio-guoda / idigbio-spark

processing engine for biodiversity archives

idigbio-spark

Generates taxonomic checklists and occurrence collections from biodiversity data sources such as GBIF and iDigBio. Converts Darwin Core Archives (DwC-A) tracked by Preston into Parquet and sequence files to enable parallel processing on a compute cluster.

This library relies on Apache Spark and Mesos/HDFS clusters to:

  1. generate checklists
  2. generate occurrence collections
  3. import Darwin Core Archives into the Apache Parquet data format
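
To make the first task concrete, here is a minimal single-machine sketch in Python. It is not this library's implementation, which runs on Spark. It assumes that a checklist is a count of occurrence records per scientificName within a latitude/longitude bounding box. The column names are standard Darwin Core terms; the sample rows, function names, and bounding box are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of a Darwin Core occurrence table (tab-separated).
# Column names are standard Darwin Core terms; the rows are made up.
SAMPLE_OCCURRENCES = (
    "scientificName\tdecimalLatitude\tdecimalLongitude\n"
    "Ursus arctos\t61.2\t-149.9\n"
    "Ursus arctos\t60.5\t-151.0\n"
    "Alces alces\t61.0\t-150.2\n"
    "Puma concolor\t37.8\t-122.3\n"
)


def in_bounding_box(row, south, west, north, east):
    """Return True if the record's coordinates fall inside the box."""
    try:
        lat = float(row["decimalLatitude"])
        lon = float(row["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return False  # skip records with missing or malformed coordinates
    return south <= lat <= north and west <= lon <= east


def build_checklist(tsv_text, south, west, north, east):
    """Count occurrence records per scientific name within a bounding box."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(
        row["scientificName"]
        for row in reader
        if row.get("scientificName")
        and in_bounding_box(row, south, west, north, east)
    )
    # Most frequently recorded taxa first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


checklist = build_checklist(
    SAMPLE_OCCURRENCES, south=55.0, west=-160.0, north=65.0, east=-140.0
)
for name, record_count in checklist:
    print(f"{name}\t{record_count}")
```

On a cluster, the same filter-then-count pattern would presumably be expressed as a distributed job over Parquet files rather than an in-memory loop.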

At the time of writing (June 2017), this library was used by http://effechecka.org and https://gimmefreshdata.github.io. Note that the effechecka and freshdata projects are no longer active.

Funding

This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.