idigbio-spark

Generates taxonomic checklists and occurrence collections from biodiversity collections such as GBIF and iDigBio. Converts Darwin Core Archives (DwCA) tracked by Preston into Parquet and sequence files to enable parallel processing in a compute cluster.

This library relies on Apache Spark and Mesos/HDFS clusters to:

generate checklists
generate occurrence collections
import Darwin Core Archives into the Apache Parquet data format
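
As a rough illustration of what generating a checklist involves, here is a minimal plain-Python sketch. It does not use this library's API or Spark. The function name build_checklist, the bounding-box parameter, and the sample records are assumptions made for illustration; only the record keys follow Darwin Core term names. The sketch keeps occurrence records that fall inside a bounding box and belong to a given taxon, then counts records per scientific name.

```python
from collections import Counter


def build_checklist(occurrences, taxon, bbox):
    """Count occurrence records per scientific name within a bounding box.

    occurrences: iterable of dicts keyed by Darwin Core term names
    taxon: name that must appear in the record's higherClassification
    bbox: (min_lat, min_lng, max_lat, max_lng) in decimal degrees
    """
    min_lat, min_lng, max_lat, max_lng = bbox
    counts = Counter()
    for rec in occurrences:
        lat = rec["decimalLatitude"]
        lng = rec["decimalLongitude"]
        in_box = min_lat <= lat <= max_lat and min_lng <= lng <= max_lng
        if in_box and taxon in rec["higherClassification"]:
            counts[rec["scientificName"]] += 1
    # Most frequently recorded names first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample records, for illustration only.
sample = [
    {"scientificName": "Turdus migratorius",
     "higherClassification": "Animalia|Chordata|Aves",
     "decimalLatitude": 37.8, "decimalLongitude": -122.3},
    {"scientificName": "Turdus migratorius",
     "higherClassification": "Animalia|Chordata|Aves",
     "decimalLatitude": 37.9, "decimalLongitude": -122.2},
    {"scientificName": "Cyanocitta stelleri",
     "higherClassification": "Animalia|Chordata|Aves",
     "decimalLatitude": 37.7, "decimalLongitude": -122.4},
    {"scientificName": "Ursus arctos",
     "higherClassification": "Animalia|Chordata|Mammalia",
     "decimalLatitude": 37.8, "decimalLongitude": -122.3},
    {"scientificName": "Turdus migratorius",
     "higherClassification": "Animalia|Chordata|Aves",
     "decimalLatitude": 45.0, "decimalLongitude": -100.0},
]

checklist = build_checklist(sample, "Aves", (37.0, -123.0, 38.0, -122.0))
for name, record_count in checklist:
    print(name, record_count)
```

On a cluster, the same filter-then-count step would typically run in parallel over partitions of the occurrence data rather than in a single loop.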

At the time of writing (June 2017), this library is used by http://effechecka.org and https://gimmefreshdata.github.io . Note that the effechecka and freshdata projects are no longer active.

Funding

This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
