refactor dwc2parquet job into separate repository

bio-guoda / idigbio-spark

processing engine for biodiversity archives

refactor dwc2parquet job into separate repository #1

Open jhpoelen opened 8 years ago

jhpoelen commented 8 years ago

I suggest including namespace prefixes in the column names, as used here: https://github.com/iDigBio/idb-backend/blob/master/idb/helpers/fieldnames.py

mjcollin commented 8 years ago

Prefixes are ok but I'd like to ask that column names not include ":" or ".". See https://github.com/bio-guoda/guoda-datasets/issues/1

I've been unable to find a trick for working with ":" in column names with the py4j version that ships with Spark 2.0.

jhpoelen commented 8 years ago

@mjcollin ":" and "." are definitely a source of funkiness. Hopefully, we can figure out a way to deal with this without having to change too much in existing jobs.
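One possible way to keep the prefixes while avoiding ":" and "." would be to rename the columns just before writing the Parquet output. The following is only a minimal sketch of that idea, not something taken from the existing jobs: the helper names and the choice of "_" as the replacement are assumptions, and the example column names are illustrative.

```python
import re

# Characters reported as troublesome in column names in this thread.
_UNSAFE = re.compile(r"[:.]")


def sanitize_column_name(name, replacement="_"):
    """Replace ':' and '.' in one column name (hypothetical helper)."""
    return _UNSAFE.sub(replacement, name)


def sanitize_column_names(names, replacement="_"):
    """Sanitize a list of names and fail loudly if the result is not unique."""
    sanitized = [sanitize_column_name(n, replacement) for n in names]
    if len(set(sanitized)) != len(sanitized):
        raise ValueError("sanitized column names are not unique")
    return sanitized


# Illustrative prefixed names; the real list would come from the archive.
columns = ["dwc:scientificName", "idigbio:uuid", "dcterms:modified"]
print(sanitize_column_names(columns))
```

With a PySpark DataFrame, the resulting list could then be applied with something like `df.toDF(*sanitize_column_names(df.columns))`, so existing jobs would only need one extra step before the write. The uniqueness check guards against two distinct names, such as "a:b" and "a.b", collapsing into the same column.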