bio-guoda / idigbio-spark

processing engine for biodiversity archives

Add line number and file source to dwca2parquet output #6

Open ialzuru opened 5 years ago

ialzuru commented 5 years ago

The current Parquet output records the hash of the dataset that each record came from, but not the specific file or line number within that dataset. If a dataset contains two copies of the same record, there is no consistent way to tell them apart.

https://github.com/bio-linker/organization/wiki/2019-08-23-Work-Session-Notes
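To illustrate the request, here is a minimal sketch in plain Python (not the Spark code in this repository) of tagging each record with its source file and line number alongside the dataset hash. The function name `rows_with_provenance` and the column names `datasetHash`, `sourceFile`, and `lineNumber` are assumptions made for this example, not names used by dwca2parquet.

```python
import csv
import io


def rows_with_provenance(dataset_hash, files):
    """Yield one dict per record, tagged with where it came from.

    `files` maps a file name inside the archive to its text content.
    All output column names here are hypothetical.
    """
    for file_name, text in files.items():
        reader = csv.DictReader(io.StringIO(text))
        for record in reader:
            yield {
                "datasetHash": dataset_hash,
                "sourceFile": file_name,
                # Physical line in the file on which this record ends;
                # the header is line 1, so the first record is line 2.
                "lineNumber": reader.line_num,
                **record,
            }


# A dataset with two identical copies of the same record.
files = {"occurrence.txt": "id,scientificName\n1,Puma concolor\n1,Puma concolor\n"}
rows = list(rows_with_provenance("example-hash", files))
```

With these extra columns, the two copies differ only in `lineNumber` (2 and 3), which should be enough to refer to each one consistently across runs as long as the underlying file does not change.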