MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

consider repartitioning for static harvests #221

Open ghukill opened 6 years ago

ghukill commented 6 years ago

It looks as though submitting tar.gz files results in a single partition, at least early in the Spark workflow. By contrast, pointing to a directory of XML files results in as many partitions as there are files (e.g. 1984 in one case).

It might make sense to rationalize this a bit and repartition after reading the files: increase the partition count for archive files, and reduce it when a harvest starts from a large number of files.
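A minimal sketch of what that could look like, not Combine's actual code. It assumes the harvested records are in a PySpark RDD and relies only on the RDD methods `getNumPartitions()`, `repartition()` and `coalesce()`. The helper names and the `lower`/`upper` bounds (8 and 200) are hypothetical placeholders that would need tuning for a real cluster.

```python
def target_partitions(current, lower=8, upper=200):
    """Clamp a partition count into [lower, upper].

    The default bounds are illustrative assumptions, not values from Combine.
    """
    return max(lower, min(current, upper))


def rationalize_partitions(rdd, lower=8, upper=200):
    """Return `rdd` with a partition count inside [lower, upper].

    `rdd` is expected to behave like a pyspark.RDD.
    """
    current = rdd.getNumPartitions()
    target = target_partitions(current, lower, upper)
    if target > current:
        # Archive case (e.g. one partition): growing requires a full shuffle.
        return rdd.repartition(target)
    if target < current:
        # Many-files case (e.g. 1984 partitions): coalesce merges partitions
        # without a full shuffle by default.
        return rdd.coalesce(target)
    return rdd
```

With these bounds, a tar.gz harvest that arrives as 1 partition would be spread to 8, and a 1984-file harvest would be merged down to 200. For a DataFrame, the current count is available through `df.rdd.getNumPartitions()`, and `df.repartition()` / `df.coalesce()` could be used the same way.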