MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Use OpenRefine for clustering and analysis? #159

Closed: ghukill closed this issue 6 years ago

ghukill commented 6 years ago

After a few unrelated workshops with OpenRefine, it's clear that it's a popular application for cleaning up / analyzing metadata. Is there a possible connection with Combine?

Particularly for clustering. We had an instance where there were 10-20 different date formats across 200k+ records. It's easy to glance and observe a few, but how do you rigorously or thoroughly analyze and cluster the different formats? It would be possible to integrate some clustering algorithms into Combine, either in a Spark context or with Python, but that feels like recreating some lovely wheels that OpenRefine has already rounded.
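To make the date-format problem concrete, here is a rough sketch of the simplest thing that could work in plain Python: collapse each value to a "shape" (digits become `9`, letters become `A`) and group values by that shape. This is not one of OpenRefine's clustering methods, just a pattern mask for illustration, and the function names `date_shape` and `cluster_by_shape` are made up here.

```python
import re


def date_shape(value: str) -> str:
    """Reduce a string to its shape: digits -> 9, ASCII letters -> A.

    Punctuation and whitespace are kept, so "2018-05-01" and "1999-12-31"
    share the shape "9999-99-99".
    """
    shape = re.sub(r"[0-9]", "9", value.strip())
    return re.sub(r"[A-Za-z]", "A", shape)


def cluster_by_shape(values):
    """Group raw values under their shape; returns {shape: [values, ...]}."""
    clusters = {}
    for value in values:
        clusters.setdefault(date_shape(value), []).append(value)
    return clusters


if __name__ == "__main__":
    sample = ["2018-05-01", "1999-12-31", "May 1, 2018", "05/01/2018", "c. 1920"]
    # Print the shapes, most common first
    for shape, members in sorted(
        cluster_by_shape(sample).items(), key=lambda kv: -len(kv[1])
    ):
        print(shape, len(members), members[:3])
```

Over 200k+ records the same function could presumably be applied in a Spark job and the shapes counted there, but the mask is blunt: it cannot tell `05/01/2018` from `01/05/2018`, which is where OpenRefine's facets and clustering would still earn their keep.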

Might be worth investigating a link to OpenRefine, which could be included in the server build. Would need to consider whether the connection is designed for analysis only (one-way), or for modification and re-integration into the record's metadata (round-trip).

A round-trip would be difficult, as you'd be updating a particular metadata field in an XML record, but you would have the XPath in hand from the index mapping, which is how the field would map to OpenRefine in the first place. Imagining an interface not dissimilar from the Validation reports, where you can select indexed fields to include in an Excel or CSV export, which would map nicely to OpenRefine's column / row model.
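A minimal sketch of what that round-trip might look like, using only the Python standard library. Everything here is an assumption for illustration: the `FIELD_XPATHS` mapping, the `export_rows` / `apply_rows` helpers, and the record layout are hypothetical and not Combine's actual index mapping or API. It also only handles single-valued, un-namespaced fields, which real records often are not.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping of column name -> XPath (ElementTree's XPath subset)
FIELD_XPATHS = {"title": "./title", "date": "./date"}


def export_rows(records, field_xpaths):
    """Flatten {record_id: xml_string} into CSV text, one column per field."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["record_id", *field_xpaths], lineterminator="\n"
    )
    writer.writeheader()
    for record_id, xml in records.items():
        root = ET.fromstring(xml)
        row = {"record_id": record_id}
        for column, xpath in field_xpaths.items():
            node = root.find(xpath)
            row[column] = node.text if node is not None and node.text else ""
        writer.writerow(row)
    return buf.getvalue()


def apply_rows(records, csv_text, field_xpaths):
    """Write edited CSV values back into the XML at the same XPaths."""
    updated = dict(records)
    for row in csv.DictReader(io.StringIO(csv_text)):
        root = ET.fromstring(updated[row["record_id"]])
        for column, xpath in field_xpaths.items():
            node = root.find(xpath)
            if node is not None:  # skip fields the record does not have
                node.text = row[column]
        updated[row["record_id"]] = ET.tostring(root, encoding="unicode")
    return updated
```

The one-way (analysis only) case is just `export_rows`; the hard part of the round-trip is everything this sketch skips, such as repeated elements, namespaces, and deciding what to do when a cleaned-up value has no node to land in.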

ghukill commented 6 years ago

Closing, addressed in: #170