LibreCat / Catmandu

Catmandu - a data processing toolkit
https://librecat.org

Port dc_breaker.py to Catmandu #208

Closed: nichtich closed this issue 8 years ago

nichtich commented 8 years ago

I just stumbled upon the Code4Lib Journal article "Metadata Analysis at the Command-Line" from around 3 years ago. It illustrates metadata processing with standard Unix tools (grep, sort, uniq...) and a custom script, dc_breaker.py, available at https://github.com/vphill/metadata_breakers. We could go through the article and show how to achieve the same with Catmandu. If something does not work well, we will have found a gap in Catmandu based on this use case.

One feature of dc_breaker.py that is not fully available in Catmandu is a nice statistic of field usage. Catmandu::Stat does something similar, but its output is not as nicely visualized (this could be done with another Exporter), it is less clear, and it has a limitation (https://github.com/LibreCat/Catmandu-Stat/issues/6).

phochste commented 8 years ago

As a start, I created an importer and an exporter that can handle the Breaker format: https://github.com/LibreCat/Catmandu-Breaker

phochste commented 8 years ago

Then the examples are:

 # Harvest a dataset in breaker format
 $ catmandu convert OAI --url http://biblio.ugent.be/oai to Breaker > data.breaker

 # Select all creators from this data
 $ catmandu convert Breaker to TSV --fix 'select all_match(tag,"creator"); retain(data)' < data.breaker

 # Select the record IDs and creators
 $ catmandu convert Breaker to TSV --fix 'select all_match(tag,"creator"); retain(_id,data)' < data.breaker
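
Once the data is in Breaker format, the field usage statistic mentioned at the top of this issue can be approximated with standard Unix tools. The following is only a minimal sketch: it assumes each line of a Breaker file holds three tab-separated columns (record identifier, tag, value). The sample data and the helper name `field_usage` are made up for illustration and are not part of Catmandu or dc_breaker.py.

```shell
# Hypothetical sample in the assumed Breaker layout: id <TAB> tag <TAB> value
sample=$(mktemp)
printf '1\tcreator\tAlice\n1\ttitle\tFirst paper\n2\tcreator\tBob\n2\tcreator\tCarol\n2\ttitle\tSecond paper\n' > "$sample"

# For each tag, print: tag, total occurrences, number of distinct records using it
field_usage() {
  awk -F'\t' '
    { count[$2]++ }                        # every line is one occurrence of the tag
    !seen[$1 FS $2]++ { records[$2]++ }    # count each (record, tag) pair only once
    END { for (t in count) printf "%s\t%d\t%d\n", t, count[t], records[t] }
  ' "$1" | sort
}

field_usage "$sample"
```

Running `field_usage data.breaker` on a harvested file would give one line per tag, which could then be compared with what Catmandu::Stat reports for the same data.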