deutsche-nationalbibliothek / pica-rs

Tools to work with bibliographic records encoded in PICA+.
https://deutsche-nationalbibliothek.github.io/pica-rs/
European Union Public License 1.2

Add `replace` command #169

Closed: nwagner84 closed 1 year ago

nwagner84 commented 3 years ago

Summary

It happens that the rules for valid subfield values change, and a cleanup process must be scheduled and performed. In the meantime (which can take months or years, or may never end) the values are inconsistent: two or more subfield values exist that mean the same thing, for example when the provenance value changes from ema-gnd to emagnd. This is frustrating because each subsequent data analysis must repeat the same cleanup steps until the cleanup process is done.

Details

A new replace command will be run before the data analysis in order to fix these simple string replacement cases. All subsequent processes (R or Python scripts) can benefit from the cleaned subfield values. Once the cleanup process is done and the source file contains no invalid subfield values, the replace command can be removed and the result must be the same.

The command could look like this:

$ pica replace "044H{b == 'GND' && 9? && H == 'ema-gnd', H = 'emagnd'}" DUMP.dat
$ pica replace "044H{b == 'GND', H := 'emagnd'}" --and "012A/*.a = 'foo'" DUMP.dat
$ pica replace "012A/*.a = 'foo'" DUMP.dat

This command (re-)uses the syntax of pica-rs path expressions, which can have an optional filter on subfields and a list of subfields which should be replaced. A new assignment operator = is also introduced.
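To make the intended semantics of the first example concrete, here is a minimal Python sketch of what the expression 044H{b == 'GND' && 9? && H == 'ema-gnd', H = 'emagnd'} might do to a single field. The data model (a tag plus a list of code/value pairs) and the function name replace_provenance are assumptions made for illustration, as is the choice to rewrite every H subfield of a matching field; none of this is taken from pica-rs.

```python
from typing import List, Tuple

Subfield = Tuple[str, str]          # (code, value); hypothetical model
Field = Tuple[str, List[Subfield]]  # (tag, subfields); hypothetical model


def replace_provenance(field: Field) -> Field:
    """Emulate 044H{b == 'GND' && 9? && H == 'ema-gnd', H = 'emagnd'}."""
    tag, subfields = field
    if tag != "044H":
        return field
    matches = (
        ("b", "GND") in subfields                       # b == 'GND'
        and any(code == "9" for code, _ in subfields)   # 9? (subfield exists)
        and ("H", "ema-gnd") in subfields               # H == 'ema-gnd'
    )
    if not matches:
        return field
    # Assumed assignment semantics: every H subfield gets the new value.
    return tag, [(c, "emagnd" if c == "H" else v) for c, v in subfields]
```

Applied to a field that already carries emagnd, the filter no longer matches and the field is returned unchanged, which is the property that lets the step be dropped once the source data is clean.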

Implementation

Note: This command might entail converting a referenced record to its mutable variant, which results in a slower running time in comparison to commands like cat or count.
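One common way to limit that cost, shown here only as an illustration and not as a description of how pica-rs works, is to scan the read-only record first and build a modified copy only when a replacement actually applies. The sketch below uses nested tuples as a stand-in for the read-only record; the function name replace_in_record and its parameters are hypothetical.

```python
from typing import Tuple

Subfield = Tuple[str, str]                # (code, value)
Field = Tuple[str, Tuple[Subfield, ...]]  # (tag, subfields)
Record = Tuple[Field, ...]                # immutable stand-in for a read-only record


def replace_in_record(record: Record, tag: str, code: str,
                      old: str, new: str) -> Record:
    """Return the same object if nothing matches, else a modified copy."""
    # Cheap read-only scan: is there anything to replace at all?
    if not any(t == tag and (code, old) in subs for t, subs in record):
        return record  # no copy is made
    # Only now pay for building a new record.
    return tuple(
        (t, tuple((c, new if (t == tag and c == code and v == old) else v)
                  for c, v in subs))
        for t, subs in record
    )
```

With this approach, records that need no change pass through at roughly the cost of a read-only scan, and only matching records pay for the copy.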

nichtich commented 1 year ago

This requires some more elaboration on what to support and how, without creating yet another programming language or reinventing FCV. In general we could:

But selecting which (sub)fields to replace can get very complicated (if the value of subfield A is B, then add C to X unless Y is Z...).

We already have a use case of fixing typos: for instance, some records had a letter in front of a DDC notation, so we could remove it with something like:

pica replace "045F.a =~ '^[a-z]([0-9]{3}.*)' with '\$1'"
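For comparison, the same regular-expression substitution can be sketched in Python with the standard re module. The pattern is copied from the command above; the function name fix_ddc is made up for this example, and Python writes the backreference as \1 where the command line uses $1.

```python
import re

# A single lowercase letter followed by at least three digits; the
# capture group keeps everything after the stray letter.
DDC_TYPO = re.compile(r"^[a-z]([0-9]{3}.*)")


def fix_ddc(value: str) -> str:
    """Drop a stray leading letter from a DDC notation, if present."""
    return DDC_TYPO.sub(r"\1", value)
```

Values that do not start with a stray letter are left untouched, because the pattern only matches when a lowercase letter directly precedes the three digits.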

The result, however, must be loaded back into the CBS database; we use PICA Patch for this. If pica-rs modifies the files, we can use picadata to create a diff/patch, e.g.

  003@ $0014122774
- 045F $aa821.3$AOCLC
+ 045F $a821.3$AOCLC
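
A diff of this shape can be approximated with Python's standard difflib module. This is only a sketch of the idea, not how picadata produces its output; the function name record_diff is hypothetical, and each record is assumed to be given as a list of PICA Plain field lines.

```python
import difflib
from typing import List


def record_diff(before: List[str], after: List[str]) -> List[str]:
    """Line-wise diff with '  ', '- ' and '+ ' prefixes."""
    # ndiff also emits '? ' hint lines marking intraline changes;
    # they are dropped here to keep only the field lines.
    return [line for line in difflib.ndiff(before, after)
            if not line.startswith("? ")]


before = ["003@ $0014122774", "045F $aa821.3$AOCLC"]
after = ["003@ $0014122774", "045F $a821.3$AOCLC"]
print("\n".join(record_diff(before, after)))
```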

nwagner84 commented 1 year ago

What do you mean by "creating yet another programming language"? This idea is not about inventing a new programming language. To be honest, I would never want a programming language (if-else, functions, recursion, functors, ...) inside such a replace expression. pica-rs is a command-line tool, and a programming environment (IDE, debugger, etc.) should not be necessary.

I've updated the idea and added more context about our use case. The results of this command must not be loaded back into the CBS database. As with all other pica-rs commands, pica-rs never writes back into the CBS database.

The PICA Patch format you mentioned operates on PICA Plain and can't be used by pica-rs, which operates on normalized PICA+. But I'll do a test: converting the whole DUMP into PICA Plain and using picadata in combination with PICA Patch for the first example (044H{b == 'GND' && 9? && H == 'ema-gnd', H := 'emagnd'}). I'm really curious about the running time of this setup. As soon as I get results, I'll add a comment to the alternatives section.
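
To give a rough idea of what the conversion step involves, here is a minimal Python sketch that turns one normalized PICA+ record into PICA Plain lines. It assumes the usual byte layout (fields terminated by 0x1E, subfields introduced by 0x1F, records terminated by a newline), UTF-8 encoded values, and that a literal $ in a value is written as $$ in PICA Plain. The function name normalized_to_plain is made up for this example, and this is not the code path of picadata or pica-rs.

```python
def normalized_to_plain(record: bytes) -> str:
    """Convert one normalized PICA+ record into PICA Plain text."""
    lines = []
    for field in record.rstrip(b"\n").split(b"\x1e"):
        if not field:
            continue  # skip the empty piece after the last terminator
        # The tag (with optional occurrence) ends at the first space.
        tag, _, rest = field.partition(b" ")
        # Each subfield is 0x1F + one-character code + value.
        subfields = rest.split(b"\x1f")[1:]
        body = "".join(
            "$" + sf[:1].decode() + sf[1:].decode().replace("$", "$$")
            for sf in subfields
        )
        lines.append(f"{tag.decode()} {body}")
    return "\n".join(lines)
```

For a whole dump, a function like this would be applied line by line, which is part of why the running time of the conversion is worth measuring.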