deutsche-nationalbibliothek / pica-rs

Tools to work with bibliographic records encoded in PICA+.
https://deutsche-nationalbibliothek.github.io/pica-rs/
European Union Public License 1.2

Support PICA Path expression syntax #346

Closed nichtich closed 2 years ago

nichtich commented 2 years ago

Using pica-rs together with PICA::Data (Catmandu and/or picadata) is confusing because both use slightly different syntax to reference fields and subfields. Following this analysis I tried pica-rs 0.7.0 and checked for compatibility with PICA Path expression syntax (not including deprecated features):

pica-rs adds some more features such as wildcards with [...] in tags (e.g. 02[89]A) that may also become part of language-independent PICA Path expression syntax but first we need a common set of minimal language features. I could argue about 6 and 11 but 2, 10 and 12 are essential. Here is an example of an expression that uses these syntax features and its corresponding expression supported by pica-rs:

00..$ab
00[0123456789][ABCDEFGHIJKLMNOPQRSTUVWXY@].[ab]
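To make the comparison concrete, here is a minimal sketch (not pica-rs or PICA::Data code) that reads both spellings and checks which tag/subfield pairs they select. It assumes a tag is four positions, that `.` in a tag position matches any single character, that `[...]` is a character class, and that either `$` or `.` introduces the subfield codes. The names `parse` and `selects` are hypothetical.

```python
import re

# One tag position: a bracketed class such as [89], or a single character.
POSITION = r"\[[^\]]+\]|[0-9A-Z@.]"
# Four tag positions, then "$" or "." followed by subfield codes,
# optionally wrapped in brackets (e.g. ".[ab]").
PATH = re.compile(rf"((?:{POSITION}){{4}})[$.]\[?([0-9A-Za-z]+)\]?")


def parse(path):
    """Return (compiled tag regex, set of subfield codes) for a path."""
    m = PATH.fullmatch(path)
    if m is None:
        raise ValueError(f"unsupported path: {path!r}")
    tag_pattern, codes = m.groups()
    parts = []
    for token in re.findall(POSITION, tag_pattern):
        if token.startswith("[") or token == ".":
            parts.append(token)  # class or wildcard, usable as regex as-is
        else:
            parts.append(re.escape(token))  # literal tag character
    return re.compile("".join(parts)), set(codes)


def selects(path, tag, code):
    """True if the path selects subfield `code` of a field with `tag`."""
    tag_re, codes = parse(path)
    return tag_re.fullmatch(tag) is not None and code in codes


short = "00..$ab"
long = "00[0123456789][ABCDEFGHIJKLMNOPQRSTUVWXY@].[ab]"
for tag, code in [("003@", "a"), ("001A", "b"), ("010@", "a"), ("003@", "0")]:
    print(tag, code, selects(short, tag, code), selects(long, tag, code))
```

Under these assumptions both spellings agree on the four pairs tried. As written above, the bracketed class stops at `Y`, so the two forms would differ for a tag ending in `Z`.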

This is not about which syntax is better, but about getting to a common PICA Path expression syntax while breaking as few existing features as possible. So in short I propose the following extensions to the record matcher syntax:

nichtich commented 2 years ago

See https://github.com/gbv/PICA-Data/issues/109#issuecomment-981562847 for a formal grammar of PICA Path expression syntax as implemented in PICA::Data (dev branch).

nwagner84 commented 2 years ago

Updates:

nichtich commented 2 years ago

What's your opinion on subfield without prefix (e.g. 003@0 for 003@$0)? This is lazy syntax supported in PICA::Data, but we should not encourage it for readability. Nevertheless the parser could allow for it.

Adding wildcard character . in occurrence (e.g. 028B/0.) is less relevant. I would make it a deprecated PICA::Data extension and not include it in the final specification of PICA Path expression syntax.

See current state of PICA Path expression syntax specification at http://format.gbv.de/query/picapath

nwagner84 commented 2 years ago

> What's your opinion on subfield without prefix (e.g. 003@0 for 003@$0)? This is lazy syntax supported in PICA::Data, but we should not encourage it for readability. Nevertheless the parser could allow for it.

I am still not convinced that the "lazy syntax" is a good idea, but I'm going to discuss this with my colleagues in the upcoming weeks.

nichtich commented 2 years ago

I fully agree that using the dot as subfield indicator is better than no character (e.g. 003@.0 instead of 003@0), and I would not allow the lazy syntax if building from scratch. The argument for applying the robustness principle here is compatibility with an existing syntax. Anyway, my main motivation is to get to a final specification of PICA Path syntax.
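As an illustration of the robustness argument, a lenient reader could accept all three spellings and emit one canonical form. This is only a sketch under simplifying assumptions: no occurrence, no wildcards, and a tag shaped like `[012][0-9][0-9][A-Z@]`. The function name `normalize` is hypothetical and is not part of pica-rs or PICA::Data.

```python
import re

# Tag, then an optional "$" or "." prefix, then one or more subfield codes.
LENIENT = re.compile(r"([012][0-9]{2}[A-Z@])([$.]?)([0-9A-Za-z]+)")


def normalize(path):
    """Rewrite 003@0, 003@.0 and 003@$0 to the explicit form 003@$0."""
    m = LENIENT.fullmatch(path)
    if m is None:
        raise ValueError(f"unsupported path: {path!r}")
    tag, _prefix, codes = m.groups()
    return f"{tag}${codes}"


for spelling in ("003@0", "003@.0", "003@$0", "021Aah"):
    print(spelling, "->", normalize(spelling))
```

Because the tag has a fixed length of four characters in this sketch, the missing prefix is not ambiguous: everything after the fourth character is read as subfield codes.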

Do you think the [...] syntax for tag ranges (e.g. 012[XY]) should also become part of the specification?

nwagner84 commented 2 years ago

I'm not happy doing this, but the next pica version will support the lazy syntax (see #362).

Tag ranges should not become part of the specification.

nwagner84 commented 2 years ago

Closed, because item 11 is implemented (see #362) and wildcard characters in occurrences (item 6) won't be implemented.