Support PICA Path expression syntax

nichtich commented 2 years ago

Using pica-rs together with PICA::Data (Catmandu and/or picadata) is confusing because the two use slightly different syntax to reference fields and subfields. Following this analysis, I tried pica-rs 0.7.0 and checked its compatibility with PICA Path expression syntax (not including deprecated features):

[x] 1. Full tag (e.g. 003@)
[x] 2. Wildcard character . in tag (e.g. 0...)
[x] 4. Occurrence preceded by / (e.g. 022A/01)
[ ] 6. Wildcard character . in occurrence (e.g. 028B/0.)
[x] 7. Occurrence range (e.g. 028B/01-02). Pattern of a range is [0-9]+-[0-9]+.
[x] 8. Any occurrence (e.g. 028B/*); not implemented in PICA::Data yet, will be part of the next release
[x] 9. Subfield preceded by . (e.g. 003@.0); not implemented in PICA::Data yet, will be part of the next release
[x] 10. Subfield preceded by $ (e.g. 003@$0)
[ ] 11. Subfield without prefix (e.g. 003@0 for 003@$0)
[x] 12. Multiple subfields (e.g. 022A$ap or 022A.ap)
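
To make the checklist concrete, here is a minimal sketch of a recognizer for the ten features listed above. It is not the grammar of pica-rs or PICA::Data: the regular expression, the two-character occurrence, the alphanumeric subfield codes, and the name `parse_pica_path` are assumptions made for illustration only.

```python
import re

# Hypothetical pattern covering checklist items 1-12; not an official grammar.
PICA_PATH = re.compile(
    r"""
    (?P<tag>[012.][0-9.]{2}[A-Z@.])                 # items 1, 2: tag, '.' as wildcard
    (?:/(?P<occ>\*|[0-9]+-[0-9]+|[0-9.]{2}))?       # items 8, 7, 4, 6: occurrence
    (?:[.$]?(?P<codes>[A-Za-z0-9]+))?               # items 9, 10, 11, 12: subfields
    """,
    re.VERBOSE,
)


def parse_pica_path(path):
    """Return the tag, occurrence and subfield codes of a path, or None if invalid."""
    match = PICA_PATH.fullmatch(path)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for example in ["003@", "0...", "022A/01", "028B/0.", "028B/01-02",
                    "028B/*", "003@.0", "003@$0", "003@0", "022A$ap", "02[89]A"]:
        print(example, parse_pica_path(example))
```

The bracket syntax `02[89]A` is rejected on purpose, since it is a pica-rs extension and not one of the listed features.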

pica-rs adds some more features such as wildcards with [...] in tags (e.g. 02[89]A) that may also become part of language-independent PICA Path expression syntax but first we need a common set of minimal language features. I could argue about 6 and 11 but 2, 10 and 12 are essential. Here is an example of an expression that uses these syntax features and its corresponding expression supported by pica-rs:

00..$ab
00[0123456789][ABCDEFGHIJKLMNOPQRSTUVWXYZ@].[ab]
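
The translation between the two forms can be sketched mechanically. The following is only an illustration under stated assumptions: the function name `expand_for_pica_rs` is hypothetical, the tag positions are assumed to allow `0-2`, `0-9`, `0-9` and `A-Z` or `@`, and character classes are written out in full because the example above does so.

```python
import re
import string

# Assumed set of allowed characters for each of the four tag positions.
TAG_CLASSES = ["012", string.digits, string.digits, string.ascii_uppercase + "@"]

# Simplified input syntax: tag, optional numeric occurrence, optional subfields.
SIMPLE_PATH = re.compile(
    r"(?P<tag>[012.][0-9.]{2}[A-Z@.])(?P<occ>/[0-9]{2,3})?(?:[.$](?P<codes>[A-Za-z0-9]+))?"
)


def expand_for_pica_rs(path):
    """Rewrite '.' tag wildcards and '$' subfields into explicit bracket classes."""
    match = SIMPLE_PATH.fullmatch(path)
    if match is None:
        raise ValueError(f"unsupported path expression: {path!r}")
    # Replace each '.' in the tag by the full character class of its position.
    tag = "".join(
        char if char != "." else f"[{TAG_CLASSES[pos]}]"
        for pos, char in enumerate(match["tag"])
    )
    occurrence = match["occ"] or ""
    codes = match["codes"]
    if codes is None:
        subfields = ""
    elif len(codes) == 1:
        subfields = "." + codes
    else:
        subfields = ".[" + codes + "]"
    return tag + occurrence + subfields


if __name__ == "__main__":
    print(expand_for_pica_rs("00..$ab"))
```

A rewrite step like this would only be a workaround; the proposal in this issue is to accept the shorter form directly.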

This is not about which syntax is better, but about getting to a common PICA Path expression syntax while breaking as few existing features as possible. So in short I propose the following extensions to the record matcher syntax:

Allow dot as wildcard for any allowed character in tags (e.g. 0... for all fields on level 0) by extending parser rule TagMatcherPattern with dot for Digit0, Digit1, Digit2, Digit3
Allow $ as subfield indicator as an alternative to the dot (e.g. 002@$0 equivalent to 002@.0) by extending parser rule DotExpr
Extend parser rule SubfieldCodes to SubfieldCode | "[" SubfieldCode+ "]" | SubfieldCode+ | "*"
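
As an illustration of the third point, the extended SubfieldCodes rule could behave roughly as follows. This is a sketch, not the pica-rs parser: the function name `parse_subfield_codes` is hypothetical, and subfield codes are assumed to be single alphanumeric characters.

```python
import re

# SubfieldCode | "[" SubfieldCode+ "]" | SubfieldCode+ | "*"
# (a single code is covered by the SubfieldCode+ alternative)
SUBFIELD_CODES = re.compile(r"\*|\[[A-Za-z0-9]+\]|[A-Za-z0-9]+")


def parse_subfield_codes(text):
    """Return the list of subfield codes, or None if any code matches ('*')."""
    if SUBFIELD_CODES.fullmatch(text) is None:
        raise ValueError(f"invalid subfield codes: {text!r}")
    if text == "*":
        return None
    # Drop the optional brackets and remove duplicate codes, keeping their order.
    return list(dict.fromkeys(text.strip("[]")))


if __name__ == "__main__":
    for example in ["0", "[ab]", "ab", "*"]:
        print(example, parse_subfield_codes(example))
```

With this rule, `[ab]` and `ab` yield the same list of codes, so existing pica-rs expressions would keep working next to the PICA Path form.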

nichtich commented 2 years ago

See https://github.com/gbv/PICA-Data/issues/109#issuecomment-981562847 for a formal grammar of PICA Path expression syntax as implemented in PICA::Data (dev branch).

nwagner84 commented 2 years ago

Updates:

Allow $ as subfield indicator (#355)
Dot-Wildcard in tag matchers (#356)
Extend parser rule SubfieldCodes (#357)

nichtich commented 2 years ago

What's your opinion on subfield without prefix (e.g. 003@0 for 003@$0)? This is lazy syntax supported in PICA::Data, but we should not encourage it for readability. Nevertheless the parser could allow for it.

Adding wildcard character . in occurrence (e.g. 028B/0.) is less relevant. I would make it a deprecated PICA::Data extension and not include it in the final specification of PICA Path expression syntax.

See current state of PICA Path expression syntax specification at http://format.gbv.de/query/picapath

nwagner84 commented 2 years ago

> What's your opinion on subfield without prefix (e.g. 003@0 for 003@$0)? This is lazy syntax supported in PICA::Data, but we should not encourage it for readability. Nevertheless the parser could allow for it.

I am still not convinced that the "lazy syntax" is a good idea, but I'm going to discuss this with my colleagues in the upcoming weeks.

nichtich commented 2 years ago

I fully agree that using the dot as subfield indicator is better than no character (e.g. 003@.0 instead of 003@0), and I would not allow the lazy syntax if building from scratch. The argument for applying the robustness principle here is compatibility with an existing syntax. Anyway, my main motivation is to get a final specification of PICA Path syntax.

Do you think the [...] syntax for tag ranges (e.g. 012[XY]) should also become part of the specification?

nwagner84 commented 2 years ago

I'm not happy doing this, but the next pica version will support the lazy syntax (see #362).

Tag ranges should not become part of the specification.

nwagner84 commented 2 years ago

Closed because item 11 is implemented (see #362) and wildcard characters in occurrences (item 6) won't be implemented.

deutsche-nationalbibliothek / pica-rs

Support PICA Path expression syntax #346