Closed nichtich closed 2 years ago
See https://github.com/gbv/PICA-Data/issues/109#issuecomment-981562847 for a formal grammar of PICA Path expression syntax as implemented in PICA::Data (dev branch).
Updates:
- `$` as subfield indicator (#355)
- `SubfieldCodes` (#357)
What's your opinion on subfield without prefix (e.g. `003@0` for `003@$0`)? This is lazy syntax supported in PICA::Data, but we should not encourage it for readability. Nevertheless the parser could allow for it.
Adding wildcard character `.` in occurrence (e.g. `028B/0.`) is less relevant. I would make it a deprecated PICA::Data extension and not include it in the final specification of PICA Path expression syntax.
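To make the occurrence wildcard concrete, here is a minimal Python sketch of how a pattern such as `0.` could be matched against an occurrence. The function name `occurrence_matches` is hypothetical, and the rule that `.` stands for exactly one digit is an assumption; this is not taken from PICA::Data or pica-rs.

```python
import re


def occurrence_matches(pattern: str, occurrence: str) -> bool:
    """Return True if `occurrence` matches `pattern`.

    Assumption: '.' in the pattern stands for exactly one digit.
    """
    if not re.fullmatch(r"[0-9.]+", pattern):
        raise ValueError(f"unexpected occurrence pattern: {pattern!r}")
    # Digits stay literal; each '.' is widened to the class of all digits.
    regex = pattern.replace(".", "[0-9]")
    return re.fullmatch(regex, occurrence) is not None


print(occurrence_matches("0.", "01"))  # True
print(occurrence_matches("0.", "10"))  # False
```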
See current state of PICA Path expression syntax specification at http://format.gbv.de/query/picapath
> What's your opinion on subfield without prefix (e.g. `003@0` for `003@$0`)? This is lazy syntax supported in PICA::Data, but we should not encourage it for readability. Nevertheless the parser could allow for it.
I am still not convinced that the "lazy syntax" is a good idea, but I'm going to discuss this with my colleagues in the upcoming weeks.
I fully agree that using the dot as subfield indicator is better than no character (e.g. `003@.0` instead of `003@0`), and I would not allow the lazy syntax if building from scratch. The argument for applying the robustness principle at this point is compatibility with an existing syntax. Anyway, my main motivation is to get a final specification of PICA Path syntax.
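One way to read the robustness argument is that a parser accepts the lazy form on input but always emits one explicit form. The following Python sketch illustrates that idea under stated assumptions: the `PATH` regex, the `normalize` helper, and the choice of `$` as the canonical indicator are all hypothetical, tag wildcards are left out, and an occurrence is taken to be any run of digits after `/`.

```python
import re

# Assumed shape: tag (3 digits + letter or '@'), optional '/occurrence',
# then optionally a subfield part with or without an indicator.
PATH = re.compile(
    r"^(?P<tag>[0-9]{3}[A-Z@])"
    r"(?P<occ>/[0-9]+)?"
    r"(?:(?P<ind>[$.])?(?P<codes>[A-Za-z0-9]+))?$"
)


def normalize(path: str) -> str:
    """Rewrite lazy or dot-prefixed subfields to the explicit '$' form."""
    m = PATH.match(path)
    if m is None:
        raise ValueError(f"not a recognised path: {path!r}")
    out = m["tag"] + (m["occ"] or "")
    if m["codes"]:
        # Whatever indicator was given (or none), emit '$'.
        out += "$" + m["codes"]
    return out


print(normalize("003@0"))   # 003@$0
print(normalize("003@.0"))  # 003@$0
```

Note that with this sketch an input such as `022A/010` is read as occurrence `010` with no subfield, because digits after `/` are consumed greedily. That kind of ambiguity is one readability cost of the lazy form.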
Do you think the `[...]` syntax for tag ranges (e.g. `012[XY]`) should also become part of the specification?
I'm not happy doing this, but the next pica version will support the lazy syntax (see #362).
Tag ranges should not become part of the specification.
Closed, because item 11 is implemented (see #362) and wildcard characters in occurrence (item 6) won't be implemented.
Using pica-rs together with PICA::Data (Catmandu and/or picadata) is confusing because both use slightly different syntax to reference fields and subfields. Following this analysis I tried pica-rs 0.7.0 and checked for compatibility with PICA Path expression syntax (not including deprecated features):
1. Tag (e.g. `003@`)
2. Wildcard character `.` in tag (e.g. `0...`)
3. Occurrence after `/` (e.g. `022A/01`)
4. (item text not preserved)
5. (item text not preserved)
6. Wildcard character `.` in occurrence (e.g. `028B/0.`)
7. Occurrence range (e.g. `028B/01-02`). Pattern of a range is `[0-9]+-[0-9]+`.
8. Wildcard occurrence (e.g. `028B/*`): not implemented in PICA::Data yet, will be part of next release.
9. `.` as subfield indicator (e.g. `003@.0`): not implemented in PICA::Data yet, will be part of next release.
10. `$` as subfield indicator (e.g. `003@$0`)
11. Subfield without prefix (e.g. `003@0` for `003@$0`)
12. Multiple subfield codes (e.g. `022A$ap` or `022A.ap`)

pica-rs adds some more features such as wildcards with `[...]` in tags (e.g. `02[89]A`) that may also become part of language-independent PICA Path expression syntax, but first we need a common set of minimal language features. I could argue about 6 and 11, but 2, 10 and 12 are essential. Here is an example of an expression that uses these syntax features and its corresponding expression supported by pica-rs:

This is not about which syntax is better, but about getting to a common PICA Path expression syntax while breaking as few existing features as possible. So in short I propose the following extensions to the record matcher syntax:
- Wildcard character `.` in tag (e.g. `0...` for all fields on level 0) by extending parser rule `TagMatcherPattern` with dot for `Digit0`, `Digit1`, `Digit2`, `Digit3`
- `$` as subfield indicator alternatively to the dot (e.g. `002@$0` equivalent to `002@.0`) by extending parser rule `DotExpr`
- Extend parser rule `SubfieldCodes` to `SubfieldCode | "[" SubfieldCode+ "]" | SubfieldCode+ | "*"`
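
The three proposed extensions can be sketched together as a single regular expression. This Python sketch is only an illustration, not the pica-rs parser: the names `PATH` and `parse_path` are hypothetical, the character classes for the four tag positions and for a subfield code are assumptions, and the occurrence is simplified to a run of digits.

```python
import re

# Assumed character class for a single subfield code.
SUBFIELD_CODE = r"[A-Za-z0-9]"

PATH = re.compile(
    # Tag: each of the four positions also accepts the wildcard '.'.
    r"(?P<tag>[012.][0-9.][0-9.][A-Z@.])"
    # Occurrence, simplified here to digits only.
    r"(?:/(?P<occurrence>[0-9]+))?"
    # Subfield indicator: '$' or '.'.
    r"[$.]"
    # SubfieldCode | "[" SubfieldCode+ "]" | SubfieldCode+ | "*"
    rf"(?P<codes>\[{SUBFIELD_CODE}+\]|{SUBFIELD_CODE}+|\*)"
)


def parse_path(path: str) -> dict:
    """Split a path into tag, occurrence and subfield codes."""
    m = PATH.fullmatch(path)
    if m is None:
        raise ValueError(f"not a recognised path: {path!r}")
    # '[ap]' and 'ap' are treated as the same set of codes.
    codes = m["codes"].strip("[]")
    return {"tag": m["tag"], "occurrence": m["occurrence"], "codes": codes}


print(parse_path("002@$0") == parse_path("002@.0"))  # True
print(parse_path("0...$a")["tag"])                    # 0...
print(parse_path("022A.[ap]")["codes"])               # ap
```

Because a tag always has exactly four characters in this sketch, a `.` in the tag and a `.` used as subfield indicator cannot be confused with each other.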