Common syntax to reference PICA fields, independent from content

nichtich commented 3 years ago

Split from #248. The syntax to select and filter PICA records or record content should be the same for all tools, at least for the basic use cases. The most basic use case is to reference a list of PICA fields, independent from their content.

In PICA Path Expression (based on MARCSpec) the current syntax is

# tag                      # optional occurrence or occurrence range
([012.][0-9.][0-9.][A-Z@.])(\[([0-9.]{2,3}|[0-9]+-[0-9]+)\])?
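
To make the current syntax concrete, here is a minimal Python sketch that applies the expression above unchanged. The function name parse_current_field and the sample inputs are only illustrative and are not part of PICA Path or of any existing tool.

```python
import re

# Current PICA Path field reference, copied from the expression above:
# a four-character tag, optionally followed by an occurrence in brackets.
CURRENT_FIELD = re.compile(
    r"([012.][0-9.][0-9.][A-Z@.])(\[([0-9.]{2,3}|[0-9]+-[0-9]+)\])?"
)


def parse_current_field(expr):
    """Return (tag, occurrence) for a field reference, or None if invalid.

    The name of this helper is hypothetical; it only wraps the regex.
    """
    match = CURRENT_FIELD.fullmatch(expr)
    if match is None:
        return None
    # Group 1 is the tag, group 3 the occurrence without the brackets.
    return match.group(1), match.group(3)


if __name__ == "__main__":
    for expr in ["003@", "028B[01]", "209A[01-09]", "0..A", "028B/01"]:
        print(expr, "->", parse_current_field(expr))
```

Note that 028B/01 is rejected here, because the current syntax only accepts occurrences in brackets.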

Brackets ([...]) rather than / were used for occurrences because MARCSpec already uses / for substring ranges. Substring ranges are mainly relevant to fixed-width MARC fields (which have no occurrences or subfields), and PICA Plain uses / before occurrences anyway. So the common syntax can use / for occurrences, at the cost of partly breaking backwards compatibility in Catmandu::PICA.

Open issues:

Wildcard characters in tags (PICA Path supports .)
Wildcard character in occurrences (PICA Path supports .)
Lists of fields (not part of PICA Path yet)

PICA fields are often grouped in levels (first digit) and ranges (second and third digit); this is what wildcards in tags are mainly used for. Wildcard characters in occurrences are less relevant because they can mostly be replaced by ranges. The . clashes with its use as subfield indicator (alternative to $) in pica-rs, so we could instead introduce * at the end of a tag (e.g. 0* for level 0 or 001* for system tags on level 0). The syntax would then be (spaces added for readability):

(\* | [012] (\* | [0-9] (\* | ([0-9] ([A-Z@*]) ) ) ) )
(\/ ([0-9]{2,3} | [0-9]{1,3} - [0-9]{1,3} ) )?
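
As a rough sketch of how the proposed syntax could be checked, the Python snippet below compiles it in verbose mode so the spacing can be kept. The nesting is written with balanced non-capturing groups, which is an assumption about the intended grouping. The helpers parse_proposed_field and tag_matches are hypothetical names, and the prefix-matching semantics of a trailing * are inferred from the examples 0* and 001*.

```python
import re

# Proposed syntax: a tag that may be cut short by a trailing "*",
# optionally followed by "/" and an occurrence or occurrence range.
PROPOSED_FIELD = re.compile(
    r"""
    (?P<tag> \* | [012] (?: \* | [0-9] (?: \* | [0-9] [A-Z@*] ) ) )
    (?: / (?P<occ> [0-9]{2,3} | [0-9]{1,3} - [0-9]{1,3} ) )?
    """,
    re.VERBOSE,
)


def parse_proposed_field(expr):
    """Return (tag pattern, occurrence) or None if expr is not valid."""
    match = PROPOSED_FIELD.fullmatch(expr)
    if match is None:
        return None
    return match.group("tag"), match.group("occ")


def tag_matches(pattern, tag):
    """Check a concrete four-character tag against a tag pattern.

    Assumption: a trailing "*" means "any tag starting with this prefix".
    """
    if pattern.endswith("*"):
        return tag.startswith(pattern[:-1])
    return tag == pattern


if __name__ == "__main__":
    for expr in ["*", "0*", "001*", "028B/01", "028B/01-09", "0.1A"]:
        print(expr, "->", parse_proposed_field(expr))
    print(tag_matches("001*", "001A"), tag_matches("001*", "002@"))
```

With this reading, a wildcard can only appear as the last character of a tag, so 0*A is rejected while 0*, 00* and 001* are accepted.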

Alternatively keep the . as wildcard.

Lists of multiple fields could be separated by a comma (,), a pipe (|), or any whitespace character (?)
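
Since none of the candidate separators can occur inside a single field reference, a list could be split before the individual references are parsed. The following Python sketch assumes that all three separators are accepted and may be mixed freely, which the proposal leaves open; split_field_list is a hypothetical helper name.

```python
import re

# Assumption: commas, pipes and whitespace are all accepted as separators,
# and runs of them (such as ", ") count as a single separator.
SEPARATORS = re.compile(r"[,|\s]+")


def split_field_list(expr):
    """Split a list of field references into its individual references."""
    return [part for part in SEPARATORS.split(expr) if part]


if __name__ == "__main__":
    print(split_field_list("003@, 021A|028B/01 045*"))
```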

nwagner84 commented 2 years ago

I don't think it is necessary to have a unified syntax for selecting fields/subfields. pica-rs provides a first set of syntax rules to express selection and projection operations. At the moment these two basic operations are enough, but I already have ideas to extend or change this syntax (template expressions, aggregation functions, etc.). This is a pica-rs-specific feature which should not be unified across other tools.

nichtich commented 2 years ago

I fully agree; this was more of an idea, or a duplicate. With #346 we have a common subset to reference fields and subfields (and optionally character ranges within subfield values). I've just updated the specification with more explanation (in German); minor details can still be discussed. PICA Path covers referencing as the most important use case. Everything beyond that (conditions, aggregations, mappings...) depends on the particular tool.

deutsche-nationalbibliothek / pica-rs

Common syntax to reference PICA fields, independent from content #271