Closed: nichtich closed this issue 2 years ago
Note that cleanup can produce additional duplicate notations, so the order of processing is relevant.
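A minimal sketch of why the order matters, assuming a single hypothetical cleanup rule (whitespace normalization) and made-up sample values; neither function is part of the existing scripts:

```python
def normalize(notation: str) -> str:
    """Hypothetical cleanup rule: trim and collapse runs of whitespace."""
    return " ".join(notation.split())


def clean_then_dedupe(notations):
    """Clean first, then drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(normalize(n) for n in notations))


def dedupe_then_clean(notations):
    """Deduplicate raw values first; cleanup afterwards can reintroduce duplicates."""
    return [normalize(n) for n in dict.fromkeys(notations)]


# Made-up sample values that differ only in whitespace
raw = ["NQ 1234", "NQ  1234", " NQ 1234 "]

print(clean_then_dedupe(raw))  # ['NQ 1234']
print(dedupe_then_clean(raw))  # ['NQ 1234', 'NQ 1234', 'NQ 1234']
```

Deduplicating only after all cleanup rules have run avoids the second outcome.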
Cleanup may also include looking up whether a notation actually exists, e.g. not every notation matching the pattern `^[0-9][0-9]\.[0-9][0-9]$` is a valid BK notation. As each vocabulary has its own content and rules, it is best to create a service for each vocabulary.
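The difference between a syntactic check and an existence check could look roughly like this. It is only a sketch: `check_notation` and the `known` set are hypothetical, and the sample notations are placeholders, not taken from the actual BK.

```python
import re

# Syntactic pattern for BK-like notations (two digits, a dot, two digits)
BK_PATTERN = re.compile(r"^[0-9][0-9]\.[0-9][0-9]$")


def check_notation(notation, pattern, known):
    """Classify a notation as 'invalid', 'unknown' or 'ok'.

    pattern: compiled regular expression for the vocabulary's syntax
    known:   set of notations that exist in the vocabulary (assumed to be
             provided by a per-vocabulary service or a local dump)
    """
    if not pattern.fullmatch(notation):
        return "invalid"   # fails the syntax check
    if notation not in known:
        return "unknown"   # well-formed but not in the vocabulary
    return "ok"


# Placeholder set, NOT real BK content
known_bk = {"11.11", "22.22"}

for n in ["11.11", "99.99", "1111"]:
    print(n, check_notation(n, BK_PATTERN, known_bk))
```

Keeping the pattern and the set of known notations as parameters lets each vocabulary supply its own rules.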
Notations given in PICA contain errors, so it is better to filter out syntactically invalid notations. Keeping them in the full TSV files (e.g. `rvk.tsv`) is useful, but further processing should remove them:

- For the vocabularies in `classifications.csv`, extract `notationPattern`, or directly add `notationPattern` to `classifications.csv`
- Use `classification-subjects.sh` to filter out invalid notations

We may add additional rules for cleanup, e.g. remove `/` in DDC notations, normalize whitespace...
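As a rough sketch of such filtering, assuming the patterns have already been collected per vocabulary: the `PATTERNS` table, the simplified DDC pattern and both function names are illustrative assumptions, not the contents of `classifications.csv` or `classification-subjects.sh`.

```python
import re

# Hypothetical, simplified patterns; real ones would come from notationPattern
PATTERNS = {
    "ddc": re.compile(r"[0-9]{3}(\.[0-9]+)?"),
    "bk": re.compile(r"[0-9]{2}\.[0-9]{2}"),
}


def cleanup(vocabulary, notation):
    """Apply cleanup rules: normalize whitespace, drop '/' in DDC notations."""
    notation = " ".join(notation.split())
    if vocabulary == "ddc":
        notation = notation.replace("/", "")
    return notation


def filter_valid(vocabulary, notations):
    """Return cleaned notations that fully match the vocabulary's pattern."""
    pattern = PATTERNS[vocabulary]
    cleaned = (cleanup(vocabulary, n) for n in notations)
    return [n for n in cleaned if pattern.fullmatch(n)]


print(filter_valid("ddc", ["616.85/2", " 616.852", "61x"]))
# ['616.852', '616.852']
```

Both valid inputs collapse to the same notation after cleanup, so a deduplication step would still need to follow.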