gbv / coli-ana

API to analyze DDC numbers
https://coli-conc.gbv.de/coli-ana/app/
MIT License

Fix base number detection #34

Closed nichtich closed 2 years ago

nichtich commented 3 years ago

Detection of base number in pica.js is still broken. See for instance 331.89041026. Base number should be 331.89041, not 331.

stefandesu commented 3 years ago

The test for base number detection is failing:

  1) picaFromDDC
       should detect base number:

      AssertionError: expected '045H/10 $eDDC23ger$a282.5$c282.$g5$Acoli-ana' to equal '045H/10 $eDDC23ger$a282.5$c282$g5$Acoli-ana'
      + expected - actual

      -045H/10 $eDDC23ger$a282.5$c282.$g5$Acoli-ana
      +045H/10 $eDDC23ger$a282.5$c282$g5$Acoli-ana
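
The actual and expected strings differ only in the dot after 282 in $c. As a minimal sketch of a guard against that, the snippet below strips a dangling decimal point from a notation. The helper name `trimNotation` is hypothetical and not taken from pica.js, and the cause of the extra dot is not stated in this thread.

```javascript
// Hypothetical helper, not taken from pica.js: remove a dangling decimal point.
const trimNotation = (notation) => notation.replace(/\.$/, "")

console.log(trimNotation("282.")) // 282
console.log(trimNotation("282.5")) // 282.5
```
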
nichtich commented 2 years ago

I'm still not sure how to define the base number. With WebDewey, the same number 709.044 can be built in multiple ways:

The DDC record for 700.9 includes the decomposition 7 + T1--09. Starting to build a number with 700.9 (or below) in WebDewey will fall back to 7. Therefore I think the base number is the shortest number that can be used to start a composition.

The coli-ana interface uses 700.904 as the base number because it is included in the main schedule. The PICA field uses 7.

nichtich commented 2 years ago

Another example: Decomposition as created by DNB (this record) with base number 700.904:

045H/10 $eDDC23ger$a700.90440747471$c700.904$f09044$f0901-0905:074$g7471$ADE-601

Decomposition by coli-ana API (base number 7):

045H/20 $eDDC23ger$a700.90440747471$c7$f09044$f074$g7471$Acoli-ana
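
To compare the two decompositions programmatically, the lines above can be split into subfields. This is only a sketch: `parsePicaLine` and `baseNumber` are hypothetical helpers, not part of coli-ana, and they assume that subfield values contain no escaped `$` and that $c holds the base number, as in the examples in this thread.

```javascript
// Hypothetical helper: split one PICA line into tag, occurrence and subfields.
// Assumption: no escaped "$" inside subfield values.
function parsePicaLine(line) {
  const match = line.match(/^([0-9A-Z@]{4})(?:\/(\d+))?\s+(.*)$/)
  if (!match) throw new Error(`Not a PICA field: ${line}`)
  const [, tag, occurrence, rest] = match
  const subfields = rest
    .split("$")
    .slice(1)
    .map((s) => ({ code: s[0], value: s.slice(1) }))
  return { tag, occurrence, subfields }
}

// First value of subfield $c, which holds the base number in the examples above.
function baseNumber(line) {
  const c = parsePicaLine(line).subfields.find((sf) => sf.code === "c")
  return c ? c.value : undefined
}

const dnb = "045H/10 $eDDC23ger$a700.90440747471$c700.904$f09044$f0901-0905:074$g7471$ADE-601"
const coliAna = "045H/20 $eDDC23ger$a700.90440747471$c7$f09044$f074$g7471$Acoli-ana"

console.log(baseNumber(dnb)) // 700.904
console.log(baseNumber(coliAna)) // 7
```
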
nichtich commented 2 years ago

Two possibilities to determine the base number:

  1. Base number is the longest possible DDC notation that does not share positioned digits with another notation (currently active, see implementation)

  2. Base number is the longest possible DDC notation in the hierarchy (currently the first notation shown in bold), based on plain string comparison

Given this example:

7------- Künste und Unterhaltung (700)
70------ Künste (700)
700----- Künste (700)
700.9--- Standardschlüssel für die Künste (700.1-700.9)
700.9--- Geschichte, geografische Behandlung, Biografien der Künste #dno_syn# (700.9)
700.904- Künste--20. Jahrhundert (700.904)
-0------ Facettenindikator (0)
--0----- Hilfstafel 1. Standardschlüssel (T1--0)
--0.9--- Geschichte, geografische Behandlung, Biografien (T1--09)
--0.904- Zeitabschnitte (T1--0901-0905)
--0.904- *20. Jahrhundert, 1900–1999 (T1--0904)
--0.9044 *1940–1949 (T1--09044) 
  1. The base number is 7 because the next digit is also used in -0------
  2. The base number is 700.904 because all numbers before it (7...700.9) are substrings of it
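
Both possibilities can be sketched against the positioned notations of this example. The function names are hypothetical and this is not the actual coli-ana implementation. It assumes that elements starting at the first position form the main hierarchy and that elements with leading dashes are added components.

```javascript
// Positioned notations from the example above, labels omitted.
const example = [
  "7-------", "70------", "700-----", "700.9---", "700.9---", "700.904-",
  "-0------", "--0-----", "--0.9---", "--0.904-", "--0.904-", "--0.9044",
]

// Drop the padding dashes on the right, e.g. "700.9---" -> "700.9".
const strip = (s) => s.replace(/-+$/, "")

// Indices at which a positioned notation carries a digit.
const digitPositions = (s) => [...s].flatMap((ch, i) => (/\d/.test(ch) ? [i] : []))

// Assumption: elements starting at position 0 belong to the main hierarchy,
// elements with leading dashes are added components.
const isMain = (s) => !s.startsWith("-")

// Possibility 1: longest main notation whose digit positions are not
// used by any added component.
function baseByPosition(elements) {
  const used = new Set(elements.filter((s) => !isMain(s)).flatMap(digitPositions))
  const free = elements
    .filter(isMain)
    .filter((s) => digitPositions(s).every((i) => !used.has(i)))
    .map(strip)
  return free.reduce((a, b) => (a === undefined || b.length > a.length ? b : a), undefined)
}

// Possibility 2: last notation of the leading chain in which every
// notation starts with the one before it (plain string comparison).
function baseByPrefix(elements) {
  let base
  for (const s of elements) {
    const notation = strip(s)
    if (!isMain(s) || (base !== undefined && !notation.startsWith(base))) break
    base = notation
  }
  return base
}

console.log(baseByPosition(example)) // 7
console.log(baseByPrefix(example)) // 700.904
```

The first function stops at 7 because position 1 is already claimed by -0------, while the second only looks at the leading chain and therefore reaches 700.904.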