NatLibFi / bib-rdf-pipeline

Scripts and configuration for converting MARC bibliographic records into RDF
Creative Commons Zero v1.0 Universal

Many series merged into a single entity #79

Closed. osma closed this issue 6 years ago

osma commented 6 years ago

The series W00017282900 appears to have been incorrectly merged from many separate series. Similar to #76 and/or #70.

osma commented 6 years ago

Possibly related to empty ISSN keys #80

osma commented 6 years ago

This would probably need to be plotted with something like Gephi to find out which work key (or keys) is pulling all of these together.

osma commented 6 years ago

No single key is the culprit, though there are problematic ones, such as julkaisut, which should probably be blacklisted. The issue seems to be similar to #70: when a record has multiple series statements because it belongs to several series, the keys somehow get mixed up, so that title-based and ISSN-based keys are incorrectly coupled. For example, this record seems to cause trouble:

005936207 4901  L $$aAmos Andersonin taidemuseon julkaisuja. Uusi sarja,$$x1795-9683 ;$$vnro 79
005936207 4901  L $$aSuomalaisen Kirjallisuuden Seuran toimituksia,$$x0355-1768 ;$$v1338
005936207 4901  L $$aAmos Andersonin taidemuseon julkaisuja. Uusi sarja,$$x0788-0138 ;$$vnro 79
005936207 830 0 L $$aAmos Andersonin taidemuseon julkaisuja.$$pUusi sarja,$$x0788-0138 ;$$vnro 77.
005936207 830 0 L $$aAmos Andersonin taidemuseon julkaisuja.$$pUusi sarja,$$x1795-9683 ;$$vnro 79.
005936207 830 0 L $$aSuomalaisen Kirjallisuuden Seuran toimituksia,$$x0355-1768 ;$$v1338.
005936207 830 0 L $$aAmos Andersonin taidemuseon julkaisuja.$$pUusi sarja,$$x0788-0138 ;$$vnro 79.

It will get these series keys:

<http://urn.fi/URN:NBN:fi:bib:me:W00593620702> "amos andersonin taidemuseon julkaisuja uusi sarja" .
<http://urn.fi/URN:NBN:fi:bib:me:W00593620702> "issn:1795-9683" .
<http://urn.fi/URN:NBN:fi:bib:me:W00593620703> "amos andersonin taidemuseon julkaisuja uusi sarja" .
<http://urn.fi/URN:NBN:fi:bib:me:W00593620703> "issn:0355-1768" .
<http://urn.fi/URN:NBN:fi:bib:me:W00593620704> "issn:0788-0138" .
<http://urn.fi/URN:NBN:fi:bib:me:W00593620704> "suomalaisen kirjallisuuden seuran toimituksia" .
<http://urn.fi/URN:NBN:fi:bib:me:W00593620705> "amos andersonin taidemuseon julkaisuja uusi sarja" .
<http://urn.fi/URN:NBN:fi:bib:me:W00593620705> "issn:0788-0138" .

Of these, at least W00593620703 is problematic: the ISSN and title don't match. The ISSN 0355-1768 belongs to "Suomalaisen Kirjallisuuden Seuran toimituksia", not "Amos Andersonin taidemuseon julkaisuja", which has ISSN 0788-0138.
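To illustrate why a mismatched title/ISSN pair matters, here is a minimal sketch that groups the four series works above by shared keys. It assumes that any two works sharing a key end up treated as the same entity, which is a simplification and not necessarily how the pipeline's merging step is implemented. The function name `merge_by_shared_keys` is hypothetical; the pairs are copied from the key listing above.

```python
from collections import defaultdict

# (work identifier, key) pairs copied from the series key listing above
pairs = [
    ("W00593620702", "amos andersonin taidemuseon julkaisuja uusi sarja"),
    ("W00593620702", "issn:1795-9683"),
    ("W00593620703", "amos andersonin taidemuseon julkaisuja uusi sarja"),
    ("W00593620703", "issn:0355-1768"),
    ("W00593620704", "issn:0788-0138"),
    ("W00593620704", "suomalaisen kirjallisuuden seuran toimituksia"),
    ("W00593620705", "amos andersonin taidemuseon julkaisuja uusi sarja"),
    ("W00593620705", "issn:0788-0138"),
]


def merge_by_shared_keys(pairs):
    """Group works into clusters; works sharing any key land in one cluster.

    Hypothetical helper: assumes merging is transitive over shared keys.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_work_for_key = {}
    for work, key in pairs:
        find(work)  # register the work even if it shares no key
        if key in first_work_for_key:
            union(first_work_for_key[key], work)
        else:
            first_work_for_key[key] = work

    clusters = defaultdict(set)
    for work, _ in pairs:
        clusters[find(work)].add(work)
    return sorted(sorted(c) for c in clusters.values())


clusters = merge_by_shared_keys(pairs)
for cluster in clusters:
    print(cluster)
```

Under that assumption all four works collapse into a single cluster: the shared title key links 02, 03 and 05, and the ISSN 0788-0138 links 05 to 04, which carries the "suomalaisen kirjallisuuden seuran toimituksia" title. Any other record using one of these keys would then be pulled into the same entity.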

osma commented 6 years ago

Opened https://github.com/lcnetdev/marc2bibframe2/issues/71. I think the way marc2bibframe2 couples information from 490 fields with 830 fields is part of the problem, though in the case of the above record there are also problems with the data itself (e.g. wrong ISSNs and volume numbers).

osma commented 6 years ago

One suggested workaround is to remove all 490 fields from a record during preprocessing if the record also has an 830 field. This way the values from 490 and 830 fields at least wouldn't be incorrectly coupled, even if it means losing some information. ISSNs should be more likely to appear in 830 fields than in 490 fields, so most of them would be retained.
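A minimal sketch of that preprocessing rule, assuming a record is represented as a plain list of `(tag, value)` tuples. Both the representation and the function name `drop_490_if_830` are hypothetical and only illustrate the idea; this is not the code from the actual fix.

```python
def drop_490_if_830(fields):
    """Return the record's fields without 490s when any 830 field exists.

    `fields` is a list of (tag, value) tuples (a hypothetical representation).
    Records without an 830 field are returned unchanged, so their 490 series
    statements are kept.
    """
    has_830 = any(tag == "830" for tag, _ in fields)
    if not has_830:
        return list(fields)
    return [(tag, value) for tag, value in fields if tag != "490"]


# Shortened version of the example record above
record = [
    ("490", "$$aAmos Andersonin taidemuseon julkaisuja. Uusi sarja,$$x1795-9683 ;$$vnro 79"),
    ("490", "$$aSuomalaisen Kirjallisuuden Seuran toimituksia,$$x0355-1768 ;$$v1338"),
    ("830", "$$aAmos Andersonin taidemuseon julkaisuja.$$pUusi sarja,$$x1795-9683 ;$$vnro 79."),
    ("830", "$$aSuomalaisen Kirjallisuuden Seuran toimituksia,$$x0355-1768 ;$$v1338."),
]

cleaned = drop_490_if_830(record)
for tag, value in cleaned:
    print(tag, value)
```

With only the 830 fields left, each title stays next to the ISSN recorded in the same field, so there is nothing left to couple across 490 and 830.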

osma commented 6 years ago

Fixed by ff987f127e01360579b8738d86b7fce3a16a958a