Closed aindlq closed 6 months ago
The normalization script in ETL-Controller tries to clean up the provenance field (561) with a heuristic that was too eager to remove text in parentheses (): when the closing bracket was at the end of a text node, it removed the whole text node.
E.g. for the record
https://library.frick.org/discovery/fulldisplay?docid=alma991006806479707141&context=L&vid=01NYA_INST:Frick&lang=en&search_scope=Frick&adaptor=Local%20Search%20Engine&tab=SearchScopes&query=any,contains,b12972575&offset=0
it tries to remove (d) from "(d) William Loring Andrews, New York, the husband of the subject (died 1920);"
or (d, e and f) from "(d, e and f) given by her to the Metropolitan Museum of Art, New York, in 1940 (40.144)."
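To make the failure mode concrete, here is a minimal Python sketch. It is not the actual ETL-Controller code (the normalizer itself lives in a Java project); `eager_cleanup` is a hypothetical stand-in for a greedy "remove anything in parentheses" heuristic, and `split_marker` is one possible way to separate the leading marker without dropping any text.

```python
import re

# Hypothetical stand-in for the old heuristic: a greedy match that removes
# everything from the first "(" to the last ")" in the text node.
EAGER_PARENS = re.compile(r"\(.*\)")

# Leading cross-reference marker such as "(d)" or "(d, e and f)".
LEADING_MARKER = re.compile(
    r"^\((?P<refs>[a-z](?:(?:,\s*|\s+and\s+)[a-z])*)\)\s*"
)


def eager_cleanup(text: str) -> str:
    """Illustrates the bug: almost the whole statement disappears."""
    return EAGER_PARENS.sub("", text).strip()


def split_marker(text: str):
    """Return (reference letters, statement) without dropping any text."""
    match = LEADING_MARKER.match(text)
    if match is None:
        return [], text
    letters = re.split(r",\s*|\s+and\s+", match.group("refs"))
    return letters, text[match.end():]
```

With the first statement from the record above, `eager_cleanup` leaves only the trailing `;`, while `split_marker` returns the letter `d` together with the full statement, including the legitimate "(died 1920)".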
Apparently these letters are references to a bibliography specified in field 556, e.g.:
556 ##$a(d) New York, Metropolitan Museum of Art. American Paintings: A Catalogue of the Collection of the Metropolitan Museum of Art, Volume 2. Comps. Stuart P. Feld, et al. 1965, pp.142-143.
556 ##$a(e) Information from Frick Art Reference Library Photoarchive.
556 ##$a(f) Metropolitan Museum of Art, New York, Website, February 2015.
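The cross-reference between 561 and 556 could be resolved roughly as follows. This is only a sketch under assumptions: the function names are hypothetical, the inputs are the plain `$a` subfield values (without the `556 ##$a` prefix), and the markers are assumed to be single lowercase letters joined by commas or "and", as in the examples above.

```python
import re

# A 556 note of the form "(d) citation text".
NOTE_556 = re.compile(r"^\((?P<key>[a-z])\)\s*(?P<citation>.+)$", re.S)

# A leading 561 marker such as "(d)" or "(d, e and f)".
MARKER_561 = re.compile(
    r"^\((?P<refs>[a-z](?:(?:,\s*|\s+and\s+)[a-z])*)\)\s*"
)


def index_bibliography(notes):
    """Map each reference letter to its citation text from the 556 notes."""
    index = {}
    for note in notes:
        match = NOTE_556.match(note.strip())
        if match:
            index[match.group("key")] = match.group("citation")
    return index


def resolve_references(statement, index):
    """Return (statement without marker, citations the marker points to)."""
    match = MARKER_561.match(statement)
    if match is None:
        return statement, []
    keys = re.split(r",\s*|\s+and\s+", match.group("refs"))
    # Letters with no matching 556 note are skipped rather than guessed.
    citations = [index[key] for key in keys if key in index]
    return statement[match.end():], citations
```

Keeping the letters as lookup keys, instead of stripping them, is what would let a later mapping step link each provenance statement to its source.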
From a quick look, I think this cross-reference format is common practice in the Frick data.
It is worth keeping in mind that the current RDF representation differs from the way it was initially modeled: currently it is just crm:P67i_is_referred_to_by, whereas the original modeling preserved the reference to the bibliography.
For future work, such statements can probably be represented with CRMInf I10 Provenance Statement.
Interestingly, the referenced bibliography is actually available online: https://www.metmuseum.org/art/metpublications/American_Paintings_in_The_Metropolitan_Museum_of_Art_Vol_2_A_Catalogue_of_Works_by_Artists_Born_be?Tag=Church,%20Frederic%20Edwin%20(American,%201826%E2%80%931900)&title=&author=&pt=&tc=&dept=&fmt=
The initial bug was fixed as part of the normalization code refactoring. Unit tests were also added, see https://github.com/ArtResearch/ETL-POC/blob/main/XMLNormalizer/src/test/java/net/artresearch/normalizers/FrickNormalizerTest.java
Created a new issue, #524, to track the lost bibliography link for provenance data.
There are many
gr.forth.ics.isl.x3ml.X3MLEngine$X3MLException: Empty result for arg ....
exceptions in the 3M output for Frick data. It looks like XML normalization removes data that the mappings expect to be there. E.g. in the
.b11259267.xml
record:
after normalization becomes:
This indicates that some data is lost at this step. This needs to be investigated.
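One way to investigate could be to diff each record before and after normalization and list the subfields that had text but lost it. The sketch below is an assumption-laden illustration, not part of the existing pipeline: the function names are hypothetical, and it assumes MARCXML-style `datafield`/`subfield` elements with `tag` and `code` attributes, with or without a namespace.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def populated_subfields(xml_text):
    """Count non-empty subfields per (datafield tag, subfield code)."""
    counts = Counter()
    for field in ET.fromstring(xml_text).iter():
        if _local(field.tag) != "datafield":
            continue
        for sub in field:
            if _local(sub.tag) == "subfield" and (sub.text or "").strip():
                counts[(field.get("tag"), sub.get("code"))] += 1
    return counts


def lost_subfields(original_xml, normalized_xml):
    """Subfields that had text before normalization but not afterwards."""
    before = populated_subfields(original_xml)
    after = populated_subfields(normalized_xml)
    # Counter subtraction keeps only the keys whose count dropped.
    return before - after
```

Comparing by (tag, code) rather than by exact text means legitimate rewrites of a value are not reported; only subfields that became empty or disappeared show up in the result.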