ArtResearch / artresearch.net

ArtResearch ResearchSpace application hosted at https://artresearch.net

Frick. Some data is lost at XML normalization step. #523

Closed by aindlq 6 months ago

aindlq commented 6 months ago

There are many gr.forth.ics.isl.x3ml.X3MLEngine$X3MLException: Empty result for arg .... exceptions in the 3M output for Frick data. It looks like XML normalization removes data that the mappings expect to be present.

E.g. in the .b11259267.xml record:

<marc:datafield tag="561" ind1=" " ind2=" ">
  <marc:subfield code="a">Uffizi, Florence (1471).</marc:subfield>
</marc:datafield>

after normalization becomes:

  <marc:datafield ind1=" " ind2=" " index="100" tag="561">
    <marc:subfield code="a"/>
  </marc:datafield>

This indicates that some data is lost at this step and needs to be investigated.
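A minimal sketch of how such losses could be detected automatically by comparing a record before and after normalization. It assumes the records use the standard MARC21 slim namespace (the snippets above do not show the namespace declaration) and wraps the datafield in a `marc:record` element for illustration. The function names `subfield_texts` and `lost_subfields` are hypothetical and not part of the ETL code.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard MARC21 slim namespace; the issue does not show the declaration.
MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"marc": MARC_NS}


def subfield_texts(xml_text):
    """Map (tag, code, occurrence) -> stripped text for every subfield in a MARCXML document."""
    root = ET.fromstring(xml_text)
    texts = {}
    counters = {}
    for datafield in root.iter(f"{{{MARC_NS}}}datafield"):
        tag = datafield.get("tag")
        for subfield in datafield.findall("marc:subfield", NS):
            code = subfield.get("code")
            n = counters.get((tag, code), 0)
            counters[(tag, code)] = n + 1
            texts[(tag, code, n)] = (subfield.text or "").strip()
    return texts


def lost_subfields(original_xml, normalized_xml):
    """Return keys whose text was non-empty before normalization and empty or missing after."""
    before = subfield_texts(original_xml)
    after = subfield_texts(normalized_xml)
    return [key for key, text in before.items() if text and not after.get(key)]


ORIGINAL = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:datafield tag="561" ind1=" " ind2=" ">
  <marc:subfield code="a">Uffizi, Florence (1471).</marc:subfield>
</marc:datafield>
</marc:record>"""

NORMALIZED = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield ind1=" " ind2=" " index="100" tag="561">
    <marc:subfield code="a"/>
  </marc:datafield>
</marc:record>"""

print(lost_subfields(ORIGINAL, NORMALIZED))  # [('561', 'a', 0)]
```

Matching subfields by tag, code and occurrence order is a simplification; it would misreport losses if normalization reorders or merges fields.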

aindlq commented 6 months ago

The normalization script from ETL-Controller tries to clean up the provenance field (561) with a heuristic that was too eager to remove text in parentheses: when the closing bracket was at the end of the text node, it removed the whole text node.

E.g. for the record https://library.frick.org/discovery/fulldisplay?docid=alma991006806479707141&context=L&vid=01NYA_INST:Frick&lang=en&search_scope=Frick&adaptor=Local%20Search%20Engine&tab=SearchScopes&query=any,contains,b12972575&offset=0 it tries to remove (d) from "(d) William Loring Andrews, New York, the husband of the subject (died 1920);" or (d, e and f) from "(d, e and f) given by her to the Metropolitan Museum of Art, New York, in 1940 (40.144)."
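To illustrate the failure mode, here is a sketch contrasting a greedy pattern with a narrower one. Both patterns are hypothetical: the actual regular expression used in ETL-Controller is not shown in this issue, so `GREEDY` only reproduces the described symptom, and `LEADING_MARKER` is one possible way to restrict the cleanup to leading letter markers.

```python
import re

# Hypothetical over-eager pattern: matches from the start of the text up to the LAST ")",
# so a closing bracket near the end of the node swallows everything.
GREEDY = re.compile(r"^.*\)[\s.;]*")

# Narrower pattern: only a leading group of single lowercase letters,
# e.g. "(d)" or "(d, e and f)". Other parentheses are left untouched.
LEADING_MARKER = re.compile(r"^\s*\(([a-z](?:\s*(?:,\s*and|,|and)\s*[a-z])*)\)\s*")


def strip_greedy(text):
    """Reproduce the symptom: the whole text node can disappear."""
    return GREEDY.sub("", text)


def strip_leading_marker(text):
    """Remove only a leading letter marker; return the text unchanged if there is none."""
    return LEADING_MARKER.sub("", text)


print(repr(strip_greedy("Uffizi, Florence (1471).")))          # ''
print(repr(strip_leading_marker("Uffizi, Florence (1471).")))  # 'Uffizi, Florence (1471).'
```

The narrower pattern is deliberately conservative: anything that does not look exactly like a letter marker is kept as is.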

Apparently these letters are references to a bibliography specified in field 556, e.g.:

556 ##$a(d) New York, Metropolitan Museum of Art. American Paintings: A Catalogue of the Collection of the Metropolitan Museum of Art, Volume 2. Comps. Stuart P. Feld, et al. 1965, pp.142-143. 
556 ##$a(e) Information from Frick Art Reference Library Photoarchive. 
556 ##$a(f) Metropolitan Museum of Art, New York, Website, February 2015. 
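A rough sketch of how the letter markers could be kept as links between a 561 statement and its 556 entries instead of being discarded. This is an illustration under the assumption that markers always appear as a leading parenthesised group of single lowercase letters; `split_markers` and `resolve_provenance` are hypothetical names, not functions from the normalizer.

```python
import re

# Assumption: markers are a leading group of single lowercase letters, e.g. "(d)" or "(d, e and f)".
MARKER = re.compile(r"^\s*\(([a-z](?:\s*(?:,\s*and|,|and)\s*[a-z])*)\)\s*")


def split_markers(text):
    """Return (letters, remaining_text); letters is empty when there is no leading marker."""
    match = MARKER.match(text)
    if not match:
        return [], text.strip()
    # Standalone single letters only, so the word "and" is not picked up.
    letters = re.findall(r"\b[a-z]\b", match.group(1))
    return letters, text[match.end():].strip()


def resolve_provenance(provenance_561, bibliography_556):
    """Pair a 561 statement with the 556 citations its leading markers point to."""
    sources = {}
    for entry in bibliography_556:
        letters, citation = split_markers(entry)
        for letter in letters:
            sources[letter] = citation
    letters, statement = split_markers(provenance_561)
    # Markers without a matching 556 entry are skipped rather than raising.
    return statement, [sources[letter] for letter in letters if letter in sources]


bibliography = [
    "(e) Information from Frick Art Reference Library Photoarchive. ",
    "(f) Metropolitan Museum of Art, New York, Website, February 2015. ",
]
statement, citations = resolve_provenance(
    "(d, e and f) given by her to the Metropolitan Museum of Art, New York, in 1940 (40.144).",
    bibliography,
)
print(statement)
print(citations)
```

In this example only (e) and (f) are supplied, so the (d) marker resolves to nothing and is silently dropped; a real implementation would probably want to report unresolved markers.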

From a quick look, I think this cross-reference format is common practice in the Frick data. It is worth keeping in mind that the current RDF representation differs from the way it was initially modeled: currently it is just crm:P67i_is_referred_to_by, whereas the original modeling preserved the reference to the bibliography.

For future work, such statements can probably be represented with CRMInf I10 Provenance Statement.

Interestingly, the referenced bibliography is actually available online: https://www.metmuseum.org/art/metpublications/American_Paintings_in_The_Metropolitan_Museum_of_Art_Vol_2_A_Catalogue_of_Works_by_Artists_Born_be?Tag=Church,%20Frederic%20Edwin%20(American,%201826%E2%80%931900)&title=&author=&pt=&tc=&dept=&fmt=

aindlq commented 6 months ago

The initial bug was fixed as part of the normalization code refactoring, and unit tests were added; see https://github.com/ArtResearch/ETL-POC/blob/main/XMLNormalizer/src/test/java/net/artresearch/normalizers/FrickNormalizerTest.java

Created a new issue, #524, to track the lost bibliography link for provenance data.