4Science / DSpace

This repository contains the 4Science optimized DSpace & DSpace-CRIS distribution.
https://wiki.lyrasis.org/display/DSPACECRIS/
BSD 3-Clause "New" or "Revised" License

Import of subject keywords from Scopus: SplitMetadataContributor does not work properly with SimpleXpathMetadatumContributor as innerContributor #338

Open olli-gold opened 1 year ago

olli-gold commented 1 year ago

I'm not sure whether this should be considered a bug or rather a feature request. I'll file it as a bug for now, though.

Describe the bug If I use the default configuration of DSpace-CRIS 2022.03.01 for the Scopus import, the subject keywords are imported as one single string in which the individual keywords are separated by a |.

While trying to solve that, we noticed problems with the SplitMetadataContributor in the Scopus import scenario (we discussed this on Slack recently). The goal is to split the Scopus subject keywords, which arrive as one string delimited by the pipe symbol |, into separate metadata values. In principle this could be done with the SplitMetadataContributor in scopus-integration.xml. Unfortunately, it does not work when a SimpleXpathMetadatumContributor is used as the innerContributor. bibtex-integration.xml contains a working example of the SplitMetadataContributor (https://github.com/4Science/DSpace/blob/dspace-cris-7/dspace/config/spring/api/bibtex-integration.xml#L43-L51), but that example uses a plain SimpleMetadataContributor as the innerContributor, which addresses the target field in a different way and, in contrast to the SimpleXpathMetadatumContributor, does not require an additional bean for the field mapping.

To Reproduce Steps to reproduce the behavior:

  1. Use the configuration snippet below in scopus-integration.xml
  2. Import a record with multiple subject keywords (for example Scopus ID 2-s2.0-18644372692)
  3. Nothing happens: the keywords from Scopus are not imported at all

Configuration snippet for scopus-integration.xml:

    <bean id="scopusAuthkeywordsContrib" class="org.dspace.importer.external.metadatamapping.contributor.SplitMetadataContributor">
        <constructor-arg name="innerContributor">
            <bean class="org.dspace.importer.external.metadatamapping.contributor.SimpleXpathMetadatumContributor">
                <property name="field" ref="scopus.authkeywords"/>
                <property name="query" value="ns:authkeywords"/>
                <property name="prefixToNamespaceMapping" ref="scopusNs"/>
            </bean>
        </constructor-arg>
        <constructor-arg name="regex" value="|"/>
    </bean>

    <bean id="scopus.authkeywords" class="org.dspace.importer.external.metadatamapping.MetadataFieldConfig">
        <constructor-arg value="dc.subject"/>
    </bean>

My best guess is that this cannot work because the field mapping is defined in a separate bean (scopus.authkeywords), which is not visible to the innerContributor of scopusAuthkeywordsContrib.
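One further detail in the snippet above may be worth ruling out, independent of the bean question. In Java regular expressions an unescaped | is the alternation operator, so a pattern consisting only of | matches the empty string at every position instead of the literal pipe character. Assuming the SplitMetadataContributor hands its regex argument to Java's regex engine (I have not verified this in the source), the separator would have to be escaped. A minimal sketch of that variant, identical to the configuration above except for the regex value:

```xml
<!-- Sketch only: same beans as in the report, with the separator escaped.
     Assumption: "regex" is interpreted as a Java regular expression. -->
<bean id="scopusAuthkeywordsContrib" class="org.dspace.importer.external.metadatamapping.contributor.SplitMetadataContributor">
    <constructor-arg name="innerContributor">
        <bean class="org.dspace.importer.external.metadatamapping.contributor.SimpleXpathMetadatumContributor">
            <property name="field" ref="scopus.authkeywords"/>
            <property name="query" value="ns:authkeywords"/>
            <property name="prefixToNamespaceMapping" ref="scopusNs"/>
        </bean>
    </constructor-arg>
    <!-- \| matches a literal pipe; the surrounding \s* also drops whitespace around it -->
    <constructor-arg name="regex" value="\s*\|\s*"/>
</bean>

<bean id="scopus.authkeywords" class="org.dspace.importer.external.metadatamapping.MetadataFieldConfig">
    <constructor-arg value="dc.subject"/>
</bean>
```

Spring passes the attribute value through unchanged, so the backslashes reach the regex engine as written. Whether this alone makes the import work is untested; it only removes one possible cause.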

Expected behavior I would expect each subject keyword to end up in its own dc.subject field.

Related work I am not aware of any related PR about this.