cioos-siooc / ckan

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers datahub.io, catalog.data.gov and europeandataportal.eu/data/en/dataset among many other sites.
http://ckan.org/

Harvest doi citation from <code> when <codeSpace> is doi.org #191

Closed ItaloBorrelli closed 3 months ago

ItaloBorrelli commented 1 year ago

Current behaviour: if identifier <code> has text matched by the regex here then it will be used for the citation.

Expected behaviour: if <codeSpace> is doi.org then the <code> value should be used as the doi citation identifier.

fostermh commented 1 year ago

Hey Italo, thanks for reporting. Could you provide an example identifier snippet? I'm assuming you are not including the full URL in the code field?

ItaloBorrelli commented 1 year ago

Sure thing! Taking a look at this xml we have:

<mdb:MD_Metadata...
  <mdb:identificationInfo>
    <mri:MD_DataIdentification>
      <cit:citation>
        <cit:CI_Citation>
          ...
          <cit:identifier>
            <mcc:MD_Identifier>
              <mcc:authority>
                <cit:CI_Citation>
                  <cit:title>
                    <gco:CharacterString>DataCite</gco:CharacterString>
                  </cit:title>
                </cit:CI_Citation>
              </mcc:authority>
              <mcc:code>
                <gco:CharacterString>10.34943/4831b0a0-7f01-4863-b44f-2ef0729d45ef</gco:CharacterString>
              </mcc:code>
              <mcc:codeSpace>
                <gco:CharacterString>doi.org</gco:CharacterString>
              </mcc:codeSpace>
            </mcc:MD_Identifier>

I checked with my metadata team because I wasn't confident and they are sure that this is a valid way of providing the DOI within ISO 19115 xml.
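For anyone following along, here is a minimal sketch of pulling the `code` and `codeSpace` values out of an identifier shaped like the one above. It is not the harvester's actual code: `read_identifier` is a hypothetical helper, and the namespace URIs are assumptions modelled on ISO 19115-3 that should be checked against what the real records declare.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions modelled on ISO 19115-3;
# use whatever your own records declare.
NS = {
    "cit": "http://standards.iso.org/iso/19115/-3/cit/2.0",
    "mcc": "http://standards.iso.org/iso/19115/-3/mcc/1.0",
    "gco": "http://standards.iso.org/iso/19115/-3/gco/1.0",
}

# Trimmed copy of the identifier from the comment above,
# with the namespaces declared inline so it parses on its own.
SAMPLE = f"""
<cit:identifier xmlns:cit="{NS['cit']}" xmlns:mcc="{NS['mcc']}" xmlns:gco="{NS['gco']}">
  <mcc:MD_Identifier>
    <mcc:code>
      <gco:CharacterString>10.34943/4831b0a0-7f01-4863-b44f-2ef0729d45ef</gco:CharacterString>
    </mcc:code>
    <mcc:codeSpace>
      <gco:CharacterString>doi.org</gco:CharacterString>
    </mcc:codeSpace>
  </mcc:MD_Identifier>
</cit:identifier>
"""


def read_identifier(xml_text):
    """Return (code, code_space) for the first MD_Identifier, or (None, None)."""
    root = ET.fromstring(xml_text.strip())
    ident = root.find(".//mcc:MD_Identifier", NS)
    if ident is None:
        return None, None
    # findtext returns the default when the element is missing.
    code = ident.findtext("mcc:code/gco:CharacterString", default="", namespaces=NS)
    code_space = ident.findtext("mcc:codeSpace/gco:CharacterString", default="", namespaces=NS)
    return code.strip(), code_space.strip()


code, code_space = read_identifier(SAMPLE)
print(code, code_space)
```

With both values in hand, the harvester can decide whether `codeSpace` marks the `code` as a DOI.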

fostermh commented 1 year ago

It is valid, yes. The problem is that there are also other valid ways to represent it. For example, one could put the full URI in the code field; this is frequently done when using a version of the ISO standard other than 19115-3. It would look like https://doi.org/10.34943/4831b0a0-7f01-4863-b44f-2ef0729d45ef, for example.

I can make a small change which I think will accommodate your use case, but the https:// is a bit of a problem, as it can be hard to know whether it should be included or not. In the case of doi.org it is probably safe to assume it is a full URL. On the other hand, in the case of datasets from 'GLOS', for example, it is less clear, as their identifiers look like a URL but in fact do not link to anything.
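A rough sketch of how the two representations might be reconciled, not the change that was actually made in the harvester. `doi_url` and `DOI_HOSTS` are hypothetical names, treating `dx.doi.org` as an equivalent resolver is an assumption, and the `10.` prefix check is only a loose sanity test for a DOI.

```python
from urllib.parse import urlparse

# Resolver hosts treated as "this is a DOI" (assumption for this sketch).
DOI_HOSTS = {"doi.org", "dx.doi.org"}


def doi_url(code, code_space=""):
    """Return a https://doi.org/... URL, or None if this does not look like a DOI."""
    code = (code or "").strip()
    code_space = (code_space or "").strip()

    # Case 1: the code field already holds the full DOI URL.
    parsed = urlparse(code)
    if parsed.scheme in ("http", "https") and parsed.netloc.lower() in DOI_HOSTS:
        return "https://doi.org" + parsed.path

    # Case 2: the code is a bare DOI and codeSpace names the resolver,
    # with or without a protocol ("doi.org" or "https://doi.org").
    space = code_space if "://" in code_space else "//" + code_space
    if urlparse(space).netloc.lower() in DOI_HOSTS and code.startswith("10."):
        return "https://doi.org/" + code

    # Anything else (for example a URL-shaped identifier on another host)
    # is left alone rather than guessed at.
    return None


print(doi_url("10.34943/4831b0a0-7f01-4863-b44f-2ef0729d45ef", "doi.org"))
```

Only identifiers tied to a known DOI resolver are turned into links, so URL-shaped identifiers that do not resolve are never promoted to a citation URL.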

ItaloBorrelli commented 1 year ago

Do you mean that if we add http or https to the codeSpace, it should be recognized as the DOI URL, and that it's not working now because the protocol is missing from the match? I haven't tried that out but I'll give it a go shortly. If http(s?)://doi.org is valid, I can check whether it would be OK for me to use that for the codeSpace instead of just doi.org.

fostermh commented 1 year ago

That would make life easier, yes. Alternatively, 'code' could be the full URL while codeSpace is 'doi.org' and authority is 'DataCite'. There are many ways to do this. The ISO recommendation seems to be to either break it up into the 3 parts or use a full URL for code.

ItaloBorrelli commented 11 months ago

We have added the protocol to the codeSpace in the citation identifier, which I think is the solution we decided on. You can see it here:

  <mcc:MD_Identifier>
    <mcc:authority>
      <cit:CI_Citation>
        <cit:title>
          <gco:CharacterString>DataCite</gco:CharacterString>
        </cit:title>
      </cit:CI_Citation>
    </mcc:authority>
    <mcc:code>
      <gco:CharacterString>10.34943/d123e437-f06f-48f6-87a0-121a938ef792</gco:CharacterString>
    </mcc:code>
    <mcc:codeSpace>
      <gco:CharacterString>https://doi.org</gco:CharacterString>
    </mcc:codeSpace>
  </mcc:MD_Identifier>
</cit:identifier>

Is this sufficient? Is there work that needs to be done on the harvester to accommodate this as well?

*edited to remove question from code block

fostermh commented 7 months ago

This appears to be resolved.

ItaloBorrelli commented 6 months ago

Should be fixed with: https://github.com/cioos-siooc/ckanext-spatial/pull/44