clarinsi / clarin-dspace

LINDAT/CLARIN digital repository based on DSpace
http://lindat.cz
BSD 3-Clause "New" or "Revised" License

Some item licenses not displaying #1

Closed cyplas closed 7 years ago

cyplas commented 7 years ago

Some item licenses are not appearing properly for some items. Not sure if this is a new problem or if we just didn't notice it before.

For instance, the license doesn't display on https://www.clarin.si/repository/xmlui/handle/11356/1036 (or in the item list when doing a basic search), although it does seem to have the right info in the table: https://www.clarin.si/repository/xmlui/handle/11356/1036?show=full.

However, it does appear properly for many other items, even if they have the same license (https://www.clarin.si/repository/xmlui/handle/11356/1055). Also, item 1036 displays the license fine on fado (https://beta.clarin.si/repository/xmlui/handle/11356/1036), even though both fido and fado are on the same git branch (si-master) on clarin-dspace (and lindat-common too).

kosarko commented 7 years ago

Both the item view and the listing use https://www.clarin.si/repository/xmlui/metadata/handle/11356/1036/mets.xml to render the license info. On your production server that info is, for some reason, missing.

beta:
<mets:amdSec ID="amd_2">
<mets:rightsMD ID="rightsMD_1556_UFAL_LICENSES">
<mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="UFAL_LICENSES">
<mets:xmlData>
<license label="PUB" label_title="Publicly Available" url="http://creativecommons.org/licenses/by-nc-sa/4.0/">
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
<labels>
<label label="CC" label_title="Distributed under Creative Commons"/>
<label label="BY" label_title="Attribution Required"/>
<label label="NC" label_title="Noncommercial"/>
<label label="SA" label_title="Share Alike"/>
</labels>
</license>
</mets:xmlData>
</mets:mdWrap>
</mets:rightsMD>
</mets:amdSec>

clarin.si:
<mets:amdSec ID="amd_2">
<mets:rightsMD ID="rightsMD_1556_UFAL_LICENSES">
<mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="UFAL_LICENSES">
<mets:xmlData/>
</mets:mdWrap>
</mets:rightsMD>
</mets:amdSec>

This gets populated from somewhere in ItemAdapter.java or more precisely by LicensesDisseminationCrosswalk.java

My very wild guess is that the license on line 83 is null, so the XML is not filled in. If that's the case, it should be fixable through "Edit this item" -> License tab, where you would select the license again and click update.

Can you see any exceptions in dspace.log or utilities.log? How was the item created? Was that some kind of import? Were the license metadata filled in at a later date? (try checking dc.description.provenance)
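
To check other items for the same symptom, the two mets.xml responses above can be compared mechanically. A minimal sketch, assuming the METS document is already available as a string; the function name `has_license_info` is made up for illustration, and it only tests whether the `UFAL_LICENSES` wrapper has any child inside `xmlData`:

```python
import xml.etree.ElementTree as ET

# Standard METS namespace; the "mets:" prefix in the output above maps to it.
METS_NS = "http://www.loc.gov/METS/"


def has_license_info(mets_xml: str) -> bool:
    """Return True if the UFAL_LICENSES section contains any element."""
    root = ET.fromstring(mets_xml)
    for wrap in root.iter(f"{{{METS_NS}}}mdWrap"):
        if wrap.get("OTHERMDTYPE") != "UFAL_LICENSES":
            continue
        xml_data = wrap.find(f"{{{METS_NS}}}xmlData")
        # An empty <mets:xmlData/> has no children, which is the broken case.
        if xml_data is not None and len(xml_data) > 0:
            return True
    return False
```

Fetching the document (for example from the `.../mets.xml` URL above) is left out on purpose, so the check itself has no network dependency.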

cyplas commented 7 years ago

Thanks for the feedback. I've followed up on your tips and questions a bit:

You might be right about the null problem, since your solution seems to have worked: I updated the license for item 1036 and now it shows normally. Thanks!

But I'm not sure how it came to this. Our items are always created manually, not through an import. @TomazErjavec doesn't think that any license metadata was changed, except to update from http to https; but even if that was responsible for the problem, it does not appear to have applied to items consistently, with some old items still having licenses and others not.

Looking at the description.provenance for 3 items (2 of which exhibited the problem), I'm not seeing any direct references to license metadata. Two of the items (although 1 of these had the problem and the other not) register a metadata change, but don't mention the license (although there is ",,"): "Item metadata(dc.subject,,TEI) were added in processAddMetadata by Tomaz;Tomaž Erjavec (tomaz.erjavec@ijs.si) on 2016-12-23T20:53:52Z".

I looked in the logs that you mentioned and didn't find anything obvious. But I'm not sure exactly what to look for and what time period; a grep (-v INFO) for the handle ID in the dspace.log and utilities.log files didn't seem to reveal anything relevant.

Hmm, we could go through all the items and fix the licenses manually, but that would be a little tedious. Do you know if there's a simple way (like an sql query) to see which items lack a license?

kosarko commented 7 years ago

If the problem really is that null, that would mean you have a dc.rights.uri value that's not in the definitions table. So maybe, if you did change it from http to https, that change happened only in some places, so the values are not matching up now... @amirkamran, @vidiecan any thoughts on this?
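
One way to test the http/https hypothesis is to check whether an unmatched dc.rights.uri would match a definition once its scheme is swapped. A rough sketch under that assumption; `swap_scheme` and `scheme_only_mismatch` are hypothetical helper names, and `defined` stands for the set of values from the definitions table:

```python
from urllib.parse import urlsplit, urlunsplit


def swap_scheme(uri):
    """Return the same URI with http and https exchanged, or None otherwise."""
    parts = urlsplit(uri)
    swapped = {"http": "https", "https": "http"}.get(parts.scheme)
    if swapped is None:
        return None
    return urlunsplit((swapped,) + tuple(parts[1:]))


def scheme_only_mismatch(uri, defined):
    """True if uri is undefined but its scheme-swapped twin is defined."""
    other = swap_scheme(uri)
    return uri not in defined and other is not None and other in defined
```

If this returns True for the affected items, the definitions were most likely updated to https while the item metadata kept the old http value.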

There's no simple sql that you could use, as the information is in two databases and cross database joins are somewhat tricky in postgres...although, you could try the following script:

#!/bin/bash
# Database names and port: take the real values from local.properties.
UTIL_DB=dspace5l_licenses
DB=dspace5l
PORT=5433
# Dump item ids with their dc.rights.uri values (resource_type_id=2 selects items).
sudo -u postgres psql -t -p "$PORT" "$DB" -c "select resource_id, text_value from metadatavalue natural join metadatafieldregistry where element='rights' and qualifier='uri' and resource_type_id=2;" > id_license.txt
# Dump the known license definitions.
sudo -u postgres psql -t -p "$PORT" "$UTIL_DB" -c "select definition from license_definition;" > definitions.txt
# Print every item whose dc.rights.uri is not among the definitions (gensub needs GNU awk).
awk -F "|" 'NR==FNR{$0 = gensub(/^[ ]*|[ ]*$/,"","g",$0);a[$0]=$0;next}{$2 = gensub(/^[ ]*|[ ]*$/,"","g",$2);if(!($2 in a)){print $1 "has dc.rights.uri=" $2 " thats not defined"}}' definitions.txt id_license.txt

You'll have to set the first three variables (you should find the right values, i.e. the names and port, in local.properties). The first query dumps item ids and dc.rights.uri values to a file, the second dumps the definitions, and then awk reports the values that are missing from the definitions. The gensub parts trim whitespace. NR==FNR is true only for the first file, i.e. that branch loads the definitions into the array a; while reading the second file, awk checks whether a contains the dc.rights.uri value and, if not, prints the id and the value. Items with no dc.rights.uri are ignored. No guarantees...
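
For anyone without GNU awk, the comparison step can be sketched in Python. This assumes the two dump files have the `psql -t` layout described above (one definition per line, and `id | uri` rows); the function name `undefined_licenses` is invented for this example:

```python
def undefined_licenses(definition_lines, id_license_lines):
    """Return (item_id, uri) pairs whose uri is not a known definition."""
    # Trim whitespace, as the gensub calls do in the awk version.
    defined = {line.strip() for line in definition_lines if line.strip()}
    missing = []
    for line in id_license_lines:
        if "|" not in line:
            continue  # skip blank or malformed rows
        item_id, uri = (part.strip() for part in line.split("|", 1))
        if uri not in defined:
            missing.append((item_id, uri))
    return missing
```

It could be called as `undefined_licenses(open("definitions.txt"), open("id_license.txt"))`, since both arguments only need to be iterables of lines.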

vidiecan commented 7 years ago

@cyplas the bitstreams have been updated/replaced after ingestion (see the provenance info).

@kosarko when did we fix licensing of changed/replaced bitstreams?

cyplas commented 7 years ago

@vidiecan Yes, the bitstreams were updated. I didn't realise/think that this would affect the item licenses.

@kosarko Thanks a lot for the script and explanation, it seems to work:

        1616 has dc.rights.uri=http://creativecommons.org/licenses/by/4.0/ thats not defined
        1605 has dc.rights.uri=http://creativecommons.org/licenses/by/4.0/ thats not defined
        1576 has dc.rights.uri=http://creativecommons.org/licenses/by-nc-sa/4.0/ thats not defined
        1559 has dc.rights.uri=http://creativecommons.org/licenses/by-nc-sa/4.0/ thats not defined
        1549 has dc.rights.uri=http://creativecommons.org/licenses/by-nc-sa/4.0/ thats not defined
        1532 has dc.rights.uri=http://creativecommons.org/licenses/by/4.0/ thats not defined
        1486 has dc.rights.uri=https://lindat.mff.cuni.cz/repository/xmlui/page/licence-pdtsl thats not defined
        1485 has dc.rights.uri=https://lindat.mff.cuni.cz/repository/xmlui/page/licence-pdtsl thats not defined
        1405 has dc.rights.uri=http://creativecommons.org/licenses/by-nc-sa/3.0/ thats not defined

cyplas commented 7 years ago

Aha, those aren't the handle ids, but I got 4 of those by querying the handle db table:

> select handle_id, resource_id from handle where resource_id in (1616, 1605, 1576, 1559, 1549, 1532, 1486, 1485, 1405);
 handle_id | resource_id
-----------+-------------
      1030 |        1549
      1025 |        1532
      1075 |        1616
      1044 |        1559

I checked these 4 and they were indeed licenseless, so I've fixed them manually now, based on the output of your script.

I wonder: what about the other 5? Are these some kinds of irrelevant or obsolete items?

kosarko commented 7 years ago

Those are internal ids. If you are logged in as admin, in the menu on the right there's something like content administration, and under that something like item management. When you click that, you should be able to enter either a handle or an internal id, and it should take you to the edit item page if the item exists.

Otherwise, try checking the item table for those item/resource ids.
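
To see which ids still need checking, the ids without a handle row can be listed with a small set difference. A sketch only, with both id lists copied from the script output and the handle query result earlier in the thread:

```python
# Internal ids reported by the script as having an undefined dc.rights.uri.
reported = {1616, 1605, 1576, 1559, 1549, 1532, 1486, 1485, 1405}

# resource_id values that came back from the handle table query.
with_handle = {1549, 1532, 1616, 1559}

# Ids that had no handle row and still need a look in the item table.
without_handle = sorted(reported - with_handle)
print(without_handle)  # [1405, 1485, 1486, 1576, 1605]
```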
