clarinsi / clarin-dspace

LINDAT/CLARIN digital repository based on DSpace
http://lindat.cz
BSD 3-Clause "New" or "Revised" License

Unexpected length of local.sponsor #6

Closed cyplas closed 7 years ago

cyplas commented 7 years ago

For some items, the checkmetadata curation task yields: "local.sponsor is a component with 5 values but is not stored as such.":

2017-06-21 14:34:37,199 INFO  org.dspace.curate.Curator @ Curation task: checkmetadata performed on: 11356/1024 workflowID=W124 with status: 1. Result: 'ERROR! local.sponsor is a component with 5 values but is not stored as such. [European Commision@@Copernicus -1634@@SQEL - Spoken Queries in European Languages@@euFunds@@@@info:eu-repo/grantAgreement/EC/FP7/216342]  Item: http://hdl.handle.net/11356/1125

cyplas commented 7 years ago

I think this is about a mismatch between the local.sponsor complex definition in input-forms.xml and an item's local.sponsor value. In input-forms.xml we have:

        <definition name="funding">
            <input name="4_type" type="dropdown" label="input_forms.complex_definitions.funding.4_type.label" pairs="metashare_funding" class="openaire-type-map" required="true"/>
            <input name="2_code" placeholder="input_forms.complex_definitions.funding.2_code.placeholder" type="text" label="input_forms.complex_definitions.funding.2_code.label" class="openaire-code-autocomplete" autocomplete="solr-local.sponsor_ac" required="true"/>
            <input name="1_orgname" type="text" label="input_forms.complex_definitions.funding.1_orgname.label" required="true"/>
            <input name="3_projname" type="text" label="input_forms.complex_definitions.funding.3_projname.label" autocomplete="solr-local.sponsor_ac" required="true"/>
            <input name="5_openaire_id" value="" type="text" readonly="true" label="input_forms.complex_definitions.funding.5_openaire_id.label" placeholder="Filled out automatically..." mapped-to-if-not-default="dc.relation" class="openaire-id"/>
        </definition>

So, 5 expected values. But in the example I gave above, @@-separated, there are six: [European Commision@@Copernicus -1634@@SQEL - Spoken Queries in European Languages@@euFunds@@@@info:eu-repo/grantAgreement/EC/FP7/216342].

I don't know what that empty 5th one is or how it came to be. My best guess is that local.sponsor was previously defined to have 6 values, but one was rarely if ever used, and the definition was changed, while old items still had 6 values in their metadata. However, one of these items (11356/1125) is very new, so, hmm, unless an admin copied/edited the local.sponsor metadata by hand perhaps ... @TomazErjavec, any comments?

In any case, we could fix the existing local.sponsor fields by editing them manually. They all have @@@@, which could, I think, just be replaced with @@. According to the curation task, these are the items: 1025, 1031, 1032, 1048, 1049, 1054, 1056, 1057, 1058, 1059, 1060, 1061, 1066, 1067, 1071, 1072, 1073, 1074, 1078, 1125.
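The manual fix described above could also be scripted. The following is only a minimal Python sketch, not part of clarin-dspace: the function names are hypothetical, and it assumes that a stored local.sponsor value is simply the five components joined with @@. It counts the components and drops empty ones only while a value has more than five, so a legitimately empty trailing field is left alone.

```python
SEPARATOR = "@@"
EXPECTED_COMPONENTS = 5  # one per <input> in the "funding" definition


def split_sponsor(value: str) -> list[str]:
    """Split a stored local.sponsor value into its components."""
    return value.split(SEPARATOR)


def drop_surplus_empty(value: str) -> str:
    """Remove empty components until the value has the expected count.

    Values that already have five components are returned unchanged,
    so an empty trailing field (e.g. no OpenAIRE id) is preserved.
    """
    parts = split_sponsor(value)
    while len(parts) > EXPECTED_COMPONENTS and "" in parts:
        parts.remove("")  # removes the first empty component
    return SEPARATOR.join(parts)


if __name__ == "__main__":
    stored = (
        "European Commision@@Copernicus -1634@@SQEL - Spoken Queries in "
        "European Languages@@euFunds@@@@info:eu-repo/grantAgreement/EC/FP7/216342"
    )
    print(len(split_sponsor(stored)))
    print(drop_surplus_empty(stored))
```

For the value from the log, this has the same effect as replacing @@@@ with @@. Applying the result to the repository would still need to go through the usual metadata editing tools.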

TomazErjavec commented 7 years ago

However, one of these items (11356/1125) is very new, so, hmm, unless an admin copied/edited the local.sponsor metadata by hand perhaps ... @TomazErjavec, any comments?

Indeed, this item is the newest one in the repository, and it wasn't copied manually. So, this is not a problem with an old version of the repo but a current problem that will reappear every time a new item is added.

In any case, we could fix the existing local.sponsor fields by editing them manually. They all have @@@@, which could, I think, just be replaced with @@.

Yes, this string appears with all eu licences and the replacement seems to fix the problem.

According to the curation task, these are the items: 1025, 1031, 1032, 1048, 1049, 1054, 1056, 1057, 1058, 1059, 1060, 1061, 1066, 1067, 1071, 1072, 1073, 1074, 1078, 1125.

I did this, i.e. corrected the listed items (not that I enjoyed it:). But, as said, the problem will again crop up with any new items produced with EU funding.

cyplas commented 7 years ago

Hmm, I created a new item now on beta.clarin.si (https://beta.clarin.si/repository/xmlui/handle/11356/1052) with EU funds, and it doesn't manifest the problem. But, it also doesn't have dc.relation metadata, and I doubt it was added manually(?). @TomazErjavec, could you tell me how to reproduce this problem, so I can try it on beta? I.e., how do I create a submission online which has the @@@@ in local.sponsor and dc.relation?

TomazErjavec commented 7 years ago

could you tell me how to reproduce this problem, so I can try it on beta? I.e., how do I create a submission online which has the @@@@ in local.sponsor and dc.relation?

It seems I can't: I created a submission on fido and @@@@ seems to be OK. I gave the submission 3 EUfunds projects: one that I entered from scratch, one that had already been entered earlier (MONDILEX), and one currently running project for which I looked up the ID. Curation seems to be OK, except for:

Task:  Fix OpenAIRE Metadata
The task was completed successfully.
RESULT [curated 0 items]:
Processing 11356/1126:
Failed to parse id "MONDILEX", probably it's not in expected format (FP-ICT-2014-1-123456).
The raw metadata value is "FP7 Capacities@@MONDILEX@@Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources@@euFunds@@".
Caught exception: -1

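Strange things in this (and also other) curation reports:

  • How was the task completed successfully if it returns an error?
  • Why does it say it curated 0 items, if I gave it 1 item to curate and it reports results on it?
  • and, of course, why did it fail to parse MONDILEX, and can we fix this?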

As for the last point, this fails for all old entries that have an EUfunds project, e.g. http://hdl.handle.net/11356/1042 for the CONCEDE project.

Not sure how much all this helps..

cyplas commented 7 years ago

It seems I can't, I created a submission on fido and @@@@ seems to be ok.

Why on fido?? Isn't fado ideal for this kind of testing? :)

  • How was the task completed successfully if it returns an error?

I think that "task completed successfully" just really means "task completed", i.e., that the java program didn't crash. Hmm, do you know of cases where a curation task completes without this message? I could customise the wording here, if desired.

  • Why does it say it curated 0 items, if I gave it 1 item to curate and it reports results on it?

Hmm, I just tried this exact same curation test and got identical result/error, except that it says "curated 1 item". I do get 0 if I type in a handle which doesn't exist (but then of course there's a different result to the curation task). Strange ... let's keep an eye out for this and let me know if you notice(d) a pattern.

  • and, of course, why did it fail to parse MONDILEX, and can we fix this?

I've looked at the java code (FixOpenAIREMetadata.java) and it's clear that the code expects the second part of the @@-delimited local.sponsor to be a dash-delimited code where the last part is an integer, which it uses as an id. For instance, if local.sponsor was "FP7 Capacities@@FP-ICT-2014-1-123456@@Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources@@euFunds@@", then id = 123456. But this fails, because MONDILEX can't be parsed accordingly.
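To make the expected format concrete, here is a rough Python sketch of the behaviour described above. It is not the code from FixOpenAIREMetadata.java; the function name is hypothetical and the logic is only what this comment describes: take the second @@-delimited component, split it on dashes, and treat the last segment as an integer ID.

```python
from typing import Optional


def extract_project_id(sponsor_value: str) -> Optional[int]:
    """Return the integer ID encoded in the second component, if any.

    The second @@-delimited component is expected to be a dash-delimited
    code whose last segment is an integer, e.g. FP-ICT-2014-1-123456.
    """
    parts = sponsor_value.split("@@")
    if len(parts) < 2:
        return None
    last_segment = parts[1].split("-")[-1].strip()
    # isdecimal() guarantees that int() will accept the segment
    return int(last_segment) if last_segment.isdecimal() else None


if __name__ == "__main__":
    print(extract_project_id("FP7 Capacities@@FP-ICT-2014-1-123456@@Some project@@euFunds@@"))
    print(extract_project_id("FP7 Capacities@@MONDILEX@@Some project@@euFunds@@"))
```

With a code such as FP-ICT-2014-1-123456 the sketch finds an ID, whereas for MONDILEX there is no integer segment to extract, which matches the failure in the curation report.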

If I'm not mistaken, the second part of local.sponsor comes from the "Grant no. or funding project code" value during item submission, which is selected from a generated list of options, when EU is selected in "Funding type". Some of the many items in this list have a dash-separated number-final value, but some don't.

So, there's clearly an incompatibility somewhere, but I don't know whether the issue is with the code handling, the curation task, the code itself, or openaire's requirements ...

TomazErjavec commented 7 years ago

It seems I can't, I created a submission on fido and @@@@ seems to be ok. Why on fido?? Isn't fado ideal for this kind of testing? :)

Sorry, first time I was doing this and late at night. Won't happen again!

How was the task completed successfully if it returns an error? I think that "task completed successfully" just really means "task completed", i.e., that the java program didn't crash.

Ah, ok.

Hmm, do you know of cases where a curation task completes without this message? I could customise the wording here, if desired.

I don't and I don't think it's worth the bother. As long as we know what's what..

Why does it say it curated 0 items, if I gave it 1 item to curate and it reports results on it? Hmm, I just tried this exact same curation test and got identical result/error, except that it says "curated 1 item". I do get 0 if I type in a handle which doesn't exist (but then of course there's a different result to the curation task). Strange ... let's keep an eye out for this and let me know if you notice(d) a pattern.

OK.

and, of course, why did it fail to parse MONDILEX, and can we fix this? So, there's clearly an incompatibility somewhere, but I don't know whether the issue is with the code handling, the curation task, the code itself, or openaire's requirements ...

That narrowed down things :) OK, best to leave this one for when a LINDAT guru is available I guess.

kosarko commented 7 years ago

"local.sponsor is a component with 5 values but is not stored as such."

I think this test sometimes has false positives, but definitely worth checking.

My best guess is that local.sponsor was previously defined to have 6 values

There were 4 fields once; don't think we had six

@@@@

These might be fine from time to time, the submission allows empty "fields"

local.sponsor_ac reflects what's currently in your database + if you've selected type of funds EU there's another "source" added. Basically https://github.com/ufal/clarin-dspace/blob/clarin/dspace/config/openaire-cache.list is queried based on what you type.

The FixOpenAire curation is there to update old records (where we had project ids before we had openaire ids) and to update when you add a new set of ids, e.g. when we added the H2020 source from openaire. There was some sort of bug that would add another field even when you already had 5 fields: https://github.com/ufal/clarin-dspace/pull/514. If you have more than five fields, you'd probably need to fix that by hand, or write some utility class doing that for you, depending on the scale of the problem.

Btw MONDILEX looks like a project name, not a project id, so you might want to check whether the fields are stored in the right order (I know strings with a structure are a mess at times). And a last note, the parser was written for ids entered by our users, so it might not work for you. We had project ids that would look like FP7-?-211938 (in case of MONDILEX), so the parser would split that somehow and try to search for that in openaire-cache.list

Hope that helps.

cyplas commented 7 years ago

Hmm, that helps and I've looked into it, but unfortunately I'm still not getting anywhere. Both your openaire-cache.list and ours have an entry like this:

    <pair>
      <displayed-value>211938 - MONDILEX - Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources</displayed-value>
      <stored-value>info:eu-repo/grantAgreement/EC/FP7/211938</stored-value>
    </pair>

Also https://lindat.mff.cuni.cz/repository/xmlui/choices/dc_relation?query=mondilex and https://beta.clarin.si/repository/xmlui/choices/dc_relation?query=mondilex return the same thing:

    <?xml version="1.0" encoding="UTF-8"?>
    <Choices xmlns="http://www.w3.org/1999/xhtml" start="0" total="1">
      <Choice authority="934" value="info:eu-repo/grantAgreement/EC/FP7/211938">211938 - MONDILEX - Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources</Choice>
    </Choices>

So I don't understand how adding new items and selecting MONDILEX in the funding autocomplete yields bad values for us (and presumably not for you), like:

FP7 Capacities@@MONDILEX@@Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources@@euFunds@@

Here, as you point out, the second element, MONDILEX, looks like a project name, not like a (dash-delimited) project ID. But there is no ID-like element to be found. Any idea why this is happening to us and not to you?
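For comparison, the ID-like information is present in the openaire-cache.list entry quoted above. The following Python sketch (a hypothetical helper, not part of clarin-dspace) reads one such pair and takes the stored value apart, assuming it always has the form info:eu-repo/grantAgreement/<funder>/<programme>/<grant number> seen in that entry.

```python
import xml.etree.ElementTree as ET

PREFIX = "info:eu-repo/grantAgreement/"


def parse_pair(pair_xml: str) -> dict:
    """Extract funder, programme and grant number from one <pair> entry."""
    pair = ET.fromstring(pair_xml)
    stored = pair.findtext("stored-value", default="").strip()
    displayed = pair.findtext("displayed-value", default="").strip()
    if not stored.startswith(PREFIX):
        raise ValueError(f"unexpected stored-value: {stored!r}")
    # Raises ValueError as well if fewer than three segments follow the prefix
    funder, programme, grant_number = stored[len(PREFIX):].split("/")[:3]
    return {
        "displayed": displayed,
        "funder": funder,
        "programme": programme,
        "grant_number": grant_number,
    }


if __name__ == "__main__":
    example = """
    <pair>
      <displayed-value>211938 - MONDILEX - Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources</displayed-value>
      <stored-value>info:eu-repo/grantAgreement/EC/FP7/211938</stored-value>
    </pair>
    """
    print(parse_pair(example))
```

So the numeric grant ID exists in the cache list; the open question is why it does not end up in the stored local.sponsor value.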

kosarko commented 7 years ago

Any idea why this is happening to us and not to you?

My guess would be that the suggestion comes from the database, not from the authority/choices plugin. Have you selected Funding Type 'EU'? Does MONDILEX appear only once? Is there any "used by # submission(s)" under the Project name in autosuggest?

cyplas commented 7 years ago

Ok, after further consultation with @kosarko, I understand a little more, though I'm still not sure how to solve this (see the CONCEDE example below). As Ondrej says, the suggestions come from two sources: one provides valid (and possibly previously unselected) sponsors, which have the correct format, while the other is actually previously used sponsors (and shows up with "used by X submission(s)" underneath), which may or may not have valid format. Moreover, the latter always show, while the former only show if EU funds was selected (since they are EU-specific).

Therefore, as long as we have items with EU-funded local.sponsor values which don't have the expected format, we may continue to have this problem show up for new items. We've solved part of the problem (replacing @@@@ with @@, so local.sponsor has 5 items, as expected), but there remains the problem that some local.sponsor values don't encode an integer ID properly (the second part of the @@-delimited local.sponsor should be a dash-delimited code where the last part is an integer).

For example, concerning MONDILEX, we have 4 occurrences of the "invalid" "FP7 Capacities@@MONDILEX@@Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources@@euFunds@@:". This is the only option that shows up if you type MONDILEX in the "Grant no..." field, UNTIL you select EU under "Funding type", in which case a second (and valid) option shows up.

However, this is not the pattern for all the problematic items. For instance, "EU Copernicus@@CONCEDE@@Consortium for Central European Dictionary Encoding@@euFunds@@:" is equally problematic, but there's no properly encoded alternative, even if you select EU funding. Hmm, what would be the correct value of local.sponsor in this case? @kosarko, @et: what do you think?

As Ondrej pointed out to me, all the values of local.sponsor can be easily seen along with frequencies via Control Panel -> Metadata Quality -> local.sponsor. As can be seen, there are quite a few improperly encoded values.

cyplas commented 7 years ago

If I now understand, the right solution is (a) to fix the invalid values (via Control Panel -> Metadata Quality -> local.sponsor), which will prevent these invalid ones from being displayed as candidates in the future, and then (b) to make sure that when adding items with EU funding in the future, we always select EU and then use the autocomplete.
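Step (a) could be supported by a small audit over exported local.sponsor values. The Python sketch below is illustrative only: the function and field names are hypothetical, and the component order (organisation, code, project name, type, OpenAIRE id) is an assumption based on the numeric prefixes in input-forms.xml and on the stored values quoted in this thread.

```python
FIELD_NAMES = ("orgname", "code", "projname", "type", "openaire_id")  # assumed order


def audit_sponsor(value: str) -> list[str]:
    """Return a list of problems found in one stored local.sponsor value."""
    parts = value.split("@@")
    if len(parts) != len(FIELD_NAMES):
        return [f"expected {len(FIELD_NAMES)} components, found {len(parts)}"]
    fields = dict(zip(FIELD_NAMES, parts))
    problems = []
    if fields["type"] == "euFunds":
        # EU-funded entries need a dash-delimited code ending in an integer
        last_segment = fields["code"].split("-")[-1].strip()
        if not last_segment.isdecimal():
            problems.append(f"code {fields['code']!r} does not end in an integer ID")
    return problems


if __name__ == "__main__":
    values = [
        "EU Copernicus@@CONCEDE@@Consortium for Central European Dictionary Encoding@@euFunds@@",
        "FP7 Capacities@@FP-ICT-2014-1-123456@@Some project@@euFunds@@",
    ]
    for value in values:
        print(value, "->", audit_sponsor(value) or "OK")
```

Values that the audit flags would then be corrected via Control Panel -> Metadata Quality -> local.sponsor, or have their funding type changed if no valid ID exists.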

Of course, this presupposes that the autocomplete source includes all of our possible valid openaire projects. I think this comes from openaire-cache.list (https://raw.githubusercontent.com/clarinsi/clarin-dspace/clarin/dspace/config/openaire-cache.list), which is populated by openaire-refresh-list, configured in turn with URLs by openaire.cfg.

cyplas commented 7 years ago

I think we've fixed the issue with the problematic IDs now, by either correcting the identifiers, or (for some old projects, for example) changing the funding type from EU to Other. @kosarko, thanks for your help.