SSHOC / sshoc-marketplace-backend

Code for the backend
Apache License 2.0

Encoding of vocabulary concepts code #423

Open KlausIllmayer opened 8 months ago

KlausIllmayer commented 8 months ago

Currently, you are quite free when creating a concept for a vocabulary with status open (which is the case for the sshoc-keyword-vocabulary): any value is accepted for code. This can lead to invalid and even unsafe URLs, because the concept URI is built from this value. Take this:

POST /api/vocabularies/sshoc-keyword/concepts?candidate=true

{
  "code": "i have spaces inside"
}

leads to this result:

{
  "code": "i have spaces inside",
  "vocabulary": {
    "code": "sshoc-keyword",
    "scheme": "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/",
    "namespace": "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/",
    "label": "Keywords from SSHOMP",
    "closed": false
  },
  "label": "",
  "notation": "",
  "uri": "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/i have spaces inside",
  "candidate": true,
  "relatedConcepts": []
}

As you can see, the created URI also contains spaces. This could still work by URL-encoding it and addressing it as https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/i%20have%20spaces%20inside instead. But we constantly run into problems, starting with a bug in the full export of the vocabulary (GET /api/vocabularies/sshoc-keyword/export), which handles this situation inconsistently. In the generated Turtle file, the concept itself is correctly encoded:

<https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/opinion%20mining> a skos:Concept;
  skos:notation "opinion mining";
  skos:topConceptOf "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/";
  skos:inScheme "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/" .

but not when applying it as a top concept:

<https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/> skos:hasTopConcept [...]  "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/opinion mining", [...]
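One way to avoid this kind of mismatch is to derive the concept URI in a single place and reuse it wherever the concept is referenced. The following is only a minimal sketch under that assumption: the class and method names (`ConceptUriBuilder`, `conceptUri`) are hypothetical and not taken from the backend code, and it percent-encodes the code as a single path segment.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the actual backend.
public class ConceptUriBuilder {

    // Builds the concept URI from the vocabulary namespace and the raw code.
    // URLEncoder produces form encoding (space becomes "+"), so "+" is turned
    // into "%20" afterwards. This is safe because a literal "+" in the input
    // has already been encoded as "%2B" at that point.
    public static String conceptUri(String namespace, String code) {
        String segment = URLEncoder.encode(code, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return namespace + segment;
    }

    public static void main(String[] args) {
        String ns = "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/";
        // The same call would be used for the concept itself and for skos:hasTopConcept.
        System.out.println(conceptUri(ns, "opinion mining"));
        System.out.println(conceptUri(ns, "i/slash"));
    }
}
```

With this sketch, "opinion mining" ends in opinion%20mining in both places, and "i/slash" ends in i%2Fslash, so the slash no longer introduces an extra path segment.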

This is a bug that needs to be fixed. But it is also possible to use the "/" character when creating a concept:

POST /api/vocabularies/sshoc-keyword/concepts?candidate=true

{
  "code": "i/slash"
}

leads to

{
  "code": "i/slash",
  "vocabulary": {
    "code": "sshoc-keyword",
    "scheme": "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/",
    "namespace": "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/",
    "label": "Keywords from SSHOMP",
    "closed": false
  },
  "label": "",
  "notation": "",
  "uri": "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/i/slash",
  "candidate": true,
  "relatedConcepts": []
}

Here the real problems start, because I can no longer access the concept via the API. For example, GET https://marketplace-api.sshopencloud.eu/api/vocabularies/sshoc-keyword/concepts/TCP/IP returns a 404 (we currently have this example in production, see https://marketplace-api.sshopencloud.eu/api/concept-search?page=1&f.candidate=true&types=keyword&q=TCP).

I had the impression that the backend once did an encoding in the background, so that "i/slash" becomes "i+slash" and "i have spaces inside" becomes "i+have+spaces+inside". But looking into the issue history, I see in #147 that Michał seems to have proposed handling this in the call, i.e. applying such an encoding manually, e.g.

POST /api/vocabularies/sshoc-keyword/concepts?candidate=true

{
  "code": "i+slash",
  "label": "i/slash"
}

The other option would be to set the uri explicitly:

POST /api/vocabularies/sshoc-keyword/concepts?candidate=true

{
  "code": "i/slash",
  "label": "i/slash",
  "uri": "https://vocabs.sshopencloud.eu/vocabularies/sshomp-keyword/i+slash"
}

Neither option is very attractive, as we can't communicate this behaviour well to external sources (we communicated it to the frontend, which handles it by removing special characters).

The decision is therefore whether we implement automatic behaviour in the backend that replaces special characters in the code of a concept with, e.g., the "+" character. @tparkola @laureD19 @vronk what do you think?

Once we have decided whether to change the behaviour, we need to think about how to deal with the existing concepts, some of which will no longer be valid.
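To make the proposed automatic replacement concrete, here is a rough sketch of what such a normalization could look like. Everything in it is an assumption for illustration: the class name `ConceptCodeNormalizer`, the method `normalize`, and in particular the set of characters treated as safe (ASCII letters, digits, ".", "_" and "-") are not taken from the backend.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the automatic replacement, not existing backend code.
public class ConceptCodeNormalizer {

    // Assumption: only ASCII letters, digits, '.', '_' and '-' are kept as-is.
    // Any run of other characters is collapsed into a single '+'.
    private static final Pattern UNSAFE = Pattern.compile("[^A-Za-z0-9._-]+");

    public static String normalize(String code) {
        return UNSAFE.matcher(code.trim()).replaceAll("+");
    }

    public static void main(String[] args) {
        System.out.println(normalize("i/slash"));              // i+slash
        System.out.println(normalize("i have spaces inside")); // i+have+spaces+inside
        System.out.println(normalize("TCP/IP"));               // TCP+IP
    }
}
```

Note that such a replacement is lossy: "i/slash" and "i slash" would both become "i+slash", so the original value would have to be kept in label, as in the manual example above.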

tparkola commented 7 months ago

The simple question is whether we can assume that all codes are URL-encoded, and that you also need to provide the URL-encoded code when you access a concept. Would that make sense? See https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/URLEncoder.html and https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/URLDecoder.html to understand what I mean.
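As an illustration of what this would mean for the codes from the issue, the following small sketch runs them through the two linked JDK classes. Only `URLEncoder` and `URLDecoder` are real APIs here; the wrapper class `ConceptCodeCodec` and its method names are made up for the example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative wrapper around the JDK classes; the class itself is hypothetical.
public class ConceptCodeCodec {

    public static String encode(String code) {
        return URLEncoder.encode(code, StandardCharsets.UTF_8);
    }

    public static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("i have spaces inside")); // i+have+spaces+inside
        System.out.println(encode("TCP/IP"));               // TCP%2FIP
        System.out.println(decode(encode("TCP/IP")));       // TCP/IP
        // A stored code that already contains '+' decodes to a space:
        System.out.println(decode("i+slash"));              // i slash
    }
}
```

Two details are worth keeping in mind. `URLEncoder` implements form encoding, so a space becomes "+" rather than "%20", while `URLDecoder` accepts both. And existing codes that were created manually with "+" (such as "i+slash") would decode to a space, not a slash, which is relevant for the question of how to treat old concepts.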

laureD19 commented 2 months ago

After our discussion today, let's proceed as you suggested @tparkola.