Graph URI and Name

Gonsco commented 1 month ago

Hi,

This issue continues the topic started in another issue regarding the format of URIs for regulatory graphs, where @nataschake proposed .../${country}/${file}/${language}/${version}.

After analyzing it, I see some drawbacks in using the ${file} field. On the RFT side, we use the project name entered by the user as the file name, which can be very long and contain uncontrolled characters. There are various ways to solve this last issue, but I think it would be worth considering other options. One of them could be to define a classification for the regulations and use the category (selected by the user) in the URI, assuming that there is no more than one regulation per category (e.g., accessibility). For example:

https://graphdb.accordproject.eu/resource/aec3po/FI/accessibility/en-GB/v1

What do you think @beachtom @nataschake @VladimirAlexiev?

Also, are we definitely going to add the version field in the URI even though @beachtom proposed to add the published date instead, if I am not mistaken?

beachtom commented 1 month ago

So ${file}: the concept of ${file} is actually some form of short identifiable string. In the RFT we could ask the user to enter this, or we could generate it from the long name, e.g. by removing characters or spaces.
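
As an illustration of generating the short name from the long name, here is a minimal Python sketch. The function name `short_name` and the 20-character limit are assumptions made for this example; the thread does not fix either of them.

```python
import re
import unicodedata


def short_name(long_name: str, max_len: int = 20) -> str:
    """Derive a URI-safe short identifier from a free-text project name.

    Hypothetical helper: the name and the length limit are illustrative only.
    """
    # Decompose accented characters and drop the combining marks (e.g. ö -> o).
    ascii_text = (
        unicodedata.normalize("NFKD", long_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove spaces, punctuation and any other character that is not a letter or digit.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "", ascii_text)
    # Truncate so that very long project names stay manageable in a URI.
    return cleaned[:max_len]
```

For example, `short_name("Act 958-2012")` gives `Act9582012`. Two different long names can still collapse to the same short name, so a uniqueness check would be needed on top of this.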

We do need the version, and I think we should be flexible about what type it takes. Many documents will use a year as a date; some will use a version string, e.g. 1.5. I think we need to support both.

In terms of classifications, we cannot be certain there is only one regulation for each classification, e.g. in the UK there are two for accessibility. When we looked at this for the UK context, we came to the view of letting the author/publisher of the document choose.

Gonsco commented 4 weeks ago

About ${file}: I am not convinced by asking the user to enter a short identifiable string. For example, in your case @beachtom, how do we know that two users will not publish the two different UK accessibility regulations using the same .../UK-ACC/...? Also, while the field is aimed at identifying regulations, the name a user would choose to identify them can vary greatly. I find the classification option more interesting: we know explicitly what type of regulation it is (we provide the meaning in a standardized way, via a category that the user must select). And having more than one regulation per classification category should not be a problem: all the classification does is classify. One solution for the problem of having more than one could be to add a 'type' numbering. For example:

https://graphdb.accordproject.eu/resource/aec3po/UK/accessibility/1/en-GB/v1
https://graphdb.accordproject.eu/resource/aec3po/UK/accessibility/2/en-GB/v1

(We would 'only' need to control the number of published regulations of the same 'type' and 'country'.) In any case, I think that both approaches can be valid. So it would be interesting to hear more opinions, if possible, before choosing one of the two approaches (@nataschake @VladimirAlexiev @maximelefrancois86 ....)

Regarding ${version}: before entering into the discussion, I think we should differentiate the two meanings of the concept of 'version' so as not to get confused and to make sure we are on the same page:

At the 'authority' level. For example, the Finnish government published an accessibility regulation in 1999, then published another one in 2012. These are two 'versions' in the sense that they have been generated/approved/officially published by the corresponding competent authorities. Within the RFT, these would be two different projects, even if carried out by two different users.
At the 'RFT user' level. For example, an RFT user publishes the Finnish accessibility regulation (2012), then realizes he/she made mistakes in the RASE markup and publishes a new version. So, the first important question: do we consider this case, or should the result be overwritten using the same URI, assuming that there can only be one 'version' of each regulation and therefore, in reality, no version field is needed?

Some conclusions: I think that the 'official' publication date of the regulation by the authorities (not the date of publication by an RFT user) should be another field in the URI. If it makes sense that more than one version of the same regulation (officially approved by the competent authorities) can be published from the RFT, then the URIs could be:

https://graphdb.accordproject.eu/resource/aec3po/FI/FI-ACC/en-GB/17-9-2015/v1
https://graphdb.accordproject.eu/resource/aec3po/FI/FI-ACC/en-GB/17-9-2015/v2
....

If not:

https://graphdb.accordproject.eu/resource/aec3po/FI/FI-ACC/en-GB/17-9-2015

What do you think?

beachtom commented 4 weeks ago

So ${file}: my point here is that often there are already existing short names used by the relevant authorities. I think perhaps we could suggest a classification but let the user override it. For example, in the UK the accessibility regulations are classified as "Part M" and that is what they are known as in the industry. I think the short name should be unique, i.e. we should not mix versions of the same document with versions of a different document.

For version, I am happy to settle on using the publishing date. For me, we should separate versioning for the sake of drafting from versioning of published versions.

For your examples I think we should use https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/17-9-2015

because if we add /v1 and /v2 it implies they publish two versions on the same day, which does not seem likely. I also changed FI-ACC to ACC, since otherwise we had FI in two places.

I would be happy with

https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/17-9-2015

But ACC should be user-definable as well, e.g. for the UK pilot:

https://graphdb.accordproject.eu/resource/aec3po/EU/EuroCode3/en-GB/01-01-2015

(I just made up the date here)

Gonsco commented 4 weeks ago

I don't see how an approach would work where a classification is used but the user can override it. The system would not be deterministic and we would lose the advantage. My idea was not to help the user put a name there, but to be able to infer the type of regulation from the URI on the microservice side, without the need for other inferences and guesses. But honestly, I don't know to what extent this would be useful or an advantage right now. And while it is true that there may be existing short names used by the relevant authorities, the problem remains that we cannot control what the user will end up putting in this field, even if we limit it to, for example, 10 characters maximum. My preference here would be for one option or the other, but not a mixed one.

Regarding ${version}: so you propose not allowing versions of the same regulation to be generated, but rather overwriting existing ones. I wonder what happens in the opposite case you raised, if someone modifies a document when many microservices are already using the content of the previous graph. Could this be a problem? Or are we just going to assume that this will almost never happen? (An RFT user who realizes he/she made mistakes in the RASE markup and publishes a new version two months later.) If ACC should be user-definable, we could end up, as before, with any of these as a possible URI:

https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/17-9-2015
https://graphdb.accordproject.eu/resource/aec3po/FI/FI-ACC/en-GB/17-9-2015
https://graphdb.accordproject.eu/resource/aec3po/FI/Act 958-2012/en-GB/17-9-2015
....etc.

It is this possible variation in the hands of the user that does not convince me, but let's see if someone else can give feedback on this issue.

beachtom commented 4 weeks ago

Name: Then I think we must go for allowing the user to enter it, with restrictions that we need to decide. I don't believe we can create, in the timescales we have, a classification that will work for all EU nations.

Version: We should never allow overwriting of a version; versions must be immutable. So if we see a use case for more than one version per day, then we should include a time. This has some precedent. I.e. the version is the date + time it is published.
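
The immutability rule described above can be sketched as follows. This is only an in-memory stand-in written for illustration; the class `GraphStore` and its methods are hypothetical and do not correspond to any existing RFT or GraphDB API.

```python
class GraphStore:
    """In-memory stand-in for the graph database (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Maps a graph URI to its serialized content.
        self._graphs: dict[str, str] = {}

    def publish(self, uri: str, turtle: str) -> None:
        # A published version is a matter of record: refuse to replace it.
        if uri in self._graphs:
            raise ValueError(f"{uri} is already published; publish a new version instead")
        self._graphs[uri] = turtle

    def get(self, uri: str) -> str:
        # Return the content stored under the given graph URI.
        return self._graphs[uri]
```

With this rule, a correction can only be published under a URI whose version segment (date, or date plus time) differs from every version already published.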

Gonsco commented 4 weeks ago

Name: This classification may simply be a list with the N categories required for the current pilot cases in ACCORD, which can always be expanded in the future. I don't think this is the issue; the question is whether it is actually a better approach or not, what the possible benefits are on the microservices side, etc.

Version: I thought you were suggesting earlier just overwriting a version instead of creating versions of it (V1, V2...). I wouldn't focus on 'day' as a time unit. I guess the RFT user could realise that he/she made a mistake, for example, by someone telling him/her so two months later. Then what? Are we considering this scenario? Does it make sense?

beachtom commented 4 weeks ago

Name: I still think this will be challenging. If we want to model this categorisation, we have metadata in the knowledge graph we can use. I am personally against inferring anything from the URL in the microservices. For example, in FI we have two accessibility regulations, one for schools and one for other regulations, and we may have regulations that transcend the categories. I think we are making problems for ourselves later down the line.

Version: So shall we use a date/time string to represent the version? If someone makes a mistake, then once a document is published and an amendment is produced, it should be a new version, as once published the regulation should be a matter of record.

Gonsco commented 4 weeks ago

After a meeting with Thomas, we agreed on:

The general scenario is one where the graph version of the regulation is published on the same day as the official publication of the regulation in PDF format.
The user can define this date manually. So, this is compatible with the non-general case. For example, a retrospective scenario where a graph is published months or years after a regulation is officially published.
If someone realises that the graph is wrong, a new one must be generated as a different version. Both must be kept. What changes is the date of the second graph: it will be the date on which the new version of the graph is made public (different from the publication date of the PDF). So, the ${version} param is removed from the URI. Example:
https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/2024-4-20
https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/2024-7-17
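
The URI layout agreed above (country, name, language, date) could be assembled as in the following sketch. The function name `graph_uri` is hypothetical. The zero-padded ISO 8601 date is an assumption of this sketch, chosen so that URIs sort chronologically as strings; the examples in this thread use unpadded forms and the exact date format has not been fixed here.

```python
from datetime import date
from urllib.parse import quote

# Base taken from the example URIs in this thread.
BASE = "https://graphdb.accordproject.eu/resource/aec3po"


def graph_uri(country: str, name: str, language: str, published: date) -> str:
    """Build a graph URI of the form BASE/country/name/language/date."""
    # Assumption: the date is written as zero-padded YYYY-MM-DD.
    segments = [country, name, language, published.isoformat()]
    # Percent-encode each segment so spaces or slashes in a name cannot break the path.
    return "/".join([BASE] + [quote(segment, safe="") for segment in segments])
```

Calling `graph_uri("FI", "ACC", "en-GB", date(2024, 4, 20))` yields `https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/2024-04-20`.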

@beachtom anything you want to clarify or correct, go ahead.

Whether a classification should be used or not remains open. One option could be a classification that could be extended, perhaps by the users themselves, but always in a supervised/approved manner. If there are no comments on this, we can try to propose an approach from the RFT side and discuss it.

If anyone has any different suggestions, please share them soon, before we close this issue.

nataschake commented 3 weeks ago

@Gonsco Hi, I don't envisage any issues with $version being removed from the URI. $classifier can be declared and iteratively extended, but not edited (so as not to lose obsolete classes), in e.g. a SKOS concept scheme, so we add

@base          <https://graphdb.accordproject.eu/resource/aec3po/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .

<classifier> a skos:ConceptScheme.

<classifier/ACC>
  skos:notation "Accessibility of Building";
  skos:inScheme <classifier> .

<classifier/CO2>
  skos:notation "Carbon Footprint";
  skos:inScheme <classifier> .

etc. If this is accepted, the RFT should maintain the list of classes in this ConceptScheme (and issue the proper SPARQL updates to insert them).
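
A rough sketch of how such a SPARQL update could be generated is shown below. It mirrors the Turtle above (same base, same `skos:notation` and `skos:inScheme` triples); the function name `insert_classifier_update` and the restriction of codes to letters and digits are assumptions of this example. `INSERT DATA` only adds triples, which matches the "extended, not edited" idea.

```python
import re

# Base IRI taken from the @base declaration in the Turtle above.
BASE = "https://graphdb.accordproject.eu/resource/aec3po/"


def insert_classifier_update(code: str, label: str) -> str:
    """Return a SPARQL update that adds one class to the classifier scheme."""
    # Assumption: codes are plain letters/digits, so they are safe inside an IRI.
    if not re.fullmatch(r"[A-Za-z0-9]+", code):
        raise ValueError(f"invalid classifier code: {code!r}")
    # Escape backslashes and double quotes for the SPARQL string literal.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "INSERT DATA {\n"
        f"  <{BASE}classifier/{code}>\n"
        f'    skos:notation "{safe_label}" ;\n'
        f"    skos:inScheme <{BASE}classifier> .\n"
        "}"
    )
```

For instance, `insert_classifier_update("ACC", "Accessibility of Building")` produces an update that inserts the two triples for `<classifier/ACC>`.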

beachtom commented 3 weeks ago

I am confused as well. I thought the classifier was just a term in the URI to the document.

Gonsco commented 3 weeks ago

Hi @nataschake @beachtom, sorry, I deleted my last message because I had missed @nataschake's last sentence, which probably gave the opposite meaning to what I had read. Apologies. So, I understand that what you are proposing, @nataschake, is to add the classification information inside the graph. Although this may be an option to consider in the future (it was not my idea to go so far as to add definitions inside the ontology until all this is more mature), I think we can keep the original approach for now and not include this information in the graph.

Gonsco commented 3 weeks ago

Hi, regarding the versions, and just to close this issue definitively and be sure we are on the same page (the nuances are important), could we verify this with the following examples?

Retrospective case scenario (let's say it is not the typical one): Let's say a user wants to generate a graph from the Accessibility of Buildings regulation PDF document approved on 17.9.2015. He/she creates a new project. The document is loaded into the RFT on 2024.05.11. The RASE method is applied and the graph is published on 2024.05.21 using the following URI (considering this is a retrospective case): https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/17-9-2015

Then, on the same day, the RFT user realizes that he/she made a mistake in the RASE tagging, so he/she republishes using the same URI: https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/17-9-2015

A month later, someone reports that the RASE tagging is incorrect and a new version of the graph must be published: https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/21-6-2024

Regular case scenario: Let's say a user wants to generate a graph from the new Accessibility of Buildings regulation PDF document approved on 30.7.2024. He/she creates a new project. The document is loaded into the RFT on 2024.08.01. The RASE method is applied and the graph is published on 2024.08.06 using the following URI: https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/30-7-2024

Then, on the same day, the RFT user realizes that he/she made a mistake in the RASE tagging, so he/she republishes using the same URI: https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/30-7-2024

A month later, someone reports that the RASE tagging is incorrect and a new version of the graph must be published: https://graphdb.accordproject.eu/resource/aec3po/FI/ACC/en-GB/08-9-2024

The tool will retrieve the names of the Ontotext URIs to help the user know how many versions are published. In any case, only the latest version will be saved in the RFT. The previous version cannot be retrieved: there is no way back from JSON to HTML and no versions will be saved within the RFT.
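
As a sketch of how the tool might count published versions from a list of graph URIs: the helper names `parse_version` and `versions_of` are hypothetical, and the list of URIs is assumed to have been retrieved already. Because the examples in this thread write dates both day-first and year-first, the parser accepts either form, which is an assumption of this sketch rather than an agreed rule.

```python
from datetime import date


def parse_version(uri: str) -> date:
    """Read the trailing date segment, accepting D-M-YYYY or YYYY-M-D."""
    last_segment = uri.rstrip("/").rsplit("/", 1)[1]
    first, month, last = (int(part) for part in last_segment.split("-"))
    # A first part above 31 can only be a year; otherwise treat it as the day.
    return date(first, month, last) if first > 31 else date(last, month, first)


def versions_of(regulation_prefix: str, graph_uris: list[str]) -> list[date]:
    """Return the version dates published under one regulation, oldest first."""
    prefix = regulation_prefix.rstrip("/") + "/"
    return sorted(parse_version(uri) for uri in graph_uris if uri.startswith(prefix))
```

The length of the returned list is the number of published versions, and its last element is the latest one.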

Do we agree?

beachtom commented 3 weeks ago

Agree with all.

Except if we have two different versions on the same day we need some way to differentiate this - but I think this is very much an edge case so we should not worry about it.

nataschake commented 3 weeks ago

The tool will retrieve the names of the Ontotext URIs to help the user know how many versions are published. In any case, only the latest version will be saved in the RFT. The previous version cannot be retrieved: there is no way back from JSON to HTML and no versions will be saved within the RFT.

Maybe I misunderstand, but how will the RFT know of the previous versions if only the latest is stored in GraphDB? Or will GraphDB keep old/obsolete versions as well, but not expose them?

Gonsco commented 3 weeks ago

But does it make sense to want to have two versions of something published on the same day? I am open to it, but is there really any reason to want this? Also, is it possible to overwrite a graph version in the Ontotext RuleDB? Does the current API functionality allow this? I think @nataschake said it was read-only, but I am not sure.

Gonsco commented 3 weeks ago

@nataschake In the current approach, GraphDB is supposed to store different versions that are distinguished only by the date (which doesn't have to coincide with the publication date, as you can see in the examples I have provided). One of the questions is whether a graph can be overwritten in GraphDB. Is that possible? (Just to be sure.)

beachtom commented 3 weeks ago

I just don't see us ever publishing two versions on the same day. In reality, there should be robust testing processes in place to ensure things like this do not happen.

We can overwrite them in GraphDB but my argument is we should not.

nataschake commented 3 weeks ago

cURL PUT overwrites the graph, see https://github.com/Accord-Project/API-Development/blob/main/BuildingCodesAndRules/graphdb-webapi.md#put-repositoriesrepositoryidstatements
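
For reference, the same kind of request could be prepared from Python roughly as follows. This sketch only builds the request and does not send it. It assumes the endpoint follows the `PUT /repositories/{repositoryID}/statements` form named in the linked document and that the target graph is selected with a `context` query parameter holding the graph IRI in angle brackets; the server address and repository id passed in are placeholders, not real values from this project.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_overwrite_request(server: str, repository_id: str, graph_uri: str, turtle: str) -> Request:
    """Prepare (but do not send) a PUT that replaces the content of one named graph."""
    # Assumption: the graph is identified by an N-Triples-style IRI, i.e. wrapped in <...>.
    query = urlencode({"context": f"<{graph_uri}>"})
    url = f"{server.rstrip('/')}/repositories/{repository_id}/statements?{query}"
    return Request(
        url,
        data=turtle.encode("utf-8"),
        method="PUT",  # PUT replaces existing statements, which is why it overwrites.
        headers={"Content-Type": "text/turtle"},
    )
```

Whether the RFT should ever issue such a request against an already published graph is exactly the point under discussion in this thread.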

Gonsco commented 3 weeks ago

I understand your point @beachtom, it shouldn't happen, but if it does, I don't think overwriting it would be tragic. It is a different matter when more days have passed: using another date solves that. Then everything fits. Right now I see this as a compromise solution.

Accord-Project / API-Development

Graph URI and Name #13