fusepool / fusepool-vocab

The Fusepool vocabulary

Split ontologies for Patents, Publications, Funding, and general purpose #3

Open retog opened 11 years ago

retog commented 11 years ago

Increase reusability by having smaller dedicated ontologies.

csarven commented 11 years ago

I would normally agree with this approach. However, given that the number of terms is quite small, "splitting" would mean creating a different namespace for each dataset, and that may be premature at this time.

Moreover, there is no "Fusepool" data at this time, i.e., the service primarily uses external datasets, so I'm not sure how that will evolve going forward. There are user annotations; however, something like that should probably use one of the existing annotation ontologies out there.

In other words, Fusepool mostly aggregates and enriches data, as opposed to publishing new, previously unpublished datasets that would require a whole new vocabulary. From that perspective, I actually don't see significant new vocabulary or ontology development. Naturally, some will arise due to shortcomings of the existing ontologies out there (which we can see from the current state of the vocab), but again that's not something to write home about. There would be some ontology alignment, but that's a separate issue.

If anything, I would consider looking more into possible ways to generalize some of the existing terms in the current vocab.

In any case, I suggest that we hold off on this one until Fusepool goes public, see how it evolves a little more, and then come back to this issue.

retog commented 11 years ago

I think there should be one dereferenceable URI per vocabulary, even if the vocabulary is small. If we can use existing vocabularies, we should do so. But having one dereferenceable URI for "a group of vocabularies that is used on the Fusepool platform" makes no sense to me. Vocabularies should be there to model a specific domain and not to group together the terms used by a piece of software (after all, it's about interoperability).

Especially if there is no significant ontology development going on, you shouldn't take such a broad URI for your terms. Instead, use URIs like fusepool.eu/ontologies/patent-description#. If we don't need our own terms and can use existing vocabs, all the better.

csarven commented 11 years ago

Vocabularies are not solely for a specific domain, e.g., schema.org. There is no "should" here as far as adoption goes, but there certainly is a cost associated with each namespace that's created.

The experience so far tells me that the created terms are really there to fill in the gaps of existing vocabularies. We are probably looking at it from different angles, because I don't see how the vocabs that I'm producing have anything to do with the software. They are entirely for the datasets that are published at fusepool.info.

As I already mentioned, IMO, Fusepool doesn't really have its own data* that would warrant several namespaces for the few extra terms it uses for the datasets it happens to (re-)publish. Some of the terms are already general enough that they can be used across datasets, e.g., validFrom, validTo. There is no need to create multiple namespaces and then define the same (or awfully similar) terms in each of them. Doing so would also force us to align those descriptions with one another. These are unnecessary costs that I'd rather avoid.

My preference is to see how things evolve over the next months before jumping too far with this.

*Based on what I've seen, but I am open to being corrected.

retog commented 11 years ago

Apologies for the mistakes in the original version of this comment.

csarven commented 11 years ago

The example of schema.org was to point out that your claim that "vocabularies should be there to model a specific domain" is not true and certainly not written in stone anywhere.

As far as "guessing the meaning from the URI": I'm not sure whether you are evaluating the usefulness of the current state of the vocab's description given that it is stated as version 0.01, but at least it contains rdfs:label values, which are of some use. This is in contrast to ECS, which has no labels. So, yes, a public-ready version would have all the bells and whistles with proper labels, comments, definitions, provenance, or whatever.

What I'm saying is similar to the approach schema.org is taking: let's first see how the terms are used in the datasets that we use. The current terms are coherent enough given that it is version 0.01. If you take a closer look, you'll see that fp:filingOffice will be tied to the list of filing office concepts and concept schemes. Sometimes the domains and ranges are not set right away to allow flexibility. So, IMO, for the time being, it is a good thing that fp:filingOffice has a range of skos:Concept and not something more specific.

There is nothing wrong with "going from Topic to filingOffice". They are themselves broad enough: "Topic" is broad, as the name implies, and the same goes for filingOffice.

What's skos:Topic?

"Data cannot be seen detached from software. The goal is to have your software understandable by software. Real software, I'm not talking about theoretical software running on computer with infinite amounts of resources." is not clear to me, but I'll give it a go:

If anything, data should be detached from particular software so that it can be reused by other software with minimal investment. Stanbol or some other software x is not the only user of this data. You already know that the whole point of having Linked Data is to have some data available in the "giant global graph" without having it tied to a strong schema. So, if some software imposes how the original data should be shaped or described, that's problematic in my eyes. Besides, the original data that Fusepool is working with is just that. It has no knowledge about anything else. Therefore, the vocabs that are used shouldn't have to be aware of some specific software either. Enhancements can be written to do that separately, but certainly not tied to the original data and vocabulary in the first place. One last point on this: if anything, the data and the vocabulary will outlast the software. Software and languages come and go, and depend highly on the group of individuals that wants to work with the data. So, from that perspective, I would certainly not overly tie anything to Stanbol or whatever.

I don't understand this statement too well either: "As "Fusepool doesn't really have its own data" let it be altogether. Introducing new synonyms for existing terms is pointless and counterproductive." I'm not sure which new synonyms for existing terms you are referring to. Can you elaborate?

re: "While the cost of a namespace is more a theoretical one the cost of having to change the IRI for terms used in software is a very real one.", can you justify how you arrive at the conclusion that the cost of namespaces is theoretical? After all, I explicitly pointed out the cost of alignment earlier. The cost of namespaces is well discussed and experienced elsewhere on the Web, so it is probably fruitless for us to delve further into that here. Again, my focus is to minimize that cost, and I really don't see a significant number of domain-specific terms that Fusepool is currently using to justify giving them their own namespaces. I want to see what happens.

Let me inject a point here. I am talking about the vocabulary that's used in the datasets. So far it looks like we don't have too many new terms, hence one namespace might be fine. This is irrespective of having other namespaces for everything else. If ECS, Stanbol, or some other software needs its own, nothing is stopping it from going ahead with that, which I think is well justified on its own anyway. The consumers of the data (and the vocabs it uses) don't need to know anything about the vocabulary of the software that happens to help publish that data. In fact, as a consumer, I wouldn't want the namespaces to be polluted as such.

I don't know what made you say "please before you assign meaning to an IRI check".

"Isn't there an existing IRI with the needed meaning" - we can certainly look more into this and re-use any that are out there that we may have missed in the first place.

"Is the chosen IRI reasonably to be stable." - I hope so :) This is why I'm open on the TLD and the path to the term. I think http://fusepool.eu/vocab# is simple enough for a base, but I don't mind going with individually dereferenceable terms either.

"Especially for IRIs with a fragment identifier this means to assure that the part before the fragment will group a small and coherent set of terms." - exactly. As I've already mentioned several times, the number of terms that Fusepool is coming up with is small, so the fragment identifier suits this well. Moreover, a substantial number of vocabs out there use fragment identifiers (even ones that are significantly larger than Fusepool's), but the reasoning here is not based on that. Finally, for small vocabs, this approach works out well because consuming software only needs to make a single call (unless there is an isDefinedBy to a single resource) - [Aside: cost of namespaces].

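The "single call" point about fragment identifiers can be illustrated with a minimal sketch. It assumes the proposed base http://fusepool.eu/vocab# and uses the term names mentioned in this thread; the slash-style IRIs are a hypothetical alternative shown only for contrast. A client strips the fragment before making an HTTP request, so every hash-style term resolves to the same document, while each slash-style term is its own document.

```python
from urllib.parse import urldefrag

# Assumed base namespace, taken from the proposal in this thread.
HASH_BASE = "http://fusepool.eu/vocab#"
# Hypothetical slash-style alternative, shown only for contrast.
SLASH_BASE = "http://fusepool.eu/vocab/"

# Term names mentioned in this discussion.
TERMS = ["filingOffice", "validFrom", "validTo"]


def document_iri(term_iri: str) -> str:
    """Return the IRI a client requests: the part before any '#'."""
    return urldefrag(term_iri).url


# Distinct documents a consumer would have to fetch in each style.
hash_documents = {document_iri(HASH_BASE + name) for name in TERMS}
slash_documents = {document_iri(SLASH_BASE + name) for name in TERMS}

print(len(hash_documents), "document(s) for hash-style terms")
print(len(slash_documents), "document(s) for slash-style terms")
```
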
retog commented 11 years ago

The existing data uses the property http://fusepool.info/property/filing-office, just to mention an example. Will the new data you rdfize be different? To avoid unnecessary additional changes, can we agree on the IRIs once the meanings are defined and we have checked that no existing ontology has a close enough term?

I agree that labels may be useful. But if you intend to say that not providing labels is worse than not providing definitions, I would disagree. Anyway, instead of making a tu quoque argument here, raise an issue against ECS.

I'm surprised by the range of filingOffice. I thought it would point to an office, being a type of agent, rather than to a concept. But it's pointless to discuss as long as I don't know what the meaning of the property actually is.

csarven commented 11 years ago

The existing data will need to be retransformed (the code is already in). I actually just need to change the TLD to eu. AFAIK, there is no ObjectProperty that conveys a "filing office". But, certainly, we can hunt for it again. The patexpert ontology comes with a DatatypeProperty, pmo:countryOfFiling. I think that's a "shortcoming", hence I created fp:filingOffice so that we can eventually extend the description and also interlink the resource it points to with other resources. The latter point is important because locations are fairly important. pmo:countryOfFiling is still there in the final data.
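
The difference between a literal-valued property and one that points to a resource can be sketched with plain tuples. Everything below is an assumption made for illustration: the pmo namespace is a placeholder (not the real PATExpert IRI), the patent and office IRIs and the label are made up, and only the fp base comes from this thread. The sketch shows that a literal is a dead end, while an IRI value can carry its own description and be interlinked.

```python
# Hypothetical IRIs for illustration; only the fp base comes from the thread.
FP = "http://fusepool.eu/vocab#"
PMO = "http://example.org/pmo#"  # placeholder, not the real PATExpert namespace
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

PATENT = "http://example.org/patent/1"   # made-up patent resource
OFFICE = "http://example.org/office/EP"  # made-up filing office resource

# Each object is tagged as a plain literal or as an IRI naming another resource.
triples = [
    (PATENT, PMO + "countryOfFiling", ("literal", "EP")),
    (PATENT, FP + "filingOffice", ("iri", OFFICE)),
    (OFFICE, RDFS_LABEL, ("literal", "European Patent Office")),
]


def linked_descriptions(subject):
    """For each IRI-valued property of `subject`, collect what is known about its target."""
    result = {}
    for s, p, (kind, value) in triples:
        if s == subject and kind == "iri":
            result[p] = [(p2, o2) for s2, p2, o2 in triples if s2 == value]
    return result


# Only fp:filingOffice leads anywhere; the literal country code cannot be followed.
links = linked_descriptions(PATENT)
print(links)
```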

I got slightly caught up with filing offices primarily being countries in the vocabulary (even though I explicitly say it is a "patent filing office"). Both patexpert (the patent ontology that we use) and Espacenet (which gives the codes) call it a country or organization list. So, it is a grey area because they are both concepts and agents. I'll revisit this next week when I'm in the office, but foaf:Agent is probably the right way to go, because after all, they are about tangible entities. Thanks for catching that.

I have no issues with how ECS models or describes the things it needs to use. I was only making the point that the current state of the dataset vocab is not simply loading the IRI with a meaning and backing off, but that there is at least a label for starters.

csarven commented 11 years ago

Since the PATExpert ontologies are no longer dereferenceable (the domain name was taken over by some not so cool folks), this is also a problem for us. Resolving that, it turns out, also settles our question of whether to have a new namespace for the patent terms that we came up with:

I got in touch with the authors of the PATExpert ontology. After discussing how we can proceed to revive the ontology, the decision was that they will use w3id.org for the namespace redirection and host the ontology again on their servers. Going forward, PATExpert will be in a GitHub repository, so we will coordinate with them there about our new terms.

For the remaining terms in the vocab, i.e., publications and funding: there are no new terms for publications, and all of the terms that are used for funding data are already generalized. That is to say, the terms are not particular to a domain (or perhaps only to a much larger domain), so they can stay within the main vocab namespace.

I would say that this resolves this issue but since you opened it, you can decide ;)