Wimmics / solid-start

Inria SOLID project - Startin'blox
MIT License

Indexing persons by the first letter of their family name #13

Closed lecoqlibre closed 10 months ago

lecoqlibre commented 1 year ago

How can we find persons with a family name starting with a certain letter?

Option 1: using anchors

Here is a kind of hack that indexes persons by the first letter of their family name using anchors (fragment identifiers):

@prefix : <#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

# This is indexing persons with a family name starting with the letter "a".
:a dfc-b:references 
    </agents/persons/person32.ttl>,
    </agents/persons/person12.ttl>.

# This is indexing persons with a family name starting with the letter "b".
:b dfc-b:references  
    </agents/persons/person56.ttl>,
    </agents/persons/person78.ttl>.

# This is indexing persons with a family name starting with the letter "z".    
:z dfc-b:references  
    </agents/persons/person2.ttl>,
    </agents/persons/person63.ttl>.

To find all persons with a family name starting with the letter "z" we can get the location /path/to/the/index.ttl#z.
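
A minimal sketch of that lookup in Python, assuming the index keys are lowercase letters as in the example above. The helper name `index_fragment_for` and the pod URL are hypothetical, not part of any Solid library:

```python
from urllib.parse import urldefrag

def index_fragment_for(index_url: str, family_name: str) -> str:
    """Return the IRI of the index entry for the first letter of family_name."""
    name = family_name.strip()
    if not name:
        raise ValueError("family name is empty")
    # Drop any fragment already present on the index URL.
    base, _ = urldefrag(index_url)
    # Assumption: keys are lowercase, like ':a', ':b' and ':z' in the example.
    return f"{base}#{name[0].lower()}"

print(index_fragment_for("https://pod.example/path/to/the/index.ttl", "Zola"))
```

Note that the fragment is never sent to the server, so the client still downloads the whole index and then keeps only the triples whose subject is this IRI.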

Option 2: using subjects

Another option would be to turn the letters into subjects, like:

@prefix : <#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

# This is indexing persons with a family name starting with the letter "a".
dfc-b:startingWithA dfc-b:references 
    </agents/persons/person32.ttl>,
    </agents/persons/person12.ttl>.

# This is indexing persons with a family name starting with the letter "b".
dfc-b:startingWithB dfc-b:references  
    </agents/persons/person56.ttl>,
    </agents/persons/person78.ttl>.

# This is indexing persons with a family name starting with the letter "z".    
dfc-b:startingWithZ dfc-b:references  
    </agents/persons/person2.ttl>,
    </agents/persons/person63.ttl>.

Option 3: using properties

Another option would be to use a familyNameStartsWith property:

@prefix : <#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

# This is indexing persons with a family name starting with the letter "a".
</agents/persons/person32.ttl> dfc-b:familyNameStartsWith "a".
</agents/persons/person12.ttl> dfc-b:familyNameStartsWith "a".

# This is indexing persons with a family name starting with the letter "b".
</agents/persons/person56.ttl> dfc-b:familyNameStartsWith "b".
</agents/persons/person78.ttl> dfc-b:familyNameStartsWith "b".

# This is indexing persons with a family name starting with the letter "z".    
</agents/persons/person2.ttl> dfc-b:familyNameStartsWith "z".
</agents/persons/person63.ttl> dfc-b:familyNameStartsWith "z".

Option 4: using one file per letter

We could have one index file for the letter "a", one for the letter "b" and one for the letter "z".

File indexA.ttl:

@prefix dfc-b: <https://www.datafoodconsortium.org#>.

# This is indexing persons with a family name starting with the letter "a".
<> dfc-b:references 
     </agents/persons/person32.ttl>,
     </agents/persons/person12.ttl>.

File indexB.ttl:

@prefix dfc-b: <https://www.datafoodconsortium.org#>.

# This is indexing persons with a family name starting with the letter "b".
<> dfc-b:references 
     </agents/persons/person56.ttl>,
     </agents/persons/person78.ttl>.

File indexZ.ttl:

@prefix dfc-b: <https://www.datafoodconsortium.org#>.

# This is indexing persons with a family name starting with the letter "z".  
<> dfc-b:references   
     </agents/persons/person2.ttl>,
     </agents/persons/person63.ttl>.

The naming and location of the index files could be directly defined by the client-to-client standard.

Or we could define a new type like solid:FirstLetterIndex that could be indexed in the TypeIndex for instance.

File typeIndex.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.

<>
    a solid:TypeIndex;
    a solid:ListedDocument.

<#ab09fd> a solid:TypeRegistration;
    solid:forClass solid:FirstLetterIndex;
    solid:instance <indexA.ttl>.

<#zx45yh> a solid:TypeRegistration;
    solid:forClass solid:FirstLetterIndex;
    solid:instance <indexB.ttl>.

<#sk17vb> a solid:TypeRegistration;
    solid:forClass solid:FirstLetterIndex;
    solid:instance <indexZ.ttl>.

Adding solid:forLetter and solid:forProperty properties could tell us directly where to find the appropriate index:

File typeIndex.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:TypeIndex;
    a solid:ListedDocument.

<#ab09fd> a solid:TypeRegistration;
    solid:forClass solid:FirstLetterIndex;
    solid:forLetter "a";
    solid:forProperty dfc-b:familyName;
    solid:instance <indexA.ttl>.

<#zx45yh> a solid:TypeRegistration;
    solid:forClass solid:FirstLetterIndex;
    solid:forLetter "b";
    solid:forProperty dfc-b:familyName;
    solid:instance <indexB.ttl>.

<#sk17vb> a solid:TypeRegistration;
    solid:forClass solid:FirstLetterIndex;
    solid:forLetter "z";
    solid:forProperty dfc-b:familyName;
    solid:instance <indexZ.ttl>.
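
As a rough illustration of how a client might use these two properties, here is a sketch in Python. It models the registrations as plain dicts, which is an assumption: a real client would parse the Turtle with an RDF library. The function name `find_index` is hypothetical.

```python
# The three registrations of typeIndex.ttl, modelled as dicts (assumption).
REGISTRATIONS = [
    {"forClass": "solid:FirstLetterIndex", "forLetter": "a",
     "forProperty": "dfc-b:familyName", "instance": "indexA.ttl"},
    {"forClass": "solid:FirstLetterIndex", "forLetter": "b",
     "forProperty": "dfc-b:familyName", "instance": "indexB.ttl"},
    {"forClass": "solid:FirstLetterIndex", "forLetter": "z",
     "forProperty": "dfc-b:familyName", "instance": "indexZ.ttl"},
]

def find_index(registrations, prop, letter):
    """Return the index documents registered for this property and letter."""
    return [
        r["instance"]
        for r in registrations
        if r["forClass"] == "solid:FirstLetterIndex"
        and r["forProperty"] == prop
        and r["forLetter"] == letter.lower()
    ]

print(find_index(REGISTRATIONS, "dfc-b:familyName", "Z"))
```

The client only has to fetch the one index document returned, instead of opening every registered index to see which letter it covers.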

Option 5: using one file per letter, TypeIndex style

To replace our custom indexes indexA.ttl, indexB.ttl and indexZ.ttl with documents shaped like a regular TypeIndex, we could introduce a new kind of registration, solid:FirstLetterRegistration:

File indexA.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:FirstLetterIndex;
    a solid:ListedDocument.

# This is indexing persons with a family name starting with the letter "a".
<#ab09fd> a solid:FirstLetterRegistration;
    solid:forLetter "a";
    solid:forProperty dfc-b:familyName;
    solid:instance  </agents/persons/person32.ttl>.

<#zx45yh> a solid:FirstLetterRegistration;
    solid:forLetter "a";
    solid:forProperty dfc-b:familyName;
    solid:instance </agents/persons/person12.ttl>.

Consider the same modifications for files indexB.ttl and indexZ.ttl.

We can avoid repeating solid:forLetter and solid:forProperty in each registration if we define them directly on the solid:FirstLetterIndex:

File indexA.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:FirstLetterIndex;
    a solid:ListedDocument;
    solid:forLetter "a";
    solid:forProperty dfc-b:familyName.

# This is indexing persons with a family name starting with the letter "a".
<#ab09fd> a solid:FirstLetterRegistration;
    solid:instance  </agents/persons/person32.ttl>.

<#zx45yh> a solid:FirstLetterRegistration;
    solid:instance </agents/persons/person12.ttl>.

Option 6: one FirstLetterIndex

File typeIndex.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:TypeIndex;
    a solid:ListedDocument.

<#ab09fd> a solid:TypeRegistration;
    solid:forClass solid:FirstLetterIndex;
    solid:forLetter "a", "b", "z";
    solid:forProperty dfc-b:familyName;
    solid:instance <indexFirstLetter.ttl>.

File indexFirstLetter.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:FirstLetterIndex;
    a solid:ListedDocument;
    solid:forProperty dfc-b:familyName.

# This is indexing persons with a family name starting with the letter "a".
<#ab09fd> a solid:FirstLetterRegistration;
    solid:forLetter "a";
    solid:instance  </agents/persons/person32.ttl>, </agents/persons/person12.ttl>.

# This is indexing persons with a family name starting with the letter "b".
<#zx45yh> a solid:FirstLetterRegistration;
    solid:forLetter "b";
    solid:instance  </agents/persons/person56.ttl>, </agents/persons/person78.ttl>.

# This is indexing persons with a family name starting with the letter "z".
<#sk17vb> a solid:FirstLetterRegistration;
    solid:forLetter "z";
    solid:instance  </agents/persons/person2.ttl>, </agents/persons/person63.ttl>.
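
For illustration, the grouping that such a single index encodes could be computed along these lines. This is only a sketch: the family names are invented (the examples above give document paths only), and `build_first_letter_index` is a hypothetical helper.

```python
from collections import defaultdict

def build_first_letter_index(family_names):
    """Group person documents by the lowercase first letter of the family name.

    family_names maps a person document path to that person's family name.
    """
    index = defaultdict(list)
    for person, name in family_names.items():
        name = name.strip()
        if name:  # persons without a family name are left out of the index
            index[name[0].lower()].append(person)
    return dict(index)

# Invented family names, for the sake of the example.
people = {
    "/agents/persons/person32.ttl": "Arnaud",
    "/agents/persons/person12.ttl": "aubert",
    "/agents/persons/person56.ttl": "Bernard",
    "/agents/persons/person2.ttl": "Zola",
}
for letter, persons in sorted(build_first_letter_index(people).items()):
    print(letter, persons)
```

Each key of the resulting mapping corresponds to one solid:FirstLetterRegistration, with the key as solid:forLetter and the list as its solid:instance values.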

pchampin commented 1 year ago

First, a few general remarks:

General remark G1: several of these options rely on out-of-band knowledge. In other words, part of the semantics of the index is left implicit ("which property are we indexing on?", "are we indexing on the whole value, the first letter, the first two letters?..."), which in my view is not good. Caveat: things could be part of the client-to-client protocol (C2CP), in which case they do not need to be explicit in the graphs -- but the more out-of-band knowledge you defer to C2CP, the less flexible your implementations will be.

General remark G2: encoding information in IRIs is generally frowned upon in RDF. Part of it is related to G1 above: this is putting information inside the IRI (which requires specific knowledge to decode) rather than as triples around that IRI (which is standard RDF). Part of it is because it causes a combinatorial explosion of the terms in your vocabulary (cf. comment by @FabienGandon this morning).

General remark G3: I don't think that C2CP should mandate specific filenames or container names. I think that specific predicates should be used to indicate the location of specific containers/files. E.g.: even though the public type index is typically stored in $MY_POD/profile/publicTypeIndex.ttl, this is not required by the C2CP. Instead, it is discovered by following the solid:publicTypeIndex property in my WebID.

From these remarks, let me state a few requirements for indexing strategies:

R1: the property on which we are indexing (here, dfc-b:familyName) should be explicit (i.e. expressed in triples)
R2: the part of the property value we are indexing (here, the first letter) should be explicit (idem)
R3: the indexed value (here "A", "B"...) should be explicit
R4: do not encode values in IRIs
R5: do not mandate file/container names

R1 R2 R3 R4 R5
Option 1
Option 2 (1) (1)
Option 3 (2)
Option 4.a (3)
Option 4.b (3)
Option 4.c (3)
Option 5
Option 6

(1) arguably, the semantics of "first letter" and "A" can be considered part of the built-in semantics of the vocabulary term dfc-b:startingWithA but that conflicts with R4 anyway.

(2) I know that I proposed something like that this morning, but I now realize that a predicate such as dfc-b:familyNameStartsWith suffers (to some extent) from the "combinatorial explosion" problem raised by @FabienGandon. (Consider dfc-b:familyNameEndsWith, dfc-b:familyNameContains, dfc-b:givenNameStartsWith, dfc-b:givenNameEndsWith...).

(3) I call Option 4.a the variant where "naming and location of the index files could be directly defined by the client-to-client standard", so everything is implicit, including file names. Not a fan :smiling_imp: I call Options 4.b and 4.c the two variants where solid:FirstLetterIndex is used in the type index, respectively without and with the additional solid:forLetter and solid:forProperty properties.


I like, in options 4.c, 5 and 6, that the standard type index is used to make our new kinds of index discoverable. That's indeed a nice-to-have.

I also like, in options 4.c and 6, that the type index entries are extended with additional properties (solid:forLetter, solid:forProperty), while remaining backward compatible (i.e. ignoring these properties does not lead to a wrong interpretation, only less discriminating).
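
To make the backward-compatibility point concrete, here is a small Python sketch. The type index entries are modelled as dicts (an assumption), the second entry and its file name are invented, and both function names are hypothetical.

```python
# Type index entries modelled as dicts. The first mirrors Option 6;
# the second (given names) is invented for the comparison.
ENTRIES = [
    {"forClass": "solid:FirstLetterIndex", "forProperty": "dfc-b:familyName",
     "forLetter": {"a", "b", "z"}, "instance": "indexFirstLetter.ttl"},
    {"forClass": "solid:FirstLetterIndex", "forProperty": "dfc-b:givenName",
     "forLetter": {"m"}, "instance": "indexGivenName.ttl"},
]

def legacy_lookup(entries, cls):
    """A client that only knows solid:forClass and ignores the new properties."""
    return [e["instance"] for e in entries if e["forClass"] == cls]

def aware_lookup(entries, cls, prop, letter):
    """A client that also filters on solid:forProperty and solid:forLetter."""
    return [
        e["instance"] for e in entries
        if e["forClass"] == cls
        and e["forProperty"] == prop
        and letter in e["forLetter"]
    ]

legacy = legacy_lookup(ENTRIES, "solid:FirstLetterIndex")
aware = aware_lookup(ENTRIES, "solid:FirstLetterIndex", "dfc-b:familyName", "z")
print(legacy)  # every first-letter index, relevant or not
print(aware)   # only the index covering family names starting with "z"
```

The legacy client gets a superset of what the aware client gets: it has more documents to inspect, but it never misses the right one.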

I don't think having multiple small indexes is a good idea. So all things considered, Option 6 is my favorite.

(except that we should not coin new terms in the solid: namespace -- we do not own that namespace... nitpicking)

lecoqlibre commented 1 year ago

Thank you @pchampin, I also prefer Option 6.

About the prefix naming, I would use solid as long as we don't have something else to propose.

Option 6 could be enhanced by making it more generic, as in Option 7 below.

Option 7: using a generic ValueIndex and ValueRegistration

File typeIndex.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:TypeIndex;
    a solid:ListedDocument.

<#ab09fd> a solid:TypeRegistration;
    solid:forClass solid:ValueIndex;
    solid:forValue "a", "b", "z";
    solid:forPosition 0;
    solid:forProperty dfc-b:familyName;
    solid:instance <indexValue.ttl>.

File indexValue.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:ValueIndex;
    a solid:ListedDocument;
    solid:forPosition 0;
    solid:forProperty dfc-b:familyName.

# This is indexing persons with a family name starting with the letter "a".
<#ab09fd> a solid:ValueRegistration;
    solid:forValue "a";
    solid:instance  </agents/persons/person32.ttl>, </agents/persons/person12.ttl>.

# This is indexing persons with a family name starting with the letter "b".
<#zx45yh> a solid:ValueRegistration;
    solid:forValue "b";
    solid:instance  </agents/persons/person56.ttl>, </agents/persons/person78.ttl>.

# This is indexing persons with a family name starting with the letter "z".
<#sk17vb> a solid:ValueRegistration;
    solid:forValue "z";
    solid:instance  </agents/persons/person2.ttl>, </agents/persons/person63.ttl>.
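
One possible reading of solid:forPosition and solid:forValue, sketched in Python. Two assumptions are made here that the proposal does not spell out: the position is a zero-based character offset, and the comparison ignores case. The function name is hypothetical.

```python
def matches_registration(name: str, position: int, value: str) -> bool:
    """True if `value` occurs in `name` starting at offset `position`."""
    # str.startswith accepts a start offset; an offset past the end of the
    # string simply yields False.
    return name.lower().startswith(value.lower(), position)

print(matches_registration("Zola", 0, "z"))      # first letter is "z"
print(matches_registration("Champin", 0, "z"))   # first letter is not "z"
print(matches_registration("Champin", 2, "am"))  # "am" found at offset 2
```

Under this reading, a value of more than one character (for instance the first two letters) needs no extra property.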

Any thoughts, @FabienGandon?

Option 7a: use regex

File typeIndex.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:TypeIndex;
    a solid:ListedDocument.

<#ab09fd> a solid:TypeRegistration;
    solid:forClass solid:ValueIndex;
    solid:forRegex "/^[abz]/i";
    solid:forProperty dfc-b:familyName;
    solid:instance <indexValue.ttl>.

File indexValue.ttl:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:ValueIndex;
    a solid:ListedDocument;
    solid:forProperty dfc-b:familyName.

# This is indexing persons with a family name starting with the letter "a".
<#ab09fd> a solid:ValueRegistration;
    solid:forRegex "/^[a]/i";
    solid:instance  </agents/persons/person32.ttl>, </agents/persons/person12.ttl>.

# This is indexing persons with a family name starting with the letter "b".
<#zx45yh> a solid:ValueRegistration;
    solid:forRegex "/^[b]/i";
    solid:instance  </agents/persons/person56.ttl>, </agents/persons/person78.ttl>.

# This is indexing persons with a family name starting with the letter "z".
<#sk17vb> a solid:ValueRegistration;
    solid:forRegex "/^[z]/i";
    solid:instance  </agents/persons/person2.ttl>, </agents/persons/person63.ttl>.
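
A sketch of how a client could evaluate such values, assuming they follow the JavaScript-style `/pattern/flags` notation used above and that only the `i` flag needs support. The helper `compile_literal` is hypothetical.

```python
import re

def compile_literal(literal: str) -> re.Pattern:
    """Compile a '/pattern/flags' literal; only the 'i' flag is handled."""
    if not literal.startswith("/"):
        raise ValueError("expected a /pattern/flags literal")
    pattern, sep, flags = literal[1:].rpartition("/")
    if not sep:
        raise ValueError("missing closing '/'")
    if set(flags) - {"i"}:
        raise ValueError(f"unsupported flags: {flags!r}")
    return re.compile(pattern, re.IGNORECASE if "i" in flags else 0)

starts_with_z = compile_literal("/^[z]/i")
print(bool(starts_with_z.search("Zola")))     # True
print(bool(starts_with_z.search("Champin")))  # False
```

Inside a character class, `|` is an ordinary character: `[abz]` already means "a, b or z", whereas `[a|b|z]` would also match a name starting with `|`.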

Option 7b: use a SHACL pattern (sh:pattern)

pchampin commented 1 year ago

I like the idea of using a regexp (using SPARQL regular expressions like SHACL does, but not reusing the property sh:pattern, because it is a property of shapes, and value registrations are not shapes).

However, I foresee a performance vs. genericity tradeoff, here. Assume I am looking for someone named Champin:

FabienGandon commented 1 year ago

I concur with the nice analysis of @pchampin and I would suggest reusing known function names / properties to help adoption. For instance, SPARQL has a function strstarts, so we could have:

@prefix solid: <http://www.w3.org/ns/solid/terms#>.
@prefix dfc-b: <https://www.datafoodconsortium.org#>.

<>
    a solid:ListedDocument;
    solid:forProperty dfc-b:familyName.

# This is indexing persons with a family name starting with the letter "a".
<#ab09fd> a solid:ValueRegistration;
    solid:strstarts "a";
    solid:instance  </agents/persons/person32.ttl>, </agents/persons/person12.ttl>.
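
A sketch of the matching this implies, in Python. SPARQL's STRSTARTS is a case-sensitive prefix test, so the sketch assumes that index keys are stored in lowercase and that the client lowercases the name before comparing. The registration data and function names are illustrative only.

```python
def strstarts(value: str, prefix: str) -> bool:
    """Python counterpart of SPARQL's STRSTARTS: a case-sensitive prefix test."""
    return value.startswith(prefix)

# (prefix, instances) pairs standing in for solid:ValueRegistration entries.
REGISTRATIONS = [
    ("a", ["/agents/persons/person32.ttl", "/agents/persons/person12.ttl"]),
]

def lookup(registrations, family_name):
    """Return the instances of every registration whose prefix matches."""
    key = family_name.lower()  # assumption: prefixes are stored in lowercase
    found = []
    for prefix, instances in registrations:
        if strstarts(key, prefix):
            found.extend(instances)
    return found

print(lookup(REGISTRATIONS, "Arnaud"))
```

Without the lowercasing step, "Arnaud" would not match the key "a", which is worth settling in the vocabulary definition.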