Closed: edgarf closed this issue 4 years ago
This is in production (@excelsior moved this card into "In Production" a few hours ago). We usually only close the cards during the call with Nate and Mike, just to make sure they agree it's implemented properly and is working well in production.
I'm currently working on an approach for this that we can leverage to fix the relevancy problem. I'll update this when I have something to present.
I may have the beginnings of an approach based on our previous meetings, and in particular a suggestion (I think it was from @science, but it was a few weeks ago so I might be wrong) to try tokenizing things and using a simple `=` instead of `REGEX()` to do fast matching.
The objective is to enable full text search and relevance-based sorting while removing as many slow operations as possible (e.g. `REGEX()`, `FILTER()`, `REPLACE()`, `STR()`, `STRLEN()`, etc.) from the query itself. This approach would combine aggressive tokenization at index time with adjustable tokenization at query time in a way that allows relevancy to be partially pre-calculated and partial string matching to be done without the use of `REGEX()`.
Here are the pieces I've built out:
Suppose we have two resources, each with a description, as follows:
"Valuable text with a \"value\" where some text's value appears multiple times"
"Another text that has less value than the first text but has a lot more text in the text"
If we take the text, eliminate unimportant word endings like `'s`, eliminate any special characters, force the text to lowercase, then drop any words less than two characters in length, that gives us a fairly normalized set of text to begin tokenizing.
Again, the index-time tokenization is aggressive here, to allow for the query-time tokenization to be fine-tuned as we try different things. We also want to be able to know how many times each word appears in the string, as this will help us with relevance calculations later on.
The following is some quick and dirty JavaScript you can try out in the browser console with the above strings to see the results:
//Set "text" to one of the example strings above before running
var triples = [];
var words = [];
var count = 0;
//Normalize the text and count how many times each word appears
text.toLowerCase().replace(/('s)|[^A-Za-z0-9 ]/g, "").split(" ").forEach(function(word){
if(word.length < 2){
return;
}
var wordData = words.filter(m => m.word == word)[0];
if(wordData){
wordData.total++;
}
else{
words.push({ word: word, total: 1 });
}
});
//Build one token set per distinct word: the full word, its chopped-down stems, and its count
words.forEach(function(wordData){
triples.push("<_:123> credreg:__tokenSet <_:123/setid" + count + "> .");
triples.push("<_:123/setid" + count + "> credreg:__tokenText '" + wordData.word + "' .");
//Aggressive stemming: chop one letter off the end until only three characters remain
var stemcount = wordData.word.length - 3;
while(stemcount > 0){
triples.push("<_:123/setid" + count + "> credreg:__tokenText '" + wordData.word.substring(0, wordData.word.length - stemcount) + "' .");
stemcount--;
}
triples.push("<_:123/setid" + count + "> credreg:__tokenCount " + wordData.total + " .");
count++;
});
The first part:
text.toLowerCase().replace(/('s)|[^A-Za-z0-9 ]/g, "").split(" ")
does some broad normalization and chops the string up into words.
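To make that concrete, here's what that step returns for the first example description (the quotes and the `'s` disappear, everything is lowercased, and the lone "a" gets dropped by the length check in the next part):

"Valuable text with a \"value\" where some text's value appears multiple times".toLowerCase().replace(/('s)|[^A-Za-z0-9 ]/g, "").split(" ")
//Returns: ["valuable", "text", "with", "a", "value", "where", "some", "text", "value", "appears", "multiple", "times"]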
For each word, the next part:
if(word.length < 2){
return;
}
var wordData = words.filter(m => m.word == word)[0];
if(wordData){
wordData.total++;
}
else{
words.push({ word: word, total: 1 });
}
drops any words shorter than two characters (we may need to tinker with this, but I want to make sure we capture any acronyms), counts the number of times each word appears in the string, and builds a de-duplicated array of words from the string itself (there's probably a cleaner way to do that - as I said, this is quick and dirty).
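For the first example description, that leaves a words array like this (these totals are what drive the token counts in the triples below):

[
    { word: "valuable", total: 1 },
    { word: "text", total: 2 },
    { word: "with", total: 1 },
    { word: "value", total: 2 },
    { word: "where", total: 1 },
    { word: "some", total: 1 },
    { word: "appears", total: 1 },
    { word: "multiple", total: 1 },
    { word: "times", total: 1 }
]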
The next part, for each of the words in that de-duplicated array:
triples.push("<_:123> credreg:__tokenSet <_:123/setid" + count + "> .");
triples.push("<_:123/setid" + count + "> credreg:__tokenText '" + wordData.word + "' .");
var stemcount = wordData.word.length - 3;
while(stemcount > 0){
triples.push("<_:123/setid" + count + "> credreg:__tokenText '" + wordData.word.substring(0, wordData.word.length - stemcount) + "' .");
stemcount--;
}
triples.push("<_:123/setid" + count + "> credreg:__tokenCount " + wordData.total + " .");
count++;
starts adding triples to an array that contains the output. For each word, it creates an untyped bnode that contains the original word, its aggressively-tokenized forms (basically chopping a letter off the end until the word is only three characters long), and the number of times the original, untokenized word appears in the original source string. If we were to turn the resulting pile of triples into JSON, it would look like the value of `ceterms:description__tokens` below:
{
"ceterms:description": { "en": "Valuable text with a \"value\" where some text's value appears multiple times" },
"ceterms:description__tokens": {
"@id": "_:123",
"credreg:__tokenSet": [
{
"@id": "_:123/setid0",
"credreg:__tokenText": [
"valuable", "val", "valu", "valua", "valuab", "valuabl"
],
"credreg:__tokenCount": 1
},
{
"@id": "_:123/setid1",
"credreg:__tokenText": [
"text", "tex"
],
"credreg:__tokenCount": 2
},
{ ... }
]
}
}
If we were to run the above process for both of the resources, and extract the resulting triples, we would get data like this (naturally in production the bnode IDs would all be GUID-based, but I didn't feel like typing those out):
@prefix ceterms: <https://credreg.net/ctdl/terms/> .
@prefix credreg: <https://credreg.net/sparql/> .
<https://credentialengineregistry.org/resources/ce-abc> ceterms:description__tokens <_:123> .
<https://credentialengineregistry.org/resources/ce-abc> credreg:__payload '{ \"property\": \"value\" }' .
<_:123> credreg:__tokenSet <_:123/setid0> .
<_:123/setid0> credreg:__tokenText 'valuable' .
<_:123/setid0> credreg:__tokenText 'val' .
<_:123/setid0> credreg:__tokenText 'valu' .
<_:123/setid0> credreg:__tokenText 'valua' .
<_:123/setid0> credreg:__tokenText 'valuab' .
<_:123/setid0> credreg:__tokenText 'valuabl' .
<_:123/setid0> credreg:__tokenCount 1 .
<_:123> credreg:__tokenSet <_:123/setid1> .
<_:123/setid1> credreg:__tokenText 'text' .
<_:123/setid1> credreg:__tokenText 'tex' .
<_:123/setid1> credreg:__tokenCount 2 .
<_:123> credreg:__tokenSet <_:123/setid2> .
<_:123/setid2> credreg:__tokenText 'with' .
<_:123/setid2> credreg:__tokenText 'wit' .
<_:123/setid2> credreg:__tokenCount 1 .
<_:123> credreg:__tokenSet <_:123/setid3> .
<_:123/setid3> credreg:__tokenText 'value' .
<_:123/setid3> credreg:__tokenText 'val' .
<_:123/setid3> credreg:__tokenText 'valu' .
<_:123/setid3> credreg:__tokenCount 2 .
<_:123> credreg:__tokenSet <_:123/setid4> .
<_:123/setid4> credreg:__tokenText 'where' .
<_:123/setid4> credreg:__tokenText 'whe' .
<_:123/setid4> credreg:__tokenText 'wher' .
<_:123/setid4> credreg:__tokenCount 1 .
<_:123> credreg:__tokenSet <_:123/setid5> .
<_:123/setid5> credreg:__tokenText 'some' .
<_:123/setid5> credreg:__tokenText 'som' .
<_:123/setid5> credreg:__tokenCount 1 .
<_:123> credreg:__tokenSet <_:123/setid6> .
<_:123/setid6> credreg:__tokenText 'appears' .
<_:123/setid6> credreg:__tokenText 'app' .
<_:123/setid6> credreg:__tokenText 'appe' .
<_:123/setid6> credreg:__tokenText 'appea' .
<_:123/setid6> credreg:__tokenText 'appear' .
<_:123/setid6> credreg:__tokenCount 1 .
<_:123> credreg:__tokenSet <_:123/setid7> .
<_:123/setid7> credreg:__tokenText 'multiple' .
<_:123/setid7> credreg:__tokenText 'mul' .
<_:123/setid7> credreg:__tokenText 'mult' .
<_:123/setid7> credreg:__tokenText 'multi' .
<_:123/setid7> credreg:__tokenText 'multip' .
<_:123/setid7> credreg:__tokenText 'multipl' .
<_:123/setid7> credreg:__tokenCount 1 .
<_:123> credreg:__tokenSet <_:123/setid8> .
<_:123/setid8> credreg:__tokenText 'times' .
<_:123/setid8> credreg:__tokenText 'tim' .
<_:123/setid8> credreg:__tokenText 'time' .
<_:123/setid8> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-def> ceterms:description__tokens <_:456> .
<https://credentialengineregistry.org/resources/ce-def> credreg:__payload '{ \"property\": \"value 2\" }' .
<_:456> credreg:__tokenSet <_:456/setid0> .
<_:456/setid0> credreg:__tokenText 'another' .
<_:456/setid0> credreg:__tokenText 'ano' .
<_:456/setid0> credreg:__tokenText 'anot' .
<_:456/setid0> credreg:__tokenText 'anoth' .
<_:456/setid0> credreg:__tokenText 'anothe' .
<_:456/setid0> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid1> .
<_:456/setid1> credreg:__tokenText 'text' .
<_:456/setid1> credreg:__tokenText 'tex' .
<_:456/setid1> credreg:__tokenCount 4 .
<_:456> credreg:__tokenSet <_:456/setid2> .
<_:456/setid2> credreg:__tokenText 'that' .
<_:456/setid2> credreg:__tokenText 'tha' .
<_:456/setid2> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid3> .
<_:456/setid3> credreg:__tokenText 'has' .
<_:456/setid3> credreg:__tokenCount 2 .
<_:456> credreg:__tokenSet <_:456/setid4> .
<_:456/setid4> credreg:__tokenText 'less' .
<_:456/setid4> credreg:__tokenText 'les' .
<_:456/setid4> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid5> .
<_:456/setid5> credreg:__tokenText 'value' .
<_:456/setid5> credreg:__tokenText 'val' .
<_:456/setid5> credreg:__tokenText 'valu' .
<_:456/setid5> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid6> .
<_:456/setid6> credreg:__tokenText 'than' .
<_:456/setid6> credreg:__tokenText 'tha' .
<_:456/setid6> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid7> .
<_:456/setid7> credreg:__tokenText 'the' .
<_:456/setid7> credreg:__tokenCount 2 .
<_:456> credreg:__tokenSet <_:456/setid8> .
<_:456/setid8> credreg:__tokenText 'first' .
<_:456/setid8> credreg:__tokenText 'fir' .
<_:456/setid8> credreg:__tokenText 'firs' .
<_:456/setid8> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid9> .
<_:456/setid9> credreg:__tokenText 'but' .
<_:456/setid9> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid10> .
<_:456/setid10> credreg:__tokenText 'lot' .
<_:456/setid10> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid11> .
<_:456/setid11> credreg:__tokenText 'more' .
<_:456/setid11> credreg:__tokenText 'mor' .
<_:456/setid11> credreg:__tokenCount 1 .
<_:456> credreg:__tokenSet <_:456/setid12> .
<_:456/setid12> credreg:__tokenText 'in' .
<_:456/setid12> credreg:__tokenCount 1 .
It's a lot of triples, but it might be the trick to enabling fast queries. I'm not sure how this will perform at scale, but I've used the above pile of triples along with variations of the following query in a Blazegraph instance running on my machine, and I was able to get successful results.
The query side of things would do more or less the same parsing, normalizing, and tokenizing of the user's input strings as the index-time tokenization, in an effort to make sure that the words are more likely to match up. It also keeps track of (and eventually adds up) the number of matches per-word to determine an overall per-result relevance score. The part of the query that handles that looks like this:
PREFIX ceterms: <https://credreg.net/ctdl/terms/>
PREFIX credreg: <https://credreg.net/sparql/>
SELECT DISTINCT ?subject ?payload (SUM(?tokenCount) AS ?relevance) WHERE {
VALUES ?userText { 'text' 'tex' 'valu' 'value' }
?subject ceterms:description__tokens ?tokens .
?tokens credreg:__tokenSet ?set .
?set credreg:__tokenText ?userText .
?set credreg:__tokenCount ?tokenCount .
?subject credreg:__payload ?payload .
} GROUP BY ?subject ?payload ORDER BY DESC(?relevance)
Obviously the real thing would have more going on, but the above shows how both full text matching and relevance calculation should be possible with this approach. I plan to test with a larger data set when I get the chance, but I wanted to post this now to start getting ideas and feedback. The above query works in my Blazegraph instance, and if I change the contents of the `VALUES` block I can get one, the other, or both results to appear, ordered by their relevance score.
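As a side note, here's a rough sketch (not tested against anything, and both function names are placeholders) of how the API layer might turn a user's search string into that `VALUES` block, reusing the same normalization rules as the index-time tokenizer:

//Sketch: normalize a user's search string with the same rules as the index-time tokenizer
function tokenizeUserQuery(userInput){
    var tokens = [];
    userInput.toLowerCase().replace(/('s)|[^A-Za-z0-9 ]/g, "").split(" ").forEach(function(word){
        if(word.length < 2 || tokens.indexOf(word) > -1){
            return;
        }
        tokens.push(word);
        //Query-time stemming (e.g. also adding 'tex' for 'text') could be layered on here
    });
    return tokens;
}
//Sketch: build the VALUES block, e.g. "text value" -> VALUES ?userText { 'text' 'value' }
function buildValuesClause(userInput){
    return "VALUES ?userText { " + tokenizeUserQuery(userInput).map(function(token){
        return "'" + token + "'";
    }).join(" ") + " }";
}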
My hope is that this will scale well. It should at least scale a lot better than the current approach. In particular, the lack of `FILTER()` should keep it from scanning through so much data.
That's where I'm at right now. It's promising, but I need to try it with a much larger data set. When I get a chance, I'm going to see if there's a way to inject this tokenization into the production dataset dump from last week, or at least a subset of it, and go from there.
One more thing - the deduplication is at the word level, not the token level, so the above data has tokens like `valu` in more than one word (from "valuable" and "value"). This means that a user query which tokenizes to `valu` will match both original words and get an even higher score. This is intended functionality, at least for now.
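To put rough numbers on that (adding up the sample data by hand, so treat these as back-of-the-envelope figures rather than tested output), with `VALUES ?userText { 'text' 'tex' 'valu' 'value' }`:

ce-abc: 'valu' hits the "valuable" set (count 1) and the "value" set (count 2), 'value' hits the "value" set (count 2), and 'text' and 'tex' each hit the "text" set (count 2), so the relevance sums to 1 + 2 + 2 + 2 + 2 = 9.
ce-def: 'text' and 'tex' each hit the "text" set (count 4), and 'valu' and 'value' each hit the "value" set (count 1), so the relevance sums to 4 + 4 + 1 + 1 = 10, which would put ce-def first.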
One thing this wouldn't handle is phrase matching, particularly if the phrase contains a one-character word or special characters. I would really like to be able to solve that without `FILTER()` and/or `REGEX()`, and am open to suggestions.
If there's a way to retrieve the relevance score from AWS's Elasticsearch (and not just which things match at all), that would probably be a superior approach in the end. However, I don't see any reference to it in the documentation: https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html
You could probably hack something together by querying the same thing multiple times with different `minScore` configurations to get a rough estimate, but that is obviously not efficient.
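If we ever went down that road, it would probably look something like this (purely illustrative; runFullTextSearch() is a stand-in for whatever Neptune full-text-search call we'd actually make, and the score range would need to be figured out):

//Hypothetical: estimate a single resource's relevance by bisecting over minScore thresholds
//runFullTextSearch(terms, minScore) is a placeholder for the real Neptune full-text-search query
async function estimateScore(terms, resourceID, lowScore, highScore, iterations){
    for(var i = 0; i < iterations; i++){
        var midScore = (lowScore + highScore) / 2;
        var results = await runFullTextSearch(terms, midScore);
        if(results.some(function(item){ return item["@id"] == resourceID; })){
            lowScore = midScore; //Still matches at this threshold, so the real score is at least midScore
        }
        else{
            highScore = midScore; //Dropped out, so the real score is below midScore
        }
    }
    return lowScore;
}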
Per our meeting today, I'm going to update the approach and examples above to include language tags to enable language-based searching/filtering.
I also plan to try testing this on a larger scale with some real data.
Okay, now bigger and badder:
As before, two resources - now with a language map field, a plain string field, and a URI field:
{
"@id": "https://credentialengineregistry.org/resources/ce-abc",
"ceterms:description": { "en-US": "Valuable text with a \"value\" where some text's value appears multiple times" },
"ceterms:codedNotation": "ABC.123-456",
"ceterms:subjectWebpage": "http://credentialengine.org/page?reallyimportantid=abc123999ZZZ&othervalue=blah%20text%21test%23"
}
{
"@id": "https://credentialengineregistry.org/resources/ce-def",
"ceterms:description": { "en": "Another text that has less value than the first text but has a lot more text in the text", "fr": "Un autre texte qui a moins de valeur que le premier texte mais qui contient beaucoup plus de texte dans le texte" },
"ceterms:codedNotation": "002A",
"ceterms:subjectWebpage": "https://credentialregistry.net/res/item/folder/directory/002A"
}
A more advanced JavaScript implementation to tokenize those three types of data:
function tokenizeResource(resource){
var finalTriples = [];
finalTriples = finalTriples.concat(tokenizeLanguageMap(resource, "ceterms:description"));
finalTriples = finalTriples.concat(tokenizePlainString(resource, "ceterms:codedNotation"));
finalTriples = finalTriples.concat(tokenizeURI(resource, "ceterms:subjectWebpage"));
finalTriples.push(makeTriple(resource["@id"], "credreg:__payload", null, JSON.stringify(resource).replace(/'/g, "\\'")));
return finalTriples;
}
function tokenizeLanguageMap(resource, property){
var resultTriples = [];
var resourceID = resource["@id"];
var map = resource[property];
Object.keys(map).forEach(function(languageCode){
if(Array.isArray(map[languageCode])){
map[languageCode].forEach(function(item){
var bnodeURI = generateBlankNodeID();
resultTriples.push(makeTriple(resourceID, property + "__tokenData", bnodeURI));
resultTriples = resultTriples.concat(tokenizeLanguageString(bnodeURI, item, languageCode));
});
}
else{
var bnodeURI = generateBlankNodeID();
resultTriples.push(makeTriple(resourceID, property + "__tokenData", bnodeURI));
resultTriples = resultTriples.concat(tokenizeLanguageString(bnodeURI, map[languageCode], languageCode));
}
});
return resultTriples;
}
function tokenizeLanguageString(bnodeURI, text, languageCode){
//Hold the data
var triples = [];
var words = [];
var count = 0;
//Normalize the words
text.toLowerCase().replace(/('s)|[^A-Za-z0-9 ]/g, "").split(" ").forEach(function(word){
if(word.length > 1){
addOrIncrementWord(words, word);
}
});
//Add the language code
var languageCodeFull = languageCode.toLowerCase();
var languageCodeFirst = languageCodeFull.split("-")[0];
triples.push(makeTriple(bnodeURI, "credreg:__tokenLanguage", null, languageCodeFull));
if(languageCodeFull != languageCodeFirst){
triples.push(makeTriple(bnodeURI, "credreg:__tokenLanguage", null, languageCodeFirst));
}
//Add the tokens, including the full word
words.forEach(function(wordData){
var countID = bnodeURI + "/setid_" + count;
triples.push(makeTriple(bnodeURI, "credreg:__tokenSet", countID));
triples.push(makeTriple(countID, "credreg:__tokenText", null, wordData.word));
var stemcount = wordData.word.length - 3;
while(stemcount > 0){
triples.push(makeTriple(countID, "credreg:__tokenText", null, wordData.word.substring(0, wordData.word.length - stemcount)));
stemcount--;
}
triples.push(makeTriple(countID, "credreg:__tokenCount", null, wordData.total));
count++;
});
//Return the triples
return triples;
}
function tokenizePlainString(resource, property){
var resourceID = resource["@id"];
var text = resource[property];
//Hold the data
var triples = [];
var words = [];
var count = 0;
//Connect the property
var bnodeURI = generateBlankNodeID();
triples.push(makeTriple(resourceID, property + "__tokenData", bnodeURI));
//Add the normalized full text to the words list
var lowercase = text.toLowerCase();
words.push({ word: lowercase, total: 1 });
//Add the words by breaking the code into pieces
var parts = lowercase.replace(/[^A-Za-z0-9]/g, "<SPLIT>").split("<SPLIT>");
if(parts.length > 1){
parts.forEach(function(word){
addOrIncrementWord(words, word);
});
}
//Add the tokens
words.forEach(function(wordData){
var countID = bnodeURI + "/setid_" + count;
triples.push(makeTriple(bnodeURI, "credreg:__tokenSet", countID));
triples.push(makeTriple(countID, "credreg:__tokenText", null, wordData.word));
triples.push(makeTriple(countID, "credreg:__tokenCount", null, wordData.total));
count++;
});
//Return the triples
return triples;
}
function tokenizeURI(resource, property){
var resourceID = resource["@id"];
var text = resource[property];
//Hold the data
var triples = [];
var words = [];
var count = 0;
//Connect the property
var bnodeURI = generateBlankNodeID();
triples.push(makeTriple(resourceID, property + "__tokenData", bnodeURI));
//Add the normalized full text to the words list, with and without the query string
var lowercase = text.toLowerCase().replace("https://", "").replace("http://", "").replace(/\/$/, "").replace(/\?$/, "");
words.push({ word: lowercase, total: 1 });
//Handle the main URI
lowercase.split("?")[0].replace(/[^A-Za-z0-9/\.]/g, "").split(/[\/\.]/).forEach(function(word){
addOrIncrementWord(words, word);
});
//Handle query parameters
var queryString = lowercase.split("?")[1];
if(queryString){
//Add the queryStringless version of the URI
words.push({ word: lowercase.split("?")[0], total: 1 });
queryString.split("&").forEach(function(parameterAndValue){
var parts = parameterAndValue.split("=");
var values = decodeURIComponent(parts[1]).replace(/[^A-Za-z0-9\.]/g, "<SPLIT>").split("<SPLIT>").filter(function(m){ return m.length > 1 });
if(values.length > 0){
addOrIncrementWord(words, parts[0]);
values.forEach(function(value){
addOrIncrementWord(words, value);
});
}
});
}
//Add the tokens
words.forEach(function(wordData){
var countID = bnodeURI + "/setid_" + count;
triples.push(makeTriple(bnodeURI, "credreg:__tokenSet", countID));
triples.push(makeTriple(countID, "credreg:__tokenText", null, wordData.word));
triples.push(makeTriple(countID, "credreg:__tokenCount", null, wordData.total));
count++;
});
//Return the triples
return triples;
}
function addOrIncrementWord(words, word){
var match = words.filter(function(m){ return m.word == word })[0];
if(match){
match.total++;
}
else{
words.push({ word: word, total: 1 });
}
}
function generateBlankNodeID(){
return "_:" + Math.ceil(Math.random() * 10000); //Total hack. Replace this with something that generates a GUID.
}
function makeTriple(subject, predicate, objectURI, objectLiteral){
return "<" + subject + "> " + predicate + " " + (objectURI ? "<" + objectURI + ">" : (typeof(objectLiteral) == "string" ? "'" + objectLiteral + "'" : objectLiteral)) + " .";
}
Notice that each type gets its own specific tokenization strategy. I'm open to suggestions on any of those.
Next, a JSON visualization of what the resulting objects will look like. Again, this is just here to show the structure. The only thing unique about the language map structure is the inclusion of the `credreg:__tokenLanguage` property (and `ceterms:description__tokenData` now being an array):
{
"ceterms:description": { "en": "Valuable text with a \"value\" where some text's value appears multiple times" },
"ceterms:description__tokenData": [
{
"@id": "_:123",
"credreg:__tokenLanguage": [ "en-us", "en" ],
"credreg:__tokenSet": [
{
"@id": "_:123/setid0",
"credreg:__tokenText": [
"valuable", "val", "valu", "valua", "valuab", "valuabl"
],
"credreg:__tokenCount": 1
},
{
"@id": "_:123/setid1",
"credreg:__tokenText": [
"text", "tex"
],
"credreg:__tokenCount": 2
},
{ ... }
]
},
{
"@id": "_:456",
"credreg:__tokenLanguage": [ "en" ],
"credreg:__tokenSet" : [
{ ... }
]
},
{ ... }
]
}
The resulting triples:
@prefix ceterms: <https://credreg.net/ctdl/terms/> .
@prefix credreg: <https://credreg.net/sparql/> .
<https://credentialengineregistry.org/resources/ce-abc> ceterms:description__tokenData <_:9346> .
<_:9346> credreg:__tokenLanguage 'en-us' .
<_:9346> credreg:__tokenLanguage 'en' .
<_:9346> credreg:__tokenSet <_:9346/setid_0> .
<_:9346/setid_0> credreg:__tokenText 'valuable' .
<_:9346/setid_0> credreg:__tokenText 'val' .
<_:9346/setid_0> credreg:__tokenText 'valu' .
<_:9346/setid_0> credreg:__tokenText 'valua' .
<_:9346/setid_0> credreg:__tokenText 'valuab' .
<_:9346/setid_0> credreg:__tokenText 'valuabl' .
<_:9346/setid_0> credreg:__tokenCount 1 .
<_:9346> credreg:__tokenSet <_:9346/setid_1> .
<_:9346/setid_1> credreg:__tokenText 'text' .
<_:9346/setid_1> credreg:__tokenText 'tex' .
<_:9346/setid_1> credreg:__tokenCount 2 .
<_:9346> credreg:__tokenSet <_:9346/setid_2> .
<_:9346/setid_2> credreg:__tokenText 'with' .
<_:9346/setid_2> credreg:__tokenText 'wit' .
<_:9346/setid_2> credreg:__tokenCount 1 .
<_:9346> credreg:__tokenSet <_:9346/setid_3> .
<_:9346/setid_3> credreg:__tokenText 'value' .
<_:9346/setid_3> credreg:__tokenText 'val' .
<_:9346/setid_3> credreg:__tokenText 'valu' .
<_:9346/setid_3> credreg:__tokenCount 2 .
<_:9346> credreg:__tokenSet <_:9346/setid_4> .
<_:9346/setid_4> credreg:__tokenText 'where' .
<_:9346/setid_4> credreg:__tokenText 'whe' .
<_:9346/setid_4> credreg:__tokenText 'wher' .
<_:9346/setid_4> credreg:__tokenCount 1 .
<_:9346> credreg:__tokenSet <_:9346/setid_5> .
<_:9346/setid_5> credreg:__tokenText 'some' .
<_:9346/setid_5> credreg:__tokenText 'som' .
<_:9346/setid_5> credreg:__tokenCount 1 .
<_:9346> credreg:__tokenSet <_:9346/setid_6> .
<_:9346/setid_6> credreg:__tokenText 'appears' .
<_:9346/setid_6> credreg:__tokenText 'app' .
<_:9346/setid_6> credreg:__tokenText 'appe' .
<_:9346/setid_6> credreg:__tokenText 'appea' .
<_:9346/setid_6> credreg:__tokenText 'appear' .
<_:9346/setid_6> credreg:__tokenCount 1 .
<_:9346> credreg:__tokenSet <_:9346/setid_7> .
<_:9346/setid_7> credreg:__tokenText 'multiple' .
<_:9346/setid_7> credreg:__tokenText 'mul' .
<_:9346/setid_7> credreg:__tokenText 'mult' .
<_:9346/setid_7> credreg:__tokenText 'multi' .
<_:9346/setid_7> credreg:__tokenText 'multip' .
<_:9346/setid_7> credreg:__tokenText 'multipl' .
<_:9346/setid_7> credreg:__tokenCount 1 .
<_:9346> credreg:__tokenSet <_:9346/setid_8> .
<_:9346/setid_8> credreg:__tokenText 'times' .
<_:9346/setid_8> credreg:__tokenText 'tim' .
<_:9346/setid_8> credreg:__tokenText 'time' .
<_:9346/setid_8> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-abc> ceterms:codedNotation__tokenData <_:4455> .
<_:4455> credreg:__tokenSet <_:4455/setid_0> .
<_:4455/setid_0> credreg:__tokenText 'abc.123-456' .
<_:4455/setid_0> credreg:__tokenCount 1 .
<_:4455> credreg:__tokenSet <_:4455/setid_1> .
<_:4455/setid_1> credreg:__tokenText 'abc' .
<_:4455/setid_1> credreg:__tokenCount 1 .
<_:4455> credreg:__tokenSet <_:4455/setid_2> .
<_:4455/setid_2> credreg:__tokenText '123' .
<_:4455/setid_2> credreg:__tokenCount 1 .
<_:4455> credreg:__tokenSet <_:4455/setid_3> .
<_:4455/setid_3> credreg:__tokenText '456' .
<_:4455/setid_3> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-abc> ceterms:subjectWebpage__tokenData <_:733> .
<_:733> credreg:__tokenSet <_:733/setid_0> .
<_:733/setid_0> credreg:__tokenText 'credentialengine.org/page?reallyimportantid=abc123999zzz&othervalue=blah%20text%21test%23' .
<_:733/setid_0> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_1> .
<_:733/setid_1> credreg:__tokenText 'credentialengine' .
<_:733/setid_1> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_2> .
<_:733/setid_2> credreg:__tokenText 'org' .
<_:733/setid_2> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_3> .
<_:733/setid_3> credreg:__tokenText 'page' .
<_:733/setid_3> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_4> .
<_:733/setid_4> credreg:__tokenText 'credentialengine.org/page' .
<_:733/setid_4> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_5> .
<_:733/setid_5> credreg:__tokenText 'reallyimportantid' .
<_:733/setid_5> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_6> .
<_:733/setid_6> credreg:__tokenText 'abc123999zzz' .
<_:733/setid_6> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_7> .
<_:733/setid_7> credreg:__tokenText 'othervalue' .
<_:733/setid_7> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_8> .
<_:733/setid_8> credreg:__tokenText 'blah' .
<_:733/setid_8> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_9> .
<_:733/setid_9> credreg:__tokenText 'text' .
<_:733/setid_9> credreg:__tokenCount 1 .
<_:733> credreg:__tokenSet <_:733/setid_10> .
<_:733/setid_10> credreg:__tokenText 'test' .
<_:733/setid_10> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-abc> credreg:__payload '{"@id":"https://credentialengineregistry.org/resources/ce-abc","ceterms:description":{"en-US":"Valuable text with a \"value\" where some text\'s value appears multiple times"},"ceterms:codedNotation":"ABC.123-456","ceterms:subjectWebpage":"http://credentialengine.org/page?reallyimportantid=abc123999ZZZ&othervalue=blah%20text%21test%23"}' .
<https://credentialengineregistry.org/resources/ce-def> ceterms:description__tokenData <_:4598> .
<_:4598> credreg:__tokenLanguage 'en' .
<_:4598> credreg:__tokenSet <_:4598/setid_0> .
<_:4598/setid_0> credreg:__tokenText 'another' .
<_:4598/setid_0> credreg:__tokenText 'ano' .
<_:4598/setid_0> credreg:__tokenText 'anot' .
<_:4598/setid_0> credreg:__tokenText 'anoth' .
<_:4598/setid_0> credreg:__tokenText 'anothe' .
<_:4598/setid_0> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_1> .
<_:4598/setid_1> credreg:__tokenText 'text' .
<_:4598/setid_1> credreg:__tokenText 'tex' .
<_:4598/setid_1> credreg:__tokenCount 4 .
<_:4598> credreg:__tokenSet <_:4598/setid_2> .
<_:4598/setid_2> credreg:__tokenText 'that' .
<_:4598/setid_2> credreg:__tokenText 'tha' .
<_:4598/setid_2> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_3> .
<_:4598/setid_3> credreg:__tokenText 'has' .
<_:4598/setid_3> credreg:__tokenCount 2 .
<_:4598> credreg:__tokenSet <_:4598/setid_4> .
<_:4598/setid_4> credreg:__tokenText 'less' .
<_:4598/setid_4> credreg:__tokenText 'les' .
<_:4598/setid_4> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_5> .
<_:4598/setid_5> credreg:__tokenText 'value' .
<_:4598/setid_5> credreg:__tokenText 'val' .
<_:4598/setid_5> credreg:__tokenText 'valu' .
<_:4598/setid_5> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_6> .
<_:4598/setid_6> credreg:__tokenText 'than' .
<_:4598/setid_6> credreg:__tokenText 'tha' .
<_:4598/setid_6> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_7> .
<_:4598/setid_7> credreg:__tokenText 'the' .
<_:4598/setid_7> credreg:__tokenCount 2 .
<_:4598> credreg:__tokenSet <_:4598/setid_8> .
<_:4598/setid_8> credreg:__tokenText 'first' .
<_:4598/setid_8> credreg:__tokenText 'fir' .
<_:4598/setid_8> credreg:__tokenText 'firs' .
<_:4598/setid_8> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_9> .
<_:4598/setid_9> credreg:__tokenText 'but' .
<_:4598/setid_9> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_10> .
<_:4598/setid_10> credreg:__tokenText 'lot' .
<_:4598/setid_10> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_11> .
<_:4598/setid_11> credreg:__tokenText 'more' .
<_:4598/setid_11> credreg:__tokenText 'mor' .
<_:4598/setid_11> credreg:__tokenCount 1 .
<_:4598> credreg:__tokenSet <_:4598/setid_12> .
<_:4598/setid_12> credreg:__tokenText 'in' .
<_:4598/setid_12> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-def> ceterms:description__tokenData <_:2882> .
<_:2882> credreg:__tokenLanguage 'fr' .
<_:2882> credreg:__tokenSet <_:2882/setid_0> .
<_:2882/setid_0> credreg:__tokenText 'un' .
<_:2882/setid_0> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_1> .
<_:2882/setid_1> credreg:__tokenText 'autre' .
<_:2882/setid_1> credreg:__tokenText 'aut' .
<_:2882/setid_1> credreg:__tokenText 'autr' .
<_:2882/setid_1> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_2> .
<_:2882/setid_2> credreg:__tokenText 'texte' .
<_:2882/setid_2> credreg:__tokenText 'tex' .
<_:2882/setid_2> credreg:__tokenText 'text' .
<_:2882/setid_2> credreg:__tokenCount 4 .
<_:2882> credreg:__tokenSet <_:2882/setid_3> .
<_:2882/setid_3> credreg:__tokenText 'qui' .
<_:2882/setid_3> credreg:__tokenCount 2 .
<_:2882> credreg:__tokenSet <_:2882/setid_4> .
<_:2882/setid_4> credreg:__tokenText 'moins' .
<_:2882/setid_4> credreg:__tokenText 'moi' .
<_:2882/setid_4> credreg:__tokenText 'moin' .
<_:2882/setid_4> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_5> .
<_:2882/setid_5> credreg:__tokenText 'de' .
<_:2882/setid_5> credreg:__tokenCount 2 .
<_:2882> credreg:__tokenSet <_:2882/setid_6> .
<_:2882/setid_6> credreg:__tokenText 'valeur' .
<_:2882/setid_6> credreg:__tokenText 'val' .
<_:2882/setid_6> credreg:__tokenText 'vale' .
<_:2882/setid_6> credreg:__tokenText 'valeu' .
<_:2882/setid_6> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_7> .
<_:2882/setid_7> credreg:__tokenText 'que' .
<_:2882/setid_7> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_8> .
<_:2882/setid_8> credreg:__tokenText 'le' .
<_:2882/setid_8> credreg:__tokenCount 2 .
<_:2882> credreg:__tokenSet <_:2882/setid_9> .
<_:2882/setid_9> credreg:__tokenText 'premier' .
<_:2882/setid_9> credreg:__tokenText 'pre' .
<_:2882/setid_9> credreg:__tokenText 'prem' .
<_:2882/setid_9> credreg:__tokenText 'premi' .
<_:2882/setid_9> credreg:__tokenText 'premie' .
<_:2882/setid_9> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_10> .
<_:2882/setid_10> credreg:__tokenText 'mais' .
<_:2882/setid_10> credreg:__tokenText 'mai' .
<_:2882/setid_10> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_11> .
<_:2882/setid_11> credreg:__tokenText 'contient' .
<_:2882/setid_11> credreg:__tokenText 'con' .
<_:2882/setid_11> credreg:__tokenText 'cont' .
<_:2882/setid_11> credreg:__tokenText 'conti' .
<_:2882/setid_11> credreg:__tokenText 'contie' .
<_:2882/setid_11> credreg:__tokenText 'contien' .
<_:2882/setid_11> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_12> .
<_:2882/setid_12> credreg:__tokenText 'beaucoup' .
<_:2882/setid_12> credreg:__tokenText 'bea' .
<_:2882/setid_12> credreg:__tokenText 'beau' .
<_:2882/setid_12> credreg:__tokenText 'beauc' .
<_:2882/setid_12> credreg:__tokenText 'beauco' .
<_:2882/setid_12> credreg:__tokenText 'beaucou' .
<_:2882/setid_12> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_13> .
<_:2882/setid_13> credreg:__tokenText 'plus' .
<_:2882/setid_13> credreg:__tokenText 'plu' .
<_:2882/setid_13> credreg:__tokenCount 1 .
<_:2882> credreg:__tokenSet <_:2882/setid_14> .
<_:2882/setid_14> credreg:__tokenText 'dans' .
<_:2882/setid_14> credreg:__tokenText 'dan' .
<_:2882/setid_14> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-def> ceterms:codedNotation__tokenData <_:9621> .
<_:9621> credreg:__tokenSet <_:9621/setid_0> .
<_:9621/setid_0> credreg:__tokenText '002a' .
<_:9621/setid_0> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-def> ceterms:subjectWebpage__tokenData <_:631> .
<_:631> credreg:__tokenSet <_:631/setid_0> .
<_:631/setid_0> credreg:__tokenText 'credentialregistry.net/res/item/folder/directory/002a' .
<_:631/setid_0> credreg:__tokenCount 1 .
<_:631> credreg:__tokenSet <_:631/setid_1> .
<_:631/setid_1> credreg:__tokenText 'credentialregistry' .
<_:631/setid_1> credreg:__tokenCount 1 .
<_:631> credreg:__tokenSet <_:631/setid_2> .
<_:631/setid_2> credreg:__tokenText 'net' .
<_:631/setid_2> credreg:__tokenCount 1 .
<_:631> credreg:__tokenSet <_:631/setid_3> .
<_:631/setid_3> credreg:__tokenText 'res' .
<_:631/setid_3> credreg:__tokenCount 1 .
<_:631> credreg:__tokenSet <_:631/setid_4> .
<_:631/setid_4> credreg:__tokenText 'item' .
<_:631/setid_4> credreg:__tokenCount 1 .
<_:631> credreg:__tokenSet <_:631/setid_5> .
<_:631/setid_5> credreg:__tokenText 'folder' .
<_:631/setid_5> credreg:__tokenCount 1 .
<_:631> credreg:__tokenSet <_:631/setid_6> .
<_:631/setid_6> credreg:__tokenText 'directory' .
<_:631/setid_6> credreg:__tokenCount 1 .
<_:631> credreg:__tokenSet <_:631/setid_7> .
<_:631/setid_7> credreg:__tokenText '002a' .
<_:631/setid_7> credreg:__tokenCount 1 .
<https://credentialengineregistry.org/resources/ce-def> credreg:__payload '{"@id":"https://credentialengineregistry.org/resources/ce-def","ceterms:description":{"en":"Another text that has less value than the first text but has a lot more text in the text","fr":"Un autre texte qui a moins de valeur que le premier texte mais qui contient beaucoup plus de texte dans le texte"},"ceterms:codedNotation":"002A","ceterms:subjectWebpage":"https://credentialregistry.net/res/item/folder/directory/002A"}' .
And an updated query, written to allow for different combinations of inputs:
PREFIX ceterms: <https://credreg.net/ctdl/terms/>
PREFIX credreg: <https://credreg.net/sparql/>
SELECT DISTINCT ?subject ?payload (SUM(?tokenCount) AS ?relevance) WHERE {
VALUES ?userText { 'text' 'valu' 'value' }
VALUES ?userCode { '123' '002a' }
VALUES ?userWebpage { 'credentialregistry' 'reallyimportantid' 'abc123999zzz' }
VALUES ?inLanguage { 'en' }
?subject ceterms:description__tokenData ?tokens1 .
?tokens1 credreg:__tokenSet ?set .
?tokens1 credreg:__tokenLanguage ?inLanguage .
?set credreg:__tokenText ?userText .
?set credreg:__tokenCount ?tokenCount .
?subject ceterms:codedNotation__tokenData ?tokens2 .
?tokens2 credreg:__tokenSet ?set2 .
?set2 credreg:__tokenText ?userCode .
?subject ceterms:subjectWebpage__tokenData ?tokens3 .
?tokens3 credreg:__tokenSet ?set3 .
?set3 credreg:__tokenText ?userWebpage .
?subject credreg:__payload ?payload .
} GROUP BY ?subject ?payload ORDER BY DESC(?relevance)
Tested and working in my JavaScript console and Blazegraph instance.
I am open to feedback on this.
I am currently working on refinements to the tokenization process to try to enable phrase matching to work as efficiently as possible, and make other improvements.
This should not have been closed, as it is not done yet.
I believe this can be closed. Please confirm.
@excelsior @edgarf @mparsons-ce is this an open issue? I'm reviewing all issues still flagged as High Priority. Update please.