gbv / jskos-server

Web service to access JSKOS data
https://coli-conc.gbv.de/api/
MIT License

Filter search on top concepts #128

Open vpeil opened 3 years ago

vpeil commented 3 years ago

This may be beyond the scope of this project, but would be very useful. I would like to filter search results by top concepts.

Any idea how this could be achieved?

stefandesu commented 3 years ago

Hi!

You mean that you want to restrict a search to concepts that are descendants of a certain top concept? I think we had other features in mind that have the same premise (looking at a particular subtree only, e.g. gbv/jskos-metrics#9), so it's certainly worth looking into. I'm wondering how we could implement this efficiently. Maybe @nichtich has an idea?

nichtich commented 3 years ago

We could generate and index the ancestors field and allow filtering with the query parameter ancestor={uri}. This only makes sense for mono-hierarchical vocabularies, or for cases where the selected ancestor is reachable via all broader paths, but we don't need to check this. Adding ancestors to the database could be tricky for arbitrary concept updates, because an updated concept might modify ancestor chains anywhere.

Maybe MongoDB's $graphLookup can help. The field to build the graph from is broader[0].uri.
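
To make the ancestors idea a bit more concrete, here is a minimal sketch in plain JavaScript. It is not taken from jskos-server: the helper name computeAncestors, the in-memory list of concepts, and the ex: URIs are all made up for illustration. It assumes a mono-hierarchical vocabulary, so only broader[0].uri is followed.

```javascript
// Hypothetical helper (not part of jskos-server): derive the list of ancestor
// URIs for every concept by following broader[0].uri upwards.
function computeAncestors(concepts) {
  // Index the direct parent of each concept by URI (mono-hierarchy assumed)
  const parentOf = new Map()
  for (const concept of concepts) {
    const parent = concept.broader && concept.broader[0] && concept.broader[0].uri
    parentOf.set(concept.uri, parent || null)
  }
  const result = new Map()
  for (const concept of concepts) {
    const ancestors = []
    const seen = new Set([concept.uri]) // guards against cycles in broken data
    let current = parentOf.get(concept.uri)
    while (current && !seen.has(current)) {
      ancestors.push(current)
      seen.add(current)
      current = parentOf.get(current)
    }
    result.set(concept.uri, ancestors)
  }
  return result
}

// Toy data with made-up URIs
const concepts = [
  { uri: "ex:A" },
  { uri: "ex:AB", broader: [{ uri: "ex:A" }] },
  { uri: "ex:ABC", broader: [{ uri: "ex:AB" }] },
  { uri: "ex:X" },
]
const ancestors = computeAncestors(concepts)

// In-memory equivalent of filtering with ancestor=ex:A
const underA = concepts
  .filter(concept => ancestors.get(concept.uri).includes("ex:A"))
  .map(concept => concept.uri)
console.log(underA)
```

The sketch recomputes everything from scratch, which sidesteps the update problem mentioned above. Keeping the field correct incrementally would mean recomputing the chains of all descendants whenever a concept's broader changes.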

vpeil commented 3 years ago

Yes, my use case is monohierarchical.

I will have a look at MongoDB's $graphLookup. I will post my findings here in any case, but this will take some time.

stefandesu commented 3 years ago

$graphLookup can definitely be used to implement this, but I'm not sure if it's possible to do it efficiently, i.e. without having to go through the whole Concepts collection.

stefandesu commented 3 years ago

I played around with $graphLookup a little bit (also because it might be useful for a different issue) and found something that could work, however only in a restricted fashion:

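db.getCollection('concepts').aggregate([
    {
        // Start from the desired parent concept
        $match: { uri: "http://rvk.uni-regensburg.de/nt/A" }
    },
    {
        // Recursively collect concepts whose broader.uri points at a concept already found
        $graphLookup: {
            from: "concepts",
            startWith: "$uri",
            connectFromField: "uri",
            connectToField: "broader.uri",
            as: "descendant",
            // Search condition applied to the concepts that are traversed
            restrictSearchWithMatch: {
                _keywordsLabels: { $regex: "^BIB" }
            }
        }
    },
    {
        // One output document per descendant
        $unwind: "$descendant"
    },
    {
        // Return the descendant concepts themselves
        $replaceRoot: { newRoot: "$descendant" }
    }
])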

So we match only the desired parent concept (it doesn't have to be a top concept), then we do a graph lookup like @nichtich described, but in reverse (matching from uri to broader.uri), and use restrictSearchWithMatch to specify the search conditions. Then we unwind and replace the root.

Why did I say "restricted fashion"? The problem is that restrictSearchWithMatch doesn't seem to work with text indexes, and the query needs to be restrictive enough that the results fit in memory. For reasons I don't fully understand, MongoDB has to load ALL results into memory first, even if we only want a subset (e.g. the first 100). So the above example without restrictSearchWithMatch will not fit into memory. I don't see a technical reason for this: either this use case is too uncommon for MongoDB to support, or I'm missing something.

I'm mostly writing this down to document my findings. I still haven't fully grasped $lookup and $graphLookup and keep expecting them to do things they apparently cannot do. As mentioned somewhere else, sometimes I think a relational database would have been a better choice.

(@nichtich's first solution, i.e. generating and indexing an ancestors field, would still work and be very performant because it could use an index. The downside is, as always with these things, storage space. Having ancestors in the database for every concept takes up quite a lot of space.)
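
As a rough sketch of what the indexed variant could look like on the query side, the following builds a MongoDB filter from the two query parameters. Everything here is an assumption for illustration: the helper name buildConceptFilter, the idea that ancestors is stored like broader (an array of objects with a uri), and the index mentioned in the comment. None of it is existing jskos-server code.

```javascript
// Hypothetical helper: build a MongoDB filter from query parameters.
// Assumes each concept document carries ancestors: [{ uri }, ...] and that an
// index exists, e.g. db.concepts.createIndex({ "ancestors.uri": 1 }).
function buildConceptFilter({ search, ancestor } = {}) {
  const filter = {}
  if (ancestor) {
    // Matches if any element of the ancestors array has this URI
    filter["ancestors.uri"] = ancestor
  }
  if (search) {
    // Escape regex metacharacters so user input is treated literally
    const escaped = search.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    // Prefix match on the keyword field used in the $graphLookup example above
    filter._keywordsLabels = { $regex: "^" + escaped }
  }
  return filter
}

const filter = buildConceptFilter({
  search: "BIB",
  ancestor: "http://rvk.uni-regensburg.de/nt/A",
})
console.log(JSON.stringify(filter))

// In the mongo shell this could then be used as (not run here):
// db.getCollection('concepts').find(filter).limit(100)
```

Unlike the $graphLookup pipeline, this is an ordinary find with a limit, so it would not need to hold the whole subtree in memory. The cost is the extra storage for the ancestors field noted above.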