gbv / coli-ana

API to analyze DDC numbers
https://coli-conc.gbv.de/coli-ana/app/
MIT License
2 stars 0 forks source link

Get list of decompositions that use a given class #20

Closed nichtich closed 3 years ago

nichtich commented 3 years ago

Given a class such as T1--09044 return all known build numbers that make use of this class (e.g. 700.90440747471). This might become large result sets and requires a proper database index. Maybe directly use SQL tables and create JSKOS from the pieces instead of storing JSKOS in the database?

We could first do some analysis offline with converted files of decomposed DDC numbers (requires reduction via hierarchy as described in #17).

stefandesu commented 3 years ago

PostgreSQL should have no problem indexing JSON blobs, so it should be possible to do this without changing the data structure. 👍

stefandesu commented 3 years ago

After experimenting a bit, I've found out that there's a bit difference between using jsonb[] and jsonb containing an array. With the former, I can't find a way to perform the query needed for this. So we might need to perform a database transition.

stefandesu commented 3 years ago

Okay, here's what I've found: It is possible to perform the query whether we use jsonb[] or jsonb (containing an array). Here are the two queries:

For jsonb[]:

select *
from data
where '{ "notation": ["T1--09044"] }' <@ any("memberList");

For jsonb (containing an array):

select *
from data
where "memberList" @> '[{ "notation": ["T1--09044"] }]';

However, with jsonb[], I have not found a way to index the data in a way that the query uses the index. With jsonb (containing an array), we can use the following index:

CREATE INDEX data_memberList_gin_idx 
ON data 
USING gin ("memberList");

With this, the query above is able to use the index and perform in <1ms instead of ~20ms (with the current dataset of 600 rows).

Since this project is in a fairly early stage, I don't see an issue with breaking the current database structure since we can manually fix it on our server (which is very likely the only running instance of this).

What do you think, @nichtich?

nichtich commented 3 years ago

Ok I'd propose query syntax analyze?memberNotation=T1--09044 and analyze?member=http://dewey.info/class/1--0904/e23/.

stefandesu commented 3 years ago

I implemented this in dev, but with a single query parameter member that takes either a notation or a URI. I thought memberNotation was too clunky.

stefandesu commented 3 years ago

We should watch how this performs with larger datasets, but in theory it should work well with the index.