abartov / bybeconv

Project Ben-Yehuda's content management system.
https://benyehuda.org/

Support CorporateBodies in author list and filters #304

Closed abartov closed 1 month ago

abartov commented 3 months ago

(I think the use case for filter by corp's inception/dissolution date is extremely rare, and not worth supporting.)

Work off the merge_involved_authorities branch.

damisul commented 2 months ago

Hi @abartov !

Regarding this ticket.

Currently we have implemented corporate bodies using a separate table. This approach has pros and cons. The main advantage is that we know all records in the corporate_bodies table have the same structure and use the same fields.

But when it comes to the authors browse page, we face a problem: it is built on the ElasticSearch PeopleIndex, which currently does not include any information about corporate_bodies.

I see the following ways to resolve this:

  1. We can update PeopleIndex to include information about corporate_bodies as well (and probably rename it to AuthoritiesIndex). This is possible, but will require custom reindexing logic.

  2. We can create a separate index for CorporateBodies only, and try to use Chewy's multi-index search. From the docs it looks doable, but I have never done it before.

  3. The third way is the most radical one. Instead of creating a separate table for corporate_bodies, we can store them in the same table as persons (we can also consider renaming it to Authorities). With this we easily get a single index and avoid adding polymorphic relations everywhere in the code, which I believe will make the overall code much simpler. But it will require dropping a significant part of the changes already made in this branch.

What do you think?

Sidenote: I'm not a native English speaker, so to me 'Authority' sounds a bit misleading here. My first association with 'authority' is the police or some legislative organ... Maybe we can consider something like 'Creator'? If 'Authority' is special terminology used by librarians, then we're completely OK here.

abartov commented 2 months ago

Hi.

I agree these are three ways to go. Note that the searchbar already uses multi-index searching, to search PeopleIndex, ManifestationsIndex, and DictIndex all at once. The search results are displayed with polymorphism logic.

It seems to me that merging the two logical tables into one ES index (AuthoritiesIndex) would be the best compromise -- conveniently denormalized entity for ES, but proper normalized tables in DB for CRUD and back-end logic.

Regarding the term: yes, it does sound strange, but it is indeed the professional term used by librarians to describe the person/org that is responsible for an item. Authority as in authorship, not as in the monopoly-on-force sense that's the ready association for muggles. :)

damisul commented 2 months ago

Ok, thanks!

Actually, if we have a working example of multi-index search, I'd like to give it a try first. I just want to avoid adding non-standard indexing logic.

abartov commented 2 months ago

Look at ManifestationsSearch#index. It's trivial code, basically all done by Chewy.

(btw, I think composing several DB models into one ES entity/index is fairly common practice, no?)

damisul commented 2 months ago

Well, essentially yes, and Chewy supports this. As far as I can see, Chewy supports reindexing either from a scope or from an array of records. A single scope spanning multiple tables is not possible, so we'd need to pass all records as an array, which requires loading all records from people and corporate_bodies into memory.
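One way to soften the memory concern, sketched here in plain Ruby, is to feed the importer fixed-size batches drawn from each source in turn. This assumes the importer accepts one array of records at a time; `each_authority_batch` is a hypothetical helper, not part of Chewy or this codebase, and the sources stand in for whatever enumerates rows from people and corporate_bodies.

```ruby
# Hypothetical helper (not part of Chewy): walks several record sources in
# turn and yields arrays of at most `batch_size` records each.
# Memory stays bounded only if each source produces its records
# incrementally rather than from a fully loaded array.
def each_authority_batch(sources, batch_size: 1000, &block)
  sources.each do |source|
    source.each_slice(batch_size, &block)
  end
end

# Usage: the ranges/arrays below are placeholders for real record sources.
batches = []
each_authority_batch([(1..5), %w[a b c]], batch_size: 2) { |b| batches << b }
```

Each yielded batch could then be handed to the index import call, so only one batch is held at a time instead of both tables at once.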

Another issue is that we'll need to update the indexing code for every field to be like: field: if CorporateBody then ..., else if Person then ... . We can reduce the hurdle to some degree by using the same names in both tables for common attributes, but...
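A minimal sketch of how shared attribute names limit the per-type branching when building one index document from two models. The structs and field names (`gender`, `location`) are hypothetical stand-ins, not the project's actual schema, and `authority_document` is an illustrative helper rather than Chewy API.

```ruby
# Hypothetical stand-ins for the two models; field names are assumptions.
Person = Struct.new(:id, :name, :copyright_status, :gender, keyword_init: true)
CorporateBody = Struct.new(:id, :name, :copyright_status, :location, keyword_init: true)

# Attributes with the same name in both models need no branching at all.
COMMON_FIELDS = %i[id name copyright_status].freeze

# Only type-specific attributes need a per-type rule.
SPECIFIC_FIELDS = {
  Person => %i[gender],
  CorporateBody => %i[location]
}.freeze

# Builds one flat document per record, tagged with its type.
def authority_document(record)
  doc = { type: record.class.name }
  (COMMON_FIELDS + SPECIFIC_FIELDS.fetch(record.class)).each do |field|
    doc[field] = record.public_send(field)
  end
  doc
end
```

With this shape, adding a common attribute means touching one list, while the branching stays confined to the `SPECIFIC_FIELDS` table.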

I still believe using a common table for common attributes in the DB would be a better option. In fact, we can create an Authorities table and put all data common to Person and CorporateBody there (e.g. copyright_status). We can then add a link to the Authority record to the People and Corporate_bodies tables, so those two tables store only data specific to people/corporate bodies. This way the data at the DB level will be even more normalized, and it will allow to

Let's suppose we need to find all works whose copyright differs from the authority's copyright (the incongruous_copyright report, see my question https://github.com/abartov/bybeconv/issues/209#issuecomment-2077817507). With the current DB structure (in the merge_involved_authorities branch) we'd need to outer join the InvolvedAuthorities table to both People and CorporateBody using a polymorphic key, while in the more normalized design I've described above we can do a simple join on the Authorities table and check copyright_status there. Yes, we could implement this with Ruby loops instead of SQL, and the Ruby code would be pretty fancy, but it would be much slower as well.
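To make the shape of the normalized lookup concrete, here is a small in-memory model of the join. It only illustrates that a single key (`authority_id`) is enough once copyright_status lives on Authorities; it is not a substitute for doing the join in SQL. All struct and column names are assumptions, not the project's actual schema.

```ruby
# Hypothetical, simplified records; names are assumptions for illustration.
Authority = Struct.new(:id, :copyright_status, keyword_init: true)
Work = Struct.new(:id, :title, :copyright_status, keyword_init: true)
InvolvedAuthority = Struct.new(:work_id, :authority_id, keyword_init: true)

# Returns works whose copyright status differs from that of at least one
# involved authority. Every link resolves through one key, with no
# per-type dispatch between people and corporate bodies.
def incongruous_works(works, involved, authorities)
  status_by_authority = authorities.to_h { |a| [a.id, a.copyright_status] }
  statuses_by_work = involved.group_by(&:work_id).transform_values do |links|
    links.map { |link| status_by_authority.fetch(link.authority_id) }
  end
  works.select do |work|
    statuses_by_work.fetch(work.id, []).any? { |s| s != work.copyright_status }
  end
end
```

With a polymorphic key, the same lookup would need one status table per authority type and a branch on the type column for every link.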

And I expect a lot of such situations. So in my opinion two isolated tables will significantly increase the complexity of the code.