sub-corpora as part of data handler

BalduinLandolt commented 3 years ago

groupings, as retrieved by searches, should be stored in subcorpora.

all set-theory operations (union, intersection, ...) should be possible on subcorpora.

metadata should be retrievable on the basis of subcorpora
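
To make the three points above concrete, here is a minimal sketch that assumes a subcorpus is nothing more than a set of manuscript IDs. The IDs, the `metadata` table and the `metadata_for` helper are all hypothetical and not taken from the actual data handler.

```python
# Hypothetical sketch: subcorpora as frozensets of manuscript IDs.
# The IDs and the metadata table are invented for illustration only.
metadata = {
    "ms-001": {"title": "Manuscript A"},
    "ms-002": {"title": "Manuscript B"},
    "ms-003": {"title": "Manuscript C"},
}

# Two search results, stored as subcorpora.
search_a = frozenset({"ms-001", "ms-002"})
search_b = frozenset({"ms-002", "ms-003"})

# Python's built-in set operators cover the usual set-theory operations.
union = search_a | search_b
intersection = search_a & search_b
difference = search_a - search_b
symmetric_difference = search_a ^ search_b


def metadata_for(subcorpus, table):
    """Return the metadata rows for every ID in the subcorpus, ordered by ID."""
    return [table[ms_id] for ms_id in sorted(subcorpus) if ms_id in table]


shared_titles = [row["title"] for row in metadata_for(intersection, metadata)]
```

Because the subcorpus only holds IDs, metadata lookup stays a separate step and the set operations stay cheap.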

kraus-s commented 2 years ago

I can see this quickly getting out of hand if we store every search operation as a subcorpus in the data handler. How about a corpus-building pipeline that allows the results of different search operations to be combined, e.g. via a little plus button at the bottom that adds the results to a corpus in the session state? That corpus could then be saved as a file/pickle and loaded back into the handler later, so it's not permanently in memory. If that sounds like what you had in mind, I'll get started on it.
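
A rough sketch of that pipeline, under stated assumptions: `session_state` is a plain dict standing in for whatever session object the UI provides, and `add_results`, `save_corpus` and `load_corpus` are hypothetical helper names, not existing functions in the project.

```python
import pickle
from pathlib import Path

# Stand-in for the UI's session state (assumption: a plain dict is enough here).
session_state = {"corpus": set()}


def add_results(state, result_ids):
    """What the 'plus button' might do: merge one search's results into the working corpus."""
    state["corpus"] |= set(result_ids)


def save_corpus(state, path):
    """Write the working corpus to disk so it need not stay in memory."""
    with Path(path).open("wb") as fh:
        pickle.dump(sorted(state["corpus"]), fh)


def load_corpus(path):
    """Read a previously saved corpus back as a set of IDs."""
    with Path(path).open("rb") as fh:
        return set(pickle.load(fh))


# Combine the results of two separate searches (IDs are invented).
add_results(session_state, ["ms-001", "ms-002"])
add_results(session_state, ["ms-002", "ms-003"])
```

One caveat with this approach: unpickling can execute arbitrary code, so only files the application wrote itself should be loaded this way.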

BalduinLandolt commented 2 years ago

I'm not sure we really want to make the subcorpora persistent, at least at first... And if they only last for the runtime, we don't need to worry about things getting out of hand right away. This would allow for implementing a nice prototype that we can then discuss with team meckern/product owners. ;)
(Also, I dislike the term "subcorpus" more and more. Maybe we could come up with something better... maybe "group" or so? Do you have better ideas?)

The way I envisioned it was roughly as follows:

we don't expose too much of the groups to the user.
on the search page there is a "store results as group" button, which then lets you enter a name for the group
then we could have a group page, where users get a list of all existing groups, with options for removing and editing them.
when editing groups, it should be possible to add/remove single entities manually; but more importantly, to combine multiple groups (of course with the option 'intersection' or 'union' ('shared' vs 'combined'))
then for visualization or export, we could simply use a group
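
The workflow above could be prototyped roughly like this. It is a sketch only: the module-level `groups` dict is an assumed runtime-only store, and `store_group`, `remove_group`, `edit_group` and `combine_groups` are hypothetical names chosen for illustration.

```python
# Hypothetical runtime-only registry: group name -> set of entity IDs.
groups = {}


def store_group(name, entity_ids):
    """'Store results as group': save the current search results under a name."""
    groups[name] = set(entity_ids)


def remove_group(name):
    """Delete a group; silently ignore names that do not exist."""
    groups.pop(name, None)


def edit_group(name, add=(), remove=()):
    """Manually add or remove single entities from an existing group."""
    groups[name] = (groups[name] | set(add)) - set(remove)


def combine_groups(new_name, names, mode="combined"):
    """Combine existing groups: 'combined' is the union, 'shared' the intersection."""
    sets = [groups[n] for n in names]
    if not sets:
        raise ValueError("need at least one group to combine")
    if mode == "combined":
        result = set().union(*sets)
    elif mode == "shared":
        result = set.intersection(*sets)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    groups[new_name] = result
    return result


# Example session with invented IDs.
store_group("search 1", ["ms-001", "ms-002"])
store_group("search 2", ["ms-002", "ms-003"])
combine_groups("both", ["search 1", "search 2"], mode="shared")
combine_groups("either", ["search 1", "search 2"], mode="combined")
edit_group("either", add=["ms-004"], remove=["ms-001"])
```

Keeping the registry in memory only matches the idea of groups that last for the runtime; persistence could be layered on later if needed.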

Does this make sense to you?

in terms of architecture, I think it should be a class that holds:

the group's name
a list of entities (or more precisely, a list of IDs of entities)
creation date
maybe the entity type?
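
As a starting point for discussion, the fields listed above map fairly directly onto a dataclass. The class name `Group`, the field names and the string-valued `entity_type` are assumptions; in particular, whether the entity type belongs here at all is still an open question.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Group:
    """Sketch of a group: a named list of entity IDs plus a creation date."""

    name: str
    # IDs of the entities in the group, not the entities themselves.
    entity_ids: List[str] = field(default_factory=list)
    # Set automatically when the group is created.
    created: datetime = field(default_factory=datetime.now)
    # Optional, e.g. "manuscript"; left as None while the question is undecided.
    entity_type: Optional[str] = None


# Example with invented IDs.
example = Group(name="my search", entity_ids=["ms-001", "ms-002"])
```

Using `default_factory` for the list and the timestamp ensures each group gets its own list and its own creation time rather than sharing one default.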

the thing I'm least sure about is which entities we allow in groups. Only manuscripts? Or also persons and texts?

Long story short: let me know if you plan on working on that, or if I should! Would be nice to get this done as soon as possible...

arbeitsgruppe-digitale-altnordistik / Sammlung-Toole

sub-corpora as part of data handler #65