arbeitsgruppe-digitale-altnordistik / Sammlung-Toole

A new look on Handrit.is data
https://arbeitsgruppe-digitale-altnordistik.github.io/Sammlung-Toole/
MIT License
0 stars 0 forks source link

Replace the in-memory architecture of the handler with an SQLite DB #91

Closed kraus-s closed 2 years ago

kraus-s commented 2 years ago

This PR completely overhauls the backend, i.e. introducing a proper backend <-> frontend architecture.

kraus-s commented 2 years ago

Everything should be working off the new backend now. There are still some loose threads and there is a lot of clean-up to be done. Will make new branch for code cleaning and streamlining.

BalduinLandolt commented 2 years ago

handrit fixed their data, so I could test it now... there are still some bugs that will need fixing before this can be merged in good consciousness: we don't want to have to create another stable branch, right?

from what I can see, changing the return type from the queries from lists of IDs to dataframes with actual data (which I'm not sure is what one wants as a search result), caused lots of misaligning in the typing, messed up the UI in places, and broke saving results as groups (always gets a group of 21 manuscripts, which presumably is the number of columns in our dataframe).

the fundamental question we have to ask ourselves is: what should search results be? if we search for manuscripts that meet condition XY, do we expect the metadata of those manuscripts or just identifiers of the manuscripts? and analogously, with people and texts: do we expect metadata on people/texts (which we don't have at the moment) or do we expect identifiers.
Then we need to define the accessible API on the basis of this decision, and the backend should deliver exactly this (and nothing else), independently of how it actually works; just as the frontend should consume just that, no matter how it displays it.
But each time we change this API, we'll inevitably break all the code that uses this API from the front or back end side...

kraus-s commented 2 years ago

Fair point about IDs vs. full data to be delivered. My decision was based on efficiency: When retrieving full MS metadata, we run one query, and a cheap one at that as it doesn't even have to filter anything. If we want IDs in the first place and load metadata later on, we will get a cleaner front end and leaner result handling in the front end, but run two queries on the database to achieve the same result.

But I guess you're right, it is more intuitive to see a list of the search results before seeing the actual results. Also, fixing the backend to back to delivering IDs instead of full results should be easier than to fix the stuff I broke by changing it in the first place.

BalduinLandolt commented 2 years ago

I had another look at the code just now, and I'm starting to understand a bit better what's happening.

Performance is obviously an important concern, I'm with you on that.
But I think we could go for a compromise that wouldn't hit performance hard, really, and still be clean to implement:

simple_search() can stay the way it is, and return a DataFrame (maybe rename it to indicate that it retrieves manuscript metadata).
ms_x_ppl() etc. could be changed to return a list. This could easily be done by simplifying the query form "SELECT * FROM manuscripts WHERE full_id IN (SELECT msID FROM junctionPxM WHERE persID = '{i}')" to something along the lines of "SELECT msID FROM junctionPxM WHERE persID = '{i}'". So basically, we only do the junction table lookup at this point, and leave the manuscript metadata lookup for a second query.

That should already fix the frontend/API issues that arose, should perform quite well, and not be too much work on your PR.

What do you think?

kraus-s commented 2 years ago

That was pretty much what I was thinking. Only think that bugs me about the sqlite3 implementation is that the cursor object returns a list of tuples, and even when grabbing only one column, it will return something like [('MS 1', ), ('MS 2', )]. Nothing too difficult to solve, just annoying enough so I didn't start working on it yesterday...

BalduinLandolt commented 2 years ago

That was pretty much what I was thinking. Only think that bugs me about the sqlite3 implementation is that the cursor object returns a list of tuples, and even when grabbing only one column, it will return something like [('MS 1', ), ('MS 2', )]. Nothing too difficult to solve, just annoying enough so I didn't start working on it yesterday...

Perfect!

yea, this sucks... but should be easy enough with a list comprehension:

flatt_list = [t[0] for t in list_of_tuples]
kraus-s commented 2 years ago

Finally done!