Big Source Changes, Much Improvements, Wow.

ahankinson commented 3 months ago

This PR contains a whole boatload of improvements. In talking with @dchiller about this we thought we would do a monolithic PR to start with, and then see if anything needed to be changed before merging.

At its core this PR changes the way Source records are captured. With the introduction of the Institution model, the siglum and the institution name are now found in that table, and the Source identifier (the shelfmark) is replacing the former title and siglum fields on the Source model.

Along the way, a number of other changes were made.

The 'base.html' template gained two new blocks, 'title' and 'script', and all the <title> and <script> tags were moved to their appropriate blocks
A new experimental approach to Private Collectors is introduced, by marking any given institution as a private collector. This still needs to be fleshed out, though.
The shelfmark field was added to Sources
A hacky migration script was added to convert most of the Source records to the new structure. This still needs adjusting for Private collectors
The forms and templates for the source edit and create were altered, including a new autocomplete for assigning the holding institution
Optimized a bunch of queries and rewrote a couple to remove some code
Removed the siglum column from the source chant browse

The biggest change, however, is the rework of the Feast detail page. The previous code was very slow to load the page, but this is now pretty fast through the use of two raw SQL queries to fetch the necessary data for the view. I've compared it with the numbers from the existing setup, and it seems to give the same results, but the page load time has gone from 3+ seconds to ~250ms, and 1,500 queries (in one case) to a pretty constant 8 (in total for the page load). It also fixed a few problems with this page, and displays all the genres associated with a given chant, instead of just one.

How to review this PR

Normally I don't like large PRs, but it seemed appropriate that, since I was touching so many places to rework sources that I could also improve these places as I went.

I've tried to explain the changes thoroughly in every commit message. I'm also posting a few screenshots to compare the old and new changes.

Outstanding issues

Still to-do is to figure out how to deal with Private Collections. You may notice this in some places where a source will have None -- this just means there is a source, but it hasn't been migrated to the Institution / Shelfmark model, so it's not displaying the right data. This should be fixed before this PR is merged.

Also not tackled is any search interface, specifically how it deals with looking up sources by "siglum." Still TBD, particularly if we want to not break any existing searches... Feedback is useful here.

Screenshots

Changes on the left, Current Production on the right.

Feast detail. Note the genre column, and the removal of the chant with no cantus ID.

Browse Source chants, with removed siglum column

Edit Source, with Holding institution and shelfmark

annamorphism commented 3 months ago

oo, shiny! Comments on the first screenshot (Github is not letting me embiggen the other two right now)

The "chants"/"incipit" columns seem to be flipped in the "Most frequent chants" table
I gather that the source "None None" is something to still be dealt with, but where has the Source Formerly Known as P-Pm 1151 gone? And why did the HR-Zmk and PL-WRu seemingly lose a chant each?

annamorphism commented 3 months ago

further screenshot commentary: -removing the siglum from the browse chant table looks great! excellent change. -Edit source page: I see it is now optional to give a source a siglum+shelfmark. If somebody leaves this blank, what happens? Could they create a whole bunch of sources all named "None None"? -Relatedly, what happens if the editor can't find their institution in the dropdown list (is this all the institutions in the database? in RISM?) Do they have to contact somebody to add it?

ahankinson commented 3 months ago

The "chants"/"incipit" columns seem to be flipped in the "Most frequent chants" table

Good catch! I've fixed it.

I gather that the source "None None" is something to still be dealt with, but where has the Source Formerly Known as P-Pm 1151 gone? And why did the HR-Zmk and PL-WRu seemingly lose a chant each?

I'm working from a slightly older DB export at the moment, so it may not be completely up-to-date. P-Pm has the title "Porto 1511", which didn't match the "city, institution, shelfmark" pattern, so it got skipped in the conversion. It shows up as None None because of this.

Edit source page: I see it is now optional to give a source a siglum+shelfmark. If somebody leaves this blank, what happens? Could they create a whole bunch of sources all named "None None"?

It's only optional during the transition, since we can't automatically move everything and leave nothing blank. Once we've migrated everything and it "goes live" the option to leave it blank will be removed.

Relatedly, what happens if the editor can't find their institution in the dropdown list (is this all the institutions in the database? in RISM?) Do they have to contact somebody to add it?

That's kinda up to you guys. Do you want to allow people to add their own institutions, or would you rather create them on request. (I suspect the latter.) The only weird bit is Private Collectors, but I'm still working on a proposal for that.

Thanks for the review!

ahankinson commented 3 months ago

OK, @dchiller @lucasmarchd01, I think this is ready for review.

All tests pass, except for a bunch in the Feast Detail View that I had to mark as 'skip' because there are problems with testing it with transactions (I explain this in the commit message).

Once this is available and merged to staging, then the "migrate_records" management command should move all the source records to the new regime.

Some more changes will need to be made to the source records to deal with the optional / required fields, but for now I've left them as-is.

I'm hoping that there will be some time with this up on staging so that the others (@annamorphism ?) can play with the new setup and send their comments about the functional parts.

Looking forward to your comments.

ahankinson commented 2 months ago

@dchiller @lucasmarchd01 pinging for review. This PR touches a lot so it would be good to get it merged or reviewed soon so that the conflicts don’t get out of control

dchiller commented 2 months ago

Sorry, I'll get to this today! Between vacation and wanting to have #1548 move through the pipeline before and separately from this, I've been avoiding...

ahankinson commented 2 months ago

Is this a standard approach? I understand that this can improve performance for complex queries and offer more control, but I think that they might be more difficult to maintain and understand compared to Django’s ORM. That, along with all the tests that still need to be skipped.

Yes, it is a standard approach. There's a whole page in the docs about when you need to do raw queries:

https://docs.djangoproject.com/en/5.0/topics/db/sql/

I did try alternate methods with annotate() but the ORM query got incredibly complicated that way. SQL itself is very stable (it has to be!). It requires knowing a bit of SQL, but I personally think the direct approach is easier than a very convoluted ORM query. I worked on this particular problem for a good 2-3 days, and in the end this was what I came up with.

The problem is that this doesn't actually map on to any real "Model" query. It's a feast, sure, but there is so much other information that gets pulled in from the sources and chants, and also statistics about those queries, that it doesn't really map onto a good ORM call.

I tried for quite a while to get the tests to work, but because the test and the actual query are in two different transactions, they don't see each other.

dchiller commented 2 months ago

Fixing some migrations fixed the failing tests!

lucasmarchd01 commented 2 months ago

I did try alternate methods with annotate() but the ORM query got incredibly complicated that way. SQL itself is very stable (it has to be!). It requires knowing a bit of SQL, but I personally think the direct approach is easier than a very convoluted ORM query. I worked on this particular problem for a good 2-3 days, and in the end this was what I came up with.

The problem is that this doesn't actually map on to any real "Model" query. It's a feast, sure, but there is so much other information that gets pulled in from the sources and chants, and also statistics about those queries, that it doesn't really map onto a good ORM call.

Makes sense, thanks for the clarification!

dchiller commented 2 months ago

The provenance and century detail pages have odd orderings now when changes to templates and views are combined. I'll make a new issue to match what we now do with notation views.

ahankinson commented 2 months ago

The preferred ordering was unclear to me. In many places it was done by siglum, so the Austrian (A-) sources would sort first, then the Belgian (B-) etc.

This is fine, but the format for the full heading doesn't actually contain the siglum, only the full institution name and the shelfmark.

I could imagine something like Cividale del Friuli, Archivio Capitolare (I-CF), LVI where the RISM siglum is in parentheses at the end of the institution name. This is what we do in RISM Online, e.g., https://rism.online/institutions/30000053, "Leipziger Stadtbibliothek - Musikbibliothek, Leipzig (D-LEm)".

Then the sorting of the list would make sense, even though the siglum isn't the first component of the title. If that is what is wanted, then "(I-CF) Cividale del Friuli, Archivio Capitolare" is probably best.

dchiller commented 2 months ago

The preferred ordering was unclear to me.

To me too, which is why I think it makes sense to separate. As it is now, it can just look totally random which probably doesn't make sense.

The second option you list (eg. (I-CF) Cividale del Friuli, Archivio Capitolare) is how it now looks on the notation detail page which I think makes sense.

annamorphism commented 2 months ago

The preferred ordering was unclear to me.

i believe this was discussed a bit here? https://github.com/DDMAL/CantusDB/issues/1227

dchiller commented 2 months ago

@ahankinson One last question, I think:

In the description above you write that there is still work to be done to figure out "private collections" and that this will lead to some sources showing up as None. Locally, I'm not actually running into any trouble with private collections at the source level...I'm running into them at the "Institution" level.

I'm getting institutions that look like this:

I'm also getting other weird institutions. For example, here's the US library of Congress:

I assume that this latter has to do with the way the names of sources were in the original data. If the intention is that we'll have to do this manual clean-up, all fine, I just want to make sure that's expected.

dchiller commented 2 months ago

i believe this was discussed a bit here? https://github.com/DDMAL/CantusDB/issues/1227

Yes, now that I look at it, the source list page suffers from this too now (ordering by something that isn't visible). I guess whatever we choose for the source list page should probably carry over to the other pages that list sources too. Title will no longer be an option so we'll have to update what we want in #1227.

ahankinson commented 2 months ago

A “title” doesn’t really make sense when talking about MSS… at least it’s not a universally recognized concept. The identifiers we can sort on are:

shelfmark (but we don’t want all the “MS 1” to sort together so that’s not useful
Institution name
City (but may not be known for private collectors)
Siglum

Personally I find sorting by country code, then city, and then institution name to be the most useful, since it groups all the mss from the same institution, and groups all institutions in one city. As long as it’s understood how they’re sorted then that’s probably the best way.

dchiller commented 2 months ago

Thanks @ahankinson for all of this!!

From my pov, we should merge this if you are good and this makes sense https://github.com/DDMAL/CantusDB/pull/1545#issuecomment-2228959105

ahankinson commented 2 months ago

I just started afresh and re-ran the migration. Here's what I get:

Did you run python manage.py migrate_records? The other one, migrate_institution_source_records, got so complicated that I rewrote it but kept the old version for reference.

dchiller commented 2 months ago

I just started afresh and re-ran the migration.

Interesting...I thought I did but let me try again.

dchiller commented 2 months ago

The same thing happened to me again... @lucasmarchd01 have you tried?

Update 1: I didn't realize that @ahankinson had done significant updates to the data on production to accompany this PR. Local and staging databases need to be updated with a recent production dump for this to work.

Update 2: There are just a few institutions that probably need to be "fixed" in the data. Should we before merging?

ahankinson commented 2 months ago

I don't see any of those institutions in my data...

dchiller commented 2 months ago

I don't see any of those institutions in my data...

Looks like all are recent (past month) additions. I'm gonna merge, we'll fix them in the data later.

ahankinson commented 2 months ago

I did a fresh export from the cantus database today (the export is currently in /home/fedora on the server) and dropped / re-created the database on my local machine. So I'm not sure how they got there...

cd /home/cantusdb/code/CantusDB/postgres/
docker compose exec -T postgres pg_dump -U cantusdb cantusdb > /home/fedora/cantusdb.sql
cd /home/fedora/
tar -zcvf cantusdb.tgz cantusdb.sql

Anything wrong with that?

annamorphism commented 2 months ago

Looks like all are recent (past month) additions. I'm gonna merge, we'll fix them in the data later.

I think they are all part of a dump of private collection fragments I'm working on that's still very much in draft form! So that's me.

ahankinson commented 2 months ago

But... if they're not in the database dump I literally took from the production server today, where did they come from? Or am I getting the dump from the right place?

Edit: OK, I see them on production. They will need to be fixed post-migration.

dchiller commented 2 months ago

I did a fresh export from the cantus database today (the export is currently in /home/fedora on the server) and dropped / re-created the database on my local machine.

Looks like we have this figured out, but for reference, a new database dump is created every day in /home/cantusdb/backups/ so you don't need to do it yourself if you want the whole database.

ahankinson commented 2 months ago

OK!

DDMAL / CantusDB