Change source/manuscript IDs to match the IDs in Cantus Database

jacobdgm commented 2 years ago

This is related to Issue 429 on the CantusDB GitHub page, but I figured it would be a good idea to open the issue here, where the impactful change would actually occur.

Since Cantus Ultimus is more or less a nice user interface on top of the data in CantusDB, it might make sense to change the manuscript identifier numbers in Cantus Ultimus to match those in CantusDB - for example, CH-E 611 is currently 74 in CU, but 123606 in CD. This would make it easy to link to a manuscript from CantusDB, and it would likely make integration between the two sites simpler in other situations too.

(I have not looked through all of the issues that are open on this repository, so if this is a duplicate, feel free to close it)

dchiller commented 2 years ago

@jacobdgm

The little note I found in the current manuscript import script in CU suggests that Cantus DB implements an API -- do you know of any docs about this? I realize that your docs also might be about new cantus, and that old cantus might be different.

jacobdgm commented 2 years ago

I'm not aware of any documentation for OldCantus's APIs. There is some documentation on their implementation in NewCantus on the CantusDB Wiki - please let me know if this can be improved in any way. And also let me know if there's additional information you would like to have provided by an API - if it's a JSON API, I should be able to add keys without breaking things, and I can also put together something new if it's needed.

dchiller commented 2 years ago

Well, it just seems to me that an API would be the ideal way for CU to get the manuscript data from CantusDB...it makes much more sense to me to request a JSON document from CantusDB than to scrape the HTML as we are currently doing.

dchiller commented 2 years ago

In which case, I think there are two stages to this:

1. Modify the Cantus Ultimus manuscripts to match CantusDB IDs now -- maybe just through a data dump or a one-off JSON document.
2. Modify Cantus Ultimus's import capability to get manuscript data from NewCantus so that when it's online, we can pull new manuscripts directly from CantusDB.

Thoughts?
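
A minimal sketch of what the first stage could look like, assuming the one-off document lists each CantusDB source with its ID and siglum, and that CU manuscripts can be matched to CantusDB sources by siglum. The record layout and the `build_id_map` helper are hypothetical; only the CH-E 611 numbers come from this thread.

```python
# Hypothetical one-off remap of Cantus Ultimus manuscript IDs to CantusDB IDs.
# Assumption: both sides expose records with "id" and "siglum" keys, and the
# siglum is a reliable join key (it may not be unique in practice).

def build_id_map(cu_manuscripts, cantusdb_sources):
    """Return ({old CU id: CantusDB id}, [sigla with no CantusDB match])."""
    db_id_by_siglum = {s["siglum"]: s["id"] for s in cantusdb_sources}
    id_map = {}
    unmatched = []
    for ms in cu_manuscripts:
        new_id = db_id_by_siglum.get(ms["siglum"])
        if new_id is None:
            # Leave unmatched manuscripts for manual review instead of guessing.
            unmatched.append(ms["siglum"])
        else:
            id_map[ms["id"]] = new_id
    return id_map, unmatched


cu = [{"id": 74, "siglum": "CH-E 611"}]
cd = [{"id": 123606, "siglum": "CH-E 611"}]
id_map, unmatched = build_id_map(cu, cd)
print(id_map)  # {74: 123606}
```

Returning the unmatched sigla separately keeps the remap from silently dropping manuscripts that exist in CU but not (yet) in CantusDB.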

fujinaga commented 2 years ago

> Well, it just seems to me that an API would be the ideal way for CU to get the manuscript data from CantusDB...it makes much more sense to me to request a JSON document from CantusDB than to scrape the HTML as we are currently doing.

Yes, I agree.

dchiller commented 1 year ago

@jacobdgm

I've looked a little more in the CantusDB endpoints, and it looks like the json-node/<source_id>/ endpoint will give us what we need.
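
For illustration, a request to that endpoint might be sketched as follows. The host is the one used for the CantusDB source list elsewhere in this thread; the helper names are hypothetical, and the keys of the returned JSON are not assumed here.

```python
import json
from urllib.request import urlopen

# Host taken from the CantusDB sources URL mentioned in this thread.
CANTUSDB_BASE = "https://cantus.uwaterloo.ca"


def json_node_url(source_id, base=CANTUSDB_BASE):
    """Build the json-node URL for one source ID."""
    return f"{base}/json-node/{source_id}/"


def fetch_source(source_id, base=CANTUSDB_BASE, timeout=30):
    """Fetch and decode the JSON document for one source (network call)."""
    with urlopen(json_node_url(source_id, base), timeout=timeout) as resp:
        return json.load(resp)


print(json_node_url(123606))
```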

I have two questions/comments/confirmations:

1. There doesn't seem to be an endpoint that returns all the IDs of a certain type (e.g. all the sources in CantusDB), and a query to the /sources/ URL appears to return an HTML document. In other words, it looks like I would currently still need to parse HTML to get all the source IDs in the first place. Does that match your understanding?
2. There is a note in the documentation of the json-node API about potential future changes. It doesn't seem to me that this would cause a major issue (e.g. maybe down the road we need to change the URL path or something, but nothing that would make the API unusable for this purpose). Does that seem correct based on your understanding?
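
To make the first question concrete, this is roughly what collecting source IDs from the HTML would involve. It is only a sketch: the `/source/<id>` link pattern is an assumption about the page markup, and the sample HTML is invented for the example.

```python
import re
from html.parser import HTMLParser


class SourceLinkParser(HTMLParser):
    """Collect numeric IDs from links that look like /source/<id>."""

    # Assumed link pattern; the real page markup may differ.
    SOURCE_HREF = re.compile(r"^/source/(\d+)/?$")

    def __init__(self):
        super().__init__()
        self.source_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = self.SOURCE_HREF.match(href)
        if match:
            source_id = int(match.group(1))
            # Keep first-seen order and skip repeated links to one source.
            if source_id not in self.source_ids:
                self.source_ids.append(source_id)


def extract_source_ids(html_text):
    parser = SourceLinkParser()
    parser.feed(html_text)
    return parser.source_ids


sample = (
    '<table><tr><td><a href="/source/123606">CH-E 611</a></td></tr>'
    '<tr><td><a href="/about/">About</a></td></tr></table>'
)
print(extract_source_ids(sample))  # [123606]
```

Any change to the page layout would break this kind of parsing, which is the main argument for a JSON endpoint.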

jacobdgm commented 1 year ago

> There doesn't seem to be an endpoint that returns all the IDs of a certain type (e.g. all the sources in CantusDB), and a query to the /sources/ URL appears to return an HTML document. In other words, it looks like I would currently still need to parse HTML to get all the source IDs in the first place. Does that match your understanding?

Yes, this sounds right. This is actually the main chokepoint in syncing our data with OldCantus - our documentation on how to do this involves connecting to the OldCantus server and running SQL commands, e.g. "To obtain a list of all sources' IDs, run SELECT nid FROM node WHERE type='source'; in mysql on the old Cantus server."

A much better approach would be to have a /sources-list/ (or something similar) API that lists the IDs associated with sources. Is there anything we would want in this API other than a list of IDs? (I should ask Jan to set this up for OldCantus, come to think of it - it would simplify things quite a bit)

> There is a note in the documentation of the json-node API about potential future changes. It doesn't seem to me that this would cause a major issue (e.g. maybe down the road we need to change the URL path or something, but nothing that would make the API unusable for this purpose). Does that seem correct based on your understanding?

I don't think it would cause a major issue, no. We'd just have to set up a URL for exporting sources that's different from the current, export-anything URL.

dchiller commented 1 year ago

> A much better approach would be to have a /sources-list/ (or something similar) API that lists the IDs associated with sources. Is there anything we would want in this API other than a list of IDs? (I should ask Jan to set this up for OldCantus, come to think of it - it would simplify things quite a bit)

I don't think so, because once I have the ID, I feel like I can get anything else I need from the /json-node/ endpoint. I guess if you returned source IDs and other source information at once it would mean fewer API calls from CU, but I'm not sure it's worth the effort -- from the CU perspective we'd only call the API initially and when sources change.

> I don't think it would cause a major issue, no. We'd just have to set up a URL for exporting sources that's different from the current, export-anything URL.

Perfect!

jacobdgm commented 1 year ago

I mentioned this at the lab meeting - we definitely want to renumber sources in Cantus Ultimus to match those in Cantus Database. Once I finish the main project I have on the go - testing NewCantus staging and putting it up on production - I'll set up an API for this.

dchiller commented 1 year ago

Since my plan is to implement this change in the coming days, I'm going to restate my approach here, since it is slightly different from the one in my summary above.

[ ] Modify the current CU manuscript import process to make use of IDs from CantusDB, obtaining these IDs from the source list HTML in CantusDB (https://cantus.uwaterloo.ca/sources)
[ ] Once a source list API is available, adjust the import process to pull from the API rather than the HTML.
[ ] Once this new version is on production, re-import manuscript data using this new process so that production site manuscripts have new IDs.

jacobdgm commented 1 year ago

> Once a source list api is available...

I was about to start writing one fresh, but saw we already have a json-sources API. You can find it on OldCantus, Production and Staging, and there's a bit of documentation on the CantusDB Wiki. NewCantus's implementation returns only published sources; I believe OldCantus does the same. Is there anything that you'd want an API to do that this one doesn't already do?
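
As a rough sketch of how an import could use this, assuming (and this is only an assumption about the response shape, not taken from the wiki) that json-sources returns an object with a top-level "sources" key whose keys are source IDs as strings. The helper names and the stand-in fetch function are hypothetical.

```python
# Assumed json-sources response shape: {"sources": {"<source id>": {...}, ...}}.
# Check the CantusDB Wiki for the real structure before relying on this.

def source_ids_from_json_sources(payload):
    """Return the source IDs listed in a decoded json-sources response."""
    sources = payload.get("sources", {})
    return sorted(int(source_id) for source_id in sources)


def import_sources(payload, fetch_source):
    """Fetch details for every listed source via a caller-supplied function."""
    return {sid: fetch_source(sid) for sid in source_ids_from_json_sources(payload)}


# Invented sample payload; only the ID 123606 appears in this thread.
sample_payload = {"sources": {"123606": {}}}

# A stand-in fetcher so the example runs without network access.
imported = import_sources(sample_payload, lambda sid: {"id": sid})
print(imported)  # {123606: {'id': 123606}}
```

Passing the fetch function in keeps the listing step separate from the per-source request, so either endpoint can change without touching the other half.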

dchiller commented 1 year ago

Not sure why we didn't see that one before.... but yeah, that works.

dchiller commented 1 year ago

So now:

[x] Modify the CU import process to collect data from the CantusDB API and make use of CantusDB IDs.
[ ] Once this new version is on production, re-import manuscript data using this new process so that production site manuscripts have new IDs.

DDMAL / cantus

Change source/manuscript IDs to match the IDs in Cantus Database #652