remap_user_ids command is not working properly

jacobdgm commented 10 months ago

I tried running our new remap_user_ids command on staging, and upon asking @annamorphism to verify that things ran properly, she pointed out several issues:

http://206.12.88.113/source/713037 now has no Contributor (compare to https://cantusdatabase.org/source/713037)

http://206.12.88.113/source/711225 - My Project Manager test account has been added as a contributor on this and perhaps other sources. This is clearly an error. (compare to https://cantusdatabase.org/source/711225)

One thing to do before testing this again is to download a copy of the database from Production and put it on Staging before testing again. (I hadn't done this, thinking the data was probably close enough to up-to-date. But relevant users have been making changes lately, so this should really be updated.)

jacobdgm commented 10 months ago

The path forwards:

We need to deploy certain changes currently on Staging to Production as soon as we reasonably can, so I think we should disable this script to ensure it doesn't get run on Production, and return to it sometime in the coming weeks. I'll open an issue for this.

lucasmarchd01 commented 10 months ago

I ran this script locally and found that (at least for the two sources mentioned in this issue) source 713037 has a contributor this time, and Jacobs project manager test account was not added to source 711225. I also went through the code in debug mode and everything seemed to be working as expected.

It's possible that the issue has to do with data discrepancies between production and staging. The next step will be to use a fresh copy of the database from production.

lucasmarchd01 commented 10 months ago

I have just run the remap_user_ids script on staging. From what I can tell, it has worked as expected. Can @jacobdgm or @annamorphism confirm this?

jacobdgm commented 10 months ago

My working hypothesis is that within the past few weeks, I changed the contributor of http://206.12.88.113/source/711225 to something nonsensical (i.e. my "Project Manager" account) on staging, since it's staging and it doesn't really matter what happens with the data on staging, and then forgot about it. If this is the case, the wonky contributor assignment of this source had nothing to do with the script, and it's been overwritten since Lucas overwrote the database with a copy of what's on Production.

It doesn't explain the contributorless sources, though. Maybe the script didn't run to completion when I first ran it on the staging server?

annamorphism commented 10 months ago

So, the correct users seem to be showing up now! This is good. Is it possible the results above were somehow the "last updated by" showing up where we wanted "Creator", and that this has now been fixed? Jacob is still the "last updated by" for several sources (as shown in Content Overview) for whatever reason. (for that matter, I'm also the updater for some things I don't remember touching....I'm assuming every save of the edit page is counted as an "update" even if no changes were recorded. But I don't really think that matters.)

jacobdgm commented 10 months ago

I'm assuming every save of the edit page is counted as an "update" even if no changes were recorded.

Yes, this is how it currently works.

Is it possible the results above were somehow the "last updated by" showing up where we wanted "Creator", and that this has now been fixed?

We have not recently made any changes to this policy, so this seems implausible.

How do we feel about the current state of things, @annamorphism, @lucasmarchd01 and @dchiller? I feel hesitant to re-enable this function to eventually run it on Production - while the command seems to be working properly, we haven't made any changes from when it didn't seem to be working properly. It would be nice if we had a record of the data on Staging before and after running the command, to ensure that the data previously on staging was indeed funky.

At the very least, we should save a backup of the database on Production immediately before running this command.

dchiller commented 10 months ago

How do we feel about the current state of things, @annamorphism, @lucasmarchd01 and @dchiller?

I think I'm not totally clear on the "current state of things." That running it the first time led to weird results but when it ran again things seem to be good?

At the very least, we should save a backup of the database on Production immediately before running this command.

Yes, definitely.

This may not be possible, but one thing that might help is if you did a data dump from production, and updated staging with it so that the users will be un-mapped again on staging so you can test it out again. You'd also then have a data dump from production.

jacobdgm commented 10 months ago

I think I'm not totally clear on the "current state of things." That running it the first time led to weird results but when it ran again things seem to be good?

Sorry, I could have been clearer. Yes, this is exactly what I meant.

This may not be possible, but one thing that might help is if you did a data dump from production, and updated staging with it so that the users will be un-mapped again on staging so you can test it out again. You'd also then have a data dump from production.

This has the makings of a good idea, but I'm not sure I fully understand what we would hope to accomplish by loading the before-database onto staging. If we had questions about how the command had performed when it was run on production, we could always load it onto a developer's local version of CantusDB. Loading it onto staging would allow others (like @annamorphism, who doesn't have a local copy of CantusDB) to debug - is this the purpose?

dchiller commented 10 months ago

Thanks for the clarification -- no, my turn to be less than clear.

If we had questions about how the command had performed when it was run on production, we could always load it onto a developer's local version of CantusDB.

Yes, this is what I'm getting at, so a local version works just as well from my perspective. The point is if it runs successfully on this local version that mirror's production (so with the users pre-mapping as it were) you'd have another data point that what happened the first time on staging was a fluke (and you'd have the back-up from production you mentioned ready to go).

jacobdgm commented 10 months ago

so you're suggesting we do a database dump on Production, load it into a local instance (or Staging, same difference), run the command and verify it ran correctly, and then run it on Production immediately since the data will be almost identical?

dchiller commented 10 months ago

Yes, just as an added pre-check.

annamorphism commented 1 month ago

This problem has reappeared (or maybe was never fully fixed? unclear.) Some sources are still showing up with mystery contributors.

I checked all the sources in Helsinki, because there are a useful mix of sources made at various points in our migration. What's odd is that there are sources from November 2022 with the right user as well as ones with the wrong one. The correct ones are also published, but that happened in May, so unless something happened over the summer that reverted unpublished sources but not published ones I'm hard-pressed to explain the situation. (I experimentally published one of the problem sources to see if it would magically re-assign to the correct user. It did not, and I have since unpublished it again.)

dchiller commented 1 month ago

Should Lotta be the "creator" for all of these? I'm wondering if all of the Lotta-created ones are correct and the ones that are incorrect had a different creator? Or are some Lotta-created sources correctly attached to Lotta and some not.

dchiller commented 1 month ago

Also, I notice that you are the "last modified by" for the correct/published ones. Is it possible that you just corrected this field manually?

annamorphism commented 1 month ago

Also, I notice that you are the "last modified by" for the correct/published ones. Is it possible that you just corrected this field manually?

I don't think I can modify the 'created by' field....

dchiller commented 1 month ago

I don't think I can modify the 'created by' field....

Ah, right...that would make sense.

annamorphism commented 1 month ago

Should Lotta be the "creator" for all of these? I'm wondering if all of the Lotta-created ones are correct and the ones that are incorrect had a different creator? Or are some Lotta-created sources correctly attached to Lotta and some not.

The Lotta-created ones are all correctly attached. But some Hilkka-Liiisa sources are correctly hers (like F.m.I.26) and others are not. And she shows up correctly in edits; if I remember the discussions last winter, it was specifically the "creator" field that was not correctly changing to the 'new' user.

dchiller commented 7 hours ago

Ok, I have some updates into my investigations of this:

1) I just ran the command as is locally, and everything seemed to proceed as expected. The two "problematic" sources mention at the top of this issue seem to be correctly mapped. Jacob's account does not appear as a contributor to 711225 and 713037 now has "Emma Leidy" as a contributor. This seems to agree with an earlier comment by Lucas that the command ran successfully for him locally.

2) I think I have an explanation for what we see in @annamorphism's spreadsheet above: a) As hypothesized in the discussion above, this command was never run on production. This explains why many sources do not seem to have been mapped and seems to agree with the most recent changes to the command, which had the warning that it was too buggy to run. b) The contributors of the two published sources (711225 and 712830) were corrected in the process of publication manually, rather than through the mapping command. In the edit history of those sources, we find records of Debra changing the contributor manually. In both cases, on December 15th, 2023, Debra changed the created_by field. Prior to this manual change, the contributor was incorrect (and therefore agrees with the hypothesis in 1 that the command was never run on production). c) The last three sources in the screenshot (1000012, 1000041, and 1000042) have the correct created_by user because those sources did not exist on Old Cantus. Since they were created on New Cantus, the created_by field was populated correctly at the time they were created. We can tell they were created on New Cantus because they have ID's > 1000000. I also verified this by searching through the Old Cantus data for these shelmarks and can't find them. In sum, we can explain the current state of these sources by the fact that the command never run, in the course of publishing the published sources the created_by field was manually corrected, and the only sources affected were exactly those that were originally created on Old Cantus.

3) A few mysteries remain: a) The two published sources (711225 and 712830) were created on Old Cantus by the same user (user number 251612; intials: HV). Had the command been run successfully, they would have been mapped to this user's New Cantus user ID (251662). However, they were manually corrected to have different "Contributors". In this case, 711225 has the "correct" contributor according to what's in Old Cantus, while 712830 has the "incorrect" contributor according to Old Cantus. I suspect this is because Debra, who did the manual correcting, had a reason for assigning them like this; this simply gives another reason to think the automatic mapping never occurred (and that we might need to make manual adjustments to the results of any automatic mappings in the event there are other cases like this). b) How did Debra change the created_by field!?! I can't seem to do it and one shouldn't be able to...

4) If we go back to the source that led to this discovery in September (712970), we see that on production its current contributor is ME (251620). When the command is run locally, this source gets mapped to a contributor (EL; 251669. A look at the source code for the command suggests this is intended: https://github.com/DDMAL/CantusDB/blob/cc14cf5f1d9b78a27eb40bc3fff1a61c6c689367/django/cantusdb_project/main_app/management/commands/remap_user_ids.py#L31 So, I guess the question is: does this seem right?

annamorphism commented 6 hours ago

OK, so the manual change (point 2b) explains my confusion about why some older source did seem to have correct users (though I am likewise mystified about how it was possible to do, point 3b.) The mapping seems correct in point 4.

dchiller commented 6 hours ago

Solved 3b! It turns out that on December 7th, 2023, @lucasmarchd01 fixed a bug whereby the created_by field was not read only for sources (see https://github.com/DDMAL/CantusDB/commit/1f52b5ca576c2bfb300c9376f55287266e6d2c60).

This was only merged to production on January 11, 2024 (see https://github.com/DDMAL/CantusDB/pull/1232/files)

So at the time Debra made this change, it was possible to do this.

dchiller commented 6 hours ago

Given all this, I feel fairly sure that:

anything that has been manually fixed on production is fine even if you technically aren't allowed to do that any more (the only people who could do it are Debra-type people, so we can assume that they know what they're doing).
the command works well enough as expected that we should try running on staging again, at least to determine what, if any, these unnamed bugs were that caused us to put it aside in November 2023.

@annamorphism @lucasmarchd01 agreed?

annamorphism commented 6 hours ago

I think this sounds good!

dchiller commented 4 hours ago

It looks to me like it ran successfully on staging (at least for sources; chants are still running). It is taking quite a long time though, so I wonder if what happened originally is that it didn't run to completion. The command could be significantly more efficient (we don't need to iterate through all chants, we just need to iterate through the chants or sources created by the few users affected), so we might want to make that change and try again.

DDMAL / CantusDB

remap_user_ids command is not working properly #1165