NASA-IMPACT / admg-backend

Updated migrations to handle unparseable data #534

Open edkeeble opened 1 year ago

edkeeble commented 1 year ago

Also reconvert data after switching columns back to JSONField.

This PR addresses a couple of issues with the previous attempt to store CMR data as JSON:

  1. Data in certain fields was not being converted correctly.
  2. The previous migration first converted the data to JSON strings and stored them in the existing TextField columns, then switched those columns to JSONField, but it never reprocessed the values into objects, so all of the data remained stored as strings (see the migration sketch below).
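
A minimal sketch of that reconversion step, assuming a RunPython data migration over the DOI model; the app label, field list, and function name below are placeholders, not the exact code in this PR:

import json

from django.db import migrations

CMR_JSON_FIELDS = ["cmr_dates", "cmr_data_formats"]  # placeholder field list


def reconvert_cmr_fields(apps, schema_editor):
    DOI = apps.get_model("api_app", "DOI")  # app label is an assumption
    for doi in DOI.objects.all():
        changed = False
        for field in CMR_JSON_FIELDS:
            value = getattr(doi, field)
            if isinstance(value, str) and value:
                # The old TextField left a JSON *string* behind; parse it so
                # the JSONField column holds real objects.
                setattr(doi, field, json.loads(value))
                changed = True
        if changed:
            doi.save(update_fields=CMR_JSON_FIELDS)


class Migration(migrations.Migration):
    # should depend on the migration that switched the columns to JSONField
    dependencies = []
    operations = [migrations.RunPython(reconvert_cmr_fields, migrations.RunPython.noop)]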

I've run this on a dump of the production DB and it works correctly, with the exception of one record, which has an unparseable "false" value. My recommendation is to restore the staging DB from a dump of the production DB prior to deploying this and manually update that record, since it appears to be an anomaly:

# In the Django shell: inspect the one affected record and overwrite its
# cmr_dates value by hand.
d = DOI.objects.get(uuid='6a8e809b-c522-4029-a6b0-e28348a6bcf1')
d.cmr_dates
'[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": false}]'
d.cmr_dates = '[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": False}]'
d.save()

Also of note: the cmr_data_formats column contains inconsistently formatted data. In some cases, it stores an array of strings and in others it stores a single string. The updated migration will force a single string to be stored as an array, which I think makes sense given the name of the column.
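
A rough sketch of that coercion (the helper name is hypothetical and the actual migration code may differ):

def normalize_data_formats(value):
    # cmr_data_formats sometimes holds a bare string and sometimes a list of
    # strings; force everything into list form so the column is consistent.
    if value is None or value == "":
        return []
    if isinstance(value, str):
        return [value]
    return list(value)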

alukach commented 1 year ago

the cmr_data_formats column contains inconsistently formatted data. In some cases, it stores an array of strings and in others it stores a single string.

Do you feel like you have a solid understanding of how we are getting ourselves into these types of situations? Shouldn't we also add some logic into the application to ensure that we don't find ourselves back in this same situation in the future?

edkeeble commented 1 year ago

Yeah, that's a good point. I haven't updated the DOI Matching process to alter the data we insert into the DB. At a glance, the DOI Matcher still serializes all of the data. I'll check whether that matters in this context or whether it's expected, since the DOI Matcher creates Change objects and stores everything in the update field.

The only references I see to cmr_data_formats suggest that it is always stored as an array/list, so I'm not sure how we ended up with plain strings in the DB. We could check the CMR API to see whether the issue originates at the source.

alukach commented 1 year ago

My recommendation is to restore the staging DB from a dump of the production DB prior to deploying this and manually update that record, since it appears to be an anomaly:

d = DOI.objects.get(uuid='6a8e809b-c522-4029-a6b0-e28348a6bcf1')
d.cmr_dates
'[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": false}]'
d.cmr_dates='[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": False}]'
d.save()

This confuses me a bit. What is this encoding? false seems like a fine JSON boolean, but then all the keys are double-escaped. Any idea how we got here?

edkeeble commented 1 year ago

Any idea of how we got here?

The double escaping is the result of the issue around the model form passing a string to json.dumps (https://github.com/NASA-IMPACT/admg-backend/issues/389). I was mistaken in thinking I had to correct that false value, though. What I actually need to do is strip the escape characters before attempting to load the JSON string in the migration; with that change, this value converts properly.
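
Concretely, the unescaping step could look something like this (a sketch with a made-up helper name; the real migration may need to handle more cases):

import json

def load_escaped_json(raw):
    # Values written through the old form path contain literal \" sequences,
    # which a plain json.loads call rejects. Strip the escaping and retry.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(raw.replace('\\"', '"'))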