TOMToolkit / tom_base

The base Django project for a Target and Observation Manager
https://tom-toolkit.readthedocs.io
GNU General Public License v3.0

Some Custom Targets require excessive memory allocation #966

Closed (rachel3834 closed this issue 1 week ago)

rachel3834 commented 1 week ago

Describe the bug

I migrated my TOM to use a custom Target model, and I'm now using the new cone search functions to associate targets that were ingested under different names. This process picks one target as 'primary', then combines all available data products, observations, etc., as well as the custom target attributes, adding an alias name instead of creating a duplicate target.

For the overwhelming majority of cases this process has worked fine, but in two or three cases (out of >11,000) the resulting 'merged' target causes issues (502 Bad Gateway) when loading the TargetDetailView or even the admin page for that object. The stern logs report that the OOM killer has intervened.

In general, this leaves little debugging information to go on, other than the fact that the process seems to be dying when it is processing some of the custom target fields (formerly extra_params).

So it seems that some of the data in the custom target fields is corrupted or didn't convert well. That's likely to be specific to my TOM system, but I thought it would be useful to record the experience in an issue.

jchate6 commented 1 week ago

@rachel3834 were all of these problematic objects the result of merged targets? Do you have a few examples that we could look at from before/after merging and conversion from target extras?

rachel3834 commented 1 week ago

After further investigation, I found the problem.
One of our custom parameters stores the covariance matrix from our lightcurve model fit. Normally this is a small 2D numpy array, but apparently there is some kind of failure mode where the output of this parameter is an extremely long string consisting almost entirely of "////////". Exploring the data for affected events through the management shell, I found these strings were 234881290 characters long, or ~112MB for a single parameter!
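A minimal sketch of how such runaway values could be spotted, assuming the field values of one target are already available as a plain dictionary. The helper names, the field names, and the 100,000-character threshold are all hypothetical and not part of the TOM Toolkit.

```python
import json


def serialized_length(value):
    """Return the number of characters in the stored text form of a value."""
    if isinstance(value, str):
        return len(value)
    # Non-string values are measured by their JSON text (assumption: JSON storage).
    return len(json.dumps(value, default=str))


def find_oversized_fields(field_values, max_chars=100_000):
    """Return {field_name: length} for every value longer than max_chars."""
    oversized = {}
    for name, value in field_values.items():
        length = serialized_length(value)
        if length > max_chars:
            oversized[name] = length
    return oversized


# Stand-in data: a small covariance matrix and a runaway string of slashes.
example = {
    "covariance": [[1.0, 0.1], [0.1, 2.0]],
    "bad_covariance": "/" * 250_000,
}
report = find_oversized_fields(example)
print(report)
```

Run over each target's custom fields, a check like this only reports field names and lengths, so the oversized value itself never has to be printed or rendered.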

This parameter used to be stored in a string format as an extra parameter, but was converted to a JSON field during the migration to custom objects. I don't know if the bad string entries were caused by the original modeling code producing bad output, or the conversion process failing to parse bad input.
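Whichever step produced the bad strings, a guard at conversion time would keep them out of the JSON field. The sketch below assumes the old string value was JSON-like text for a square matrix, which the thread does not state; `parse_covariance` and its `max_chars` limit are hypothetical names chosen for illustration.

```python
import json


def parse_covariance(raw, max_chars=10_000):
    """Parse text into a square 2D list of floats, or return None if implausible."""
    # Reject non-strings and anything far larger than a small matrix could be.
    if not isinstance(raw, str) or len(raw) > max_chars:
        return None
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(value, list) or not value:
        return None
    n = len(value)
    for row in value:
        # Every row must be a list with as many entries as there are rows.
        if not isinstance(row, list) or len(row) != n:
            return None
        for x in row:
            if isinstance(x, bool) or not isinstance(x, (int, float)):
                return None
    return [[float(x) for x in row] for row in value]


print(parse_covariance("[[1, 0.5], [0.5, 2]]"))
print(parse_covariance("/" * 20_000))
```

Returning `None` for anything implausible lets the migration store an empty value and log the target for later inspection, rather than copying the bad string across.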

Either way, this seems to be very specific to my particular TOM, so it's not an issue with the Toolkit. I'm recording this in case anyone else runs into a similar error. For the record, the problem can be resolved by opening the manage.py shell, fetching the affected target, manually resetting the parameter value to something reasonable, and saving the target.
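The reset described above can be sketched as a small helper. This is an illustration under stated assumptions: `reset_oversized_field`, the `covariance` field name, the threshold, and the `custom_code.models.MyTarget` import in the comments are all hypothetical. The helper only relies on the object having a `save()` method, as Django model instances do, so a stand-in class is used here to keep the example runnable outside a TOM.

```python
def reset_oversized_field(target, field_name, replacement=None, max_chars=100_000):
    """Replace an oversized string attribute and save; return True if changed."""
    value = getattr(target, field_name, None)
    if isinstance(value, str) and len(value) > max_chars:
        setattr(target, field_name, replacement)
        target.save()
        return True
    return False


class FakeTarget:
    """Stand-in for a model instance so the sketch runs without Django."""

    def __init__(self, covariance):
        self.covariance = covariance
        self.save_calls = 0

    def save(self):
        self.save_calls += 1


broken = FakeTarget("/" * 200_000)
print(reset_oversized_field(broken, "covariance"), broken.covariance)

# Inside `python manage.py shell` the same helper might be used like this
# (model path and target name are hypothetical):
#   from custom_code.models import MyTarget
#   target = MyTarget.objects.get(name="affected-target")
#   reset_oversized_field(target, "covariance")
```

If fetching the object is itself too memory-hungry, Django's `QuerySet.update()` can set a field without loading the row into Python, for example `MyTarget.objects.filter(pk=...).update(covariance=None)`, assuming the field allows nulls.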