Closed rob-c closed 8 years ago
_setDirty(False)
has not worked since at least 2009 so we really shouldn't be using it anywhere. In the future I'd much prefer either using setDirty
/setFlushed
or a @property
so we can just do obj.dirty = True
/obj.dirty = False
.
It's possible that there's an issue at the moment that if you set a schema attribute to the same value it already contains that it will still set it dirty which it probably should not be.
I'm seeing that empty objects are being marked dirty which needs addressing.
wrt the _setDirty(False)
I plan to remove this in that case, promising, or allowing this is a bad idea if it doesn't work. I'm preparing a short PR which doesn't load jobs it shouldn't.
Just to say that I think the mail that sparked this on the DAST list may have been a red herring as I suspect this may have been due to missing index cache files instead. The warnings given by the user are actually on startup not shutdown and at the top, you can see that it's complaining about a missing data
file. It would certainly try to load at this point if the index cache file wasn't there and I suspect that's what's causing the problem.
Out of interest, what's the problem about marking an Empty Object dirty? Doesn't it only become a problem if the object is attached to a Registry/Repo which an Empty Object isn't? Therefore, you can happily change stuff on an Empty Object, it gets marked dirty but that doesn't matter as it will never be picked up for flushing (or loaded for that matter).
There would be a problem if you have an object attached to a registry that is lazy loaded (so I guess just Jobs) that has been marked Dirty BUT hasn't been fully loaded yet. I'm assuming this can't happen for jobs though I haven't gone through the code to specifically check.
Dirty empty objects will get picked up for flushing because the registry is populated with empty objects capable of triggering a load from disk when interrogated for a schema attribute they don't have.
OK, with regard to indexes... It is possible (although difficult) to avoid re-generating missing indexes on startup/shutdown. There is a master index which reduces the time taken to load all of the indexes separately for an registry with many objects which could potentially be modified to reduce the amount of jobs it requires to be loaded and I'll look into this tomorrow. If the index is missing Ganga will now always attempt to fully load the job for future to avoid having to load it again in the future. In the past we handled indexes badly so I suspect 'problems' some users see is the effect of the Registry auto-repairing itself.
As regards indexes (indices? whatever :smile: ) I think the problem the user was seeing was actually Ganga working as intended. It's not perfect but to be expected if users keep 2 year old jobs lying around. I think we can forget about that at least.
As regards the EmptyGangaObject
thing, I think I understand this now. Is it that if a job fails to load because of plugins not being present, it's replaced by an EmptyGangaObject
which you then never want to mark dirty because it might overwrite the original Job object? If so, can we just overload the _setDirty
method for EmptyGangaObject
and have it not set the dirty flag?
I somehow think yes, there's little we can do to fix that particular index behavior. Although a nice little warning of what's going to trigger a load on shutdown may be nice. Again this requires tracking if an object has been loaded much more clearly than we currently do.
After looking at the code the EmptyGangaObject
class is never used anymore I think over creating an empty instance of the class which has been stored in the Registry.
(An EmtpyGangaObject
used to be created which was then mocked up to look like the object in question which caused some level of problems due to member variables and such not being fully initialized in all cases).
As such it's not possible to easily disallow the object from being marked dirty without storing knowledge of whether the object has been fully loaded from disk within the object.
Ultimately I'd like to redesign all of this store a _inMemory
boolean in each GangaObject.
This would default to True when an object is created in memory and is set to false when an empty instance is created by the Registry.
This would mean the __get__
method can more cleanly ask for a job to be loaded if required rather than by inspecting the object itself (which is currently done).
(I'm much more of a fan of moving to make explicit calls to the Registry to load
an object from the Registry if needed rather than this current mechanism of _getReadAccess
and _getWriteAccess
which confuses the matter somewhat. I'm not sure if the read/write distinction was important in the past but due to SessionLock
I'm not sure we can properly have read access to the same job from multiple ganga sessions.)
I think this has been fixed in some PR or other, my concern was around jobs being instantly dirty after being loaded as well as jobs being created even when not correctly loading from disk. If this comes back to be a problem I'll re-open this.
LHCb users are seeing what I saw that some objects are marked as dirty when they should never be loaded.
The API for the
_setDirty
also doesn't respect_setDirty(False)
which is a slight annoyance. If we're changing to use_setDirty()
and_setFlushed()
we should not respect the old API.I'm tracking down what is causing some jobs to be loaded on shutdown which we thought was fixed at some level. (It appears this may only be fixed for simple jobs)