Open Zehvogel opened 2 months ago
One general approach to solving this that piques my interest is a "tree-shaking" algorithm. The user would specify which collections must be output. Podio then starts by marking these collections as "live". It follows all associations backwards, across all collections, marking everything it encounters as live. Anything not live at the end does not get written. This would strike a good balance between saving space and preserving associations and hence data integrity. If the runtime cost is too high, we could reduce it by having the user specify exactly which collections need to be pruned.
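The marking pass described above can be sketched as a plain graph traversal. This is only an illustration and does not use podio's API: objects are modelled as hypothetical `(collection_id, index)` tuples, `mark_live` and its arguments are made-up names, and "following an association" is assumed to mean going from an object to the objects it refers to.

```python
from collections import deque


def mark_live(collection_sizes, references, requested):
    """Return the set of object IDs that must be kept.

    collection_sizes: dict mapping collection_id -> number of objects
    references: dict mapping (collection_id, index) -> list of referenced IDs
    requested: collection IDs the user asked to have written
    (All names here are hypothetical, not podio API.)
    """
    live = set()
    queue = deque()

    # Every object in a requested collection is live from the start.
    for coll_id in requested:
        for index in range(collection_sizes[coll_id]):
            obj = (coll_id, index)
            live.add(obj)
            queue.append(obj)

    # Follow associations until no new objects are reached.
    while queue:
        obj = queue.popleft()
        for target in references.get(obj, ()):
            if target not in live:
                live.add(target)
                queue.append(target)

    return live
```

Each object is visited at most once, so the cost of this pass grows with the number of live objects plus the number of associations that leave them. Anything absent from the returned set would not be written.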
That sounds like an interesting approach. I think it could work, although there might be some edge cases to consider. One potential issue is the following: all objects are identified by their `ObjectID`, consisting of a `collectionID` and an `index` into that collection. We would have to make sure that these are properly set before any writing happens. I think (and this needs to be verified) that this should work, because the final setting of all of these before we write things happens in `prepareForWrite`, i.e. as long as things are pruned before that we should be able to get the `index` set correctly.
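To illustrate why the ordering matters, here is a minimal sketch of what pruning does to indices. It does not use podio's API: `reindex` and the `(collection_id, index)` tuples are hypothetical stand-ins for the `ObjectID`. Whether podio needs an explicit step like this, or whether pruning before `prepareForWrite` is enough as suggested above, still has to be verified.

```python
def reindex(collection_sizes, references, live):
    """Compute new IDs for surviving objects and remap their references.

    collection_sizes: dict mapping collection_id -> number of objects
    references: dict mapping (collection_id, index) -> list of referenced IDs
    live: set of (collection_id, index) tuples that survive pruning
    (All names here are hypothetical, not podio API.)
    """
    new_id = {}
    for coll_id, size in collection_sizes.items():
        next_index = 0
        for index in range(size):
            if (coll_id, index) in live:
                # Survivors keep their relative order but are packed densely.
                new_id[(coll_id, index)] = (coll_id, next_index)
                next_index += 1

    # Rewrite every surviving reference in terms of the new IDs.
    new_references = {
        new_id[source]: [new_id[t] for t in targets if t in new_id]
        for source, targets in references.items()
        if source in new_id
    }
    return new_id, new_references
```

If the object at index 1 of a collection is pruned, the object formerly at index 2 moves to index 1, so any reference that still stores the old index would point at the wrong object.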
As discussed during the EDM4hep meeting on Sep 10, we don't think a truly generic solution is possible for this. Hence, we decided that, at least for the foreseeable future, developments in this direction will (and should) focus on implementing the functionality needed to make things work for specific use cases where the expected outcome is well defined, e.g. skimming `MCParticles`.
See also discussion in: https://github.com/key4hep/k4FWCore/issues/226