OpenMS / OpenMS

The codebase of the OpenMS project
https://www.openms.de

Refactoring of Kernel classes: Memory footprint #1908

Closed hroest closed 5 years ago

hroest commented 8 years ago

Maybe we should begin with an analysis of the memory footprint of some kernel classes, like I did for PeptideHit in https://github.com/OpenMS/OpenMS/pull/1334#issuecomment-139043339, and identify some of the "worst offenders".

Also, in #1610 I found that simply clearing the convex hulls of Features already made a huge difference; this was a case where the subordinates and the hulls were clearly not needed (https://github.com/OpenMS/OpenMS/pull/1610/commits/d3be349fa7de8ea01fc736a0eecbb6f347db8d34). Performance actually improved when spending some effort to convert a Feature to a ConsensusFeature, since the latter has a smaller memory footprint (https://github.com/OpenMS/OpenMS/pull/1610/commits/0c11fdd2d6af7f3138b2a9fb1e1676f1bfab85e2). These are just some examples of the hoops one has to jump through to reduce memory. Of course, this also means that many algorithms require their input to be of a certain class (like Feature) but only use a small subset of the data structure and could well work on a much more lightweight version. In some cases, e.g. alignment, where the data only needs to be read and never written back out in full XML format with all the meta value information attached, it is actually reasonable to read in the full data structure and then convert it to a more lightweight one, simply because we have to read dozens to hundreds of runs.

We had the same issue for the TraML data structure where we also ended up using LightTransition etc for the actual computation because keeping the full datastructure that maps to the XML file around was too heavy.

hroest commented 8 years ago

We should also consider the cost of some of our objects, especially those that are inherited or created empty everywhere. For example, having an empty MetaInfo or an empty AASequence cost 48 bytes is just a waste, especially as they appear so often, and sometimes it is not even obvious that you are paying for them since they are inherited from somewhere. So we could/should 1. make them cheaper if possible and 2. replace them with pointers so that they only cost 8 bytes when unused. The other, better option is of course that the algorithm knows what it is going to use and the TOPP tool copies the appropriate data into a lightweight struct, which leads to (i) better and clearer interfaces, (ii) separation of algorithm and data storage (XML storage with free-text user params / CV params), and (iii) reduction of the memory footprint. Even better if we can identify a "common subset" for multiple algorithms, which then all work on the same lightweight structures.
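To make that last option concrete, here is a minimal sketch of such a copy step. `LightFeature`, `FullFeature`, and `toLight` are hypothetical names standing in for a real kernel class and a per-tool conversion; this is not existing OpenMS code:

```cpp
#include <vector>

// Hypothetical lightweight record: only the fields an alignment
// algorithm actually reads (RT, m/z, intensity), no meta values,
// no convex hulls, no subordinates.
struct LightFeature
{
  double rt;
  double mz;
  float intensity;
};

// Stand-in for a heavy kernel class such as OpenMS::Feature.
struct FullFeature
{
  double rt;
  double mz;
  float intensity;
  // ... meta values, convex hulls, subordinates omitted ...
};

// The TOPP tool would perform this copy once after file parsing and
// hand only the lightweight vector to the algorithm.
std::vector<LightFeature> toLight(const std::vector<FullFeature>& in)
{
  std::vector<LightFeature> out;
  out.reserve(in.size());
  for (const auto& f : in) out.push_back({f.rt, f.mz, f.intensity});
  return out;
}
```

The heavy objects can then be freed before the algorithm runs, which is where the actual memory saving comes from.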

hroest commented 8 years ago

Regarding AASequence, we currently have

So basically, to store PEPTIDE we use up 104 bytes, whereas in ASCII we could have stored the whole thing in 7 bytes; that is about 14x larger (or 93% overhead).

How could this be improved? There are many ideas, but here is one. We assume that we will most likely never have more than 128 distinct residues. So our new AASequence is just:

```cpp
class AASequence
{
  uint8_t* sequence_;
};
```

and we can rewrite ResidueDB to take 8-bit integers, translate them to Residues, and return a pointer to them. If we do this via a lookup table in the header, this could be quite fast (though not as fast as now, where we store pointers to Residues directly). We could also add a bitfield that lets us read out whether the first/last entry is a modification or really the first/last residue. We can even add a second pointer in case the first one is not enough and we actually have more residues than we can map:

```cpp
class AASequence
{
  uint8_t* sequence_;
  uint32_t* sequence_large_;
};
```

where only in exceptional circumstances we would use sequence_large_ (e.g. if the total number of distinct residues somehow exceeds 128). It adds 8 bytes of overhead. Of course the code may get ugly, as we would always have to test which array is populated.

This would allow us to store PEPTIDE in 23 bytes instead of 104 (16 bytes for the two 64-bit pointers and 7 bytes for the 7 uint8_t entries), or 44 bytes in the crazy case where we have to store 32 bits for each residue. That would still mean a reduction of about 4.5x or 2.4x in memory for either case. Of course this is just a crazy idea, but it would work and would reduce memory severalfold in most cases. If AASequence is ever the memory bottleneck, this would help while still allowing us to have all sorts of residues and modifications.

timosachsenberg commented 8 years ago

Very interesting. Some minor comments: I think AASequence should not have a vtable, as it doesn't use runtime polymorphism. I like the string representation (maybe even a C-string representation, as it is supported by newer SIMD instruction sets), but considering that the registry for the residues is a static database, we might run out of 8 bits pretty quickly. Another problem we would have to deal with is our tight integration of mass tags in OpenMS; they easily lead to an explosion in the number of registered residues. While 32 or 64 bits are most certainly sufficient for these cases, we would probably not save a lot of memory.

In certain cases, specialized solutions may circumvent memory problems. E.g., in my cross-link search engine I store a string view (pointer + sequence length) into the FASTA database, as well as an integer n that indicates the modification state of the peptide. Reconstructing the actual modified peptide is rather slow (e.g. 1 s for 10000 result peptides), as all combinatorial permutations need to be recreated, the n-th state selected, and the AASequence object reconstructed. In this case the extra seconds for the whole search are no problem, as this is only done to write results to a file, but for real-time calculations it would be too slow.
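A sketch of that string-view trick, with illustrative names (`PeptideRef`, `makeRef`); the actual cross-link search engine code may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical search hit: instead of a full AASequence object we keep
// only a view into the FASTA text plus which of the combinatorial
// modification states this peptide was.
struct PeptideRef
{
  std::string_view seq;  // pointer + length into the FASTA database
  uint32_t mod_state;    // index of the n-th modification permutation
};

// Build a reference from a position/length in the FASTA text.  The
// full modified AASequence would only be reconstructed (slowly) when
// writing results to a file.
PeptideRef makeRef(std::string_view fasta, std::size_t pos,
                   std::size_t len, uint32_t state)
{
  return {fasta.substr(pos, len), state};
}
```

The memory cost per hit is one pointer, one length, and one integer, regardless of how heavy the modified peptide object would be.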

hroest commented 8 years ago

With 32 bits we would at least get a 2x memory improvement. However, why are 8 bits not enough? I don't see why we should have that many residues; can you try and count the number of residues in one of your practical applications? For most applications that I use, I only have the 20 AA + Mox + Cmod and maybe sometimes phospho, all of which is well below the 128 or 256 values we could encode with 8 bits.

However, first we need to assess whether AASequence is really the memory bottleneck and, if so, for which applications. Just because we can make it more memory efficient (and more complicated) does not mean we should. Probably some large-scale MapAligner or FeatureLinking run is currently the most memory-consuming step in OpenMS, both of which potentially read IDs (depending on the algorithm). However, we should benchmark which structures use the most memory in these applications and see how to fix them.

timosachsenberg commented 8 years ago

If I model all RNA oligos + losses, I easily get 1000 mods. But this is a special application, and I agree that 8 bits would cover 99% of all use cases.

hroest commented 8 years ago

Yes, this is probably not a standard use case, and with 16 bits you should still be covered... so we could use a

```cpp
uint16_t* sequence_large_;
```

My idea would be that this can be handled transparently, i.e. the code switches over to the larger array as soon as the numbers don't fit any more...
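A sketch of that transparent switchover, under the assumption that promotion happens once, on the first index that does not fit into 8 bits; `GrowableSeq` and `fromIndices` are hypothetical helper names:

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical container: residues are appended as 8-bit indices until
// one does not fit, then the whole sequence is promoted to 16 bits once
// and stays there.  Callers never see which representation is active.
struct GrowableSeq
{
  std::vector<uint8_t> narrow;  // in use until promotion
  std::vector<uint16_t> wide;   // in use after promotion

  void push(uint16_t idx)
  {
    if (wide.empty() && idx <= 0xFF)
    {
      narrow.push_back(static_cast<uint8_t>(idx));
      return;
    }
    if (wide.empty())
    {
      // first oversized index: copy all existing data to the wide array
      wide.assign(narrow.begin(), narrow.end());
      narrow.clear();
    }
    wide.push_back(idx);
  }

  bool promoted() const { return !wide.empty(); }
};

GrowableSeq fromIndices(std::initializer_list<uint16_t> v)
{
  GrowableSeq s;
  for (uint16_t i : v) s.push(i);
  return s;
}
```

Sequences that stay within the common residue set never pay for the wide representation; only the rare case with exotic residues does.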

timosachsenberg commented 8 years ago

Or switching back to the old implementation if we run out of residues would also be an option (also via a pointer). That might be easier to implement, and we would probably stay closely tied to the way we handle residues in ResidueDB. Maybe I would even prefer this way of doing it.

hendrikweisser commented 8 years ago

I didn't follow all the details of this discussion, but some comments regarding AASequence: Many file formats store the unmodified sequence as well as a separate mapping of sequence positions to modifications (if necessary). Maybe worth considering doing this internally, too? Also, if information about the relevant modifications (e.g. from search parameters) is available for a set of sequences, the fixed modifications don't have to be stored explicitly with the sequences.
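A minimal sketch of the representation described here, with hypothetical names (`ModifiedSequence`, `toBracketNotation`); fixed modifications known from the search parameters would not need to appear in the per-sequence map at all:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical split representation: the plain unmodified sequence plus
// a sparse position -> modification map, mirroring how many file
// formats store this information.
struct ModifiedSequence
{
  std::string sequence;                    // e.g. "PEPTIDE"
  std::map<std::size_t, std::string> mods; // e.g. {3: "Phospho"}
};

// Render the sequence with modifications in brackets, a common
// text notation (illustrative, not an OpenMS function).
std::string toBracketNotation(const ModifiedSequence& s)
{
  std::string out;
  for (std::size_t i = 0; i < s.sequence.size(); ++i)
  {
    out += s.sequence[i];
    auto it = s.mods.find(i);
    if (it != s.mods.end()) out += "(" + it->second + ")";
  }
  return out;
}
```

Unmodified sequences then cost only the string itself, but each stored modification pays the map-node overhead, which is the trade-off Timo raises below for heavily modified peptides.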

timosachsenberg commented 8 years ago

Hi Hendrik, these issues have been created to collect discussion material at the class level, as these topics come up from time to time and we might benefit from tracking them. There are many ways of representing this information, and none works best for every use case. As you pointed out, given prior knowledge like fixed modifications (or, in my RNPxlSearch case, even for variable modifications), explicitly storing a full pointer to a residue can be avoided. If we want to change our KERNEL structure AASequence, we probably need a general solution that works without these "shortcuts". The nice property of your proposal is that unmodified AASequences are nicely represented as e.g. char*, and this structure would work fast with algorithms that operate on strings (e.g. digestion of unmodified FASTA sequences). On the other hand, you probably need more memory: the map (or even linked list) overhead is very large, considering that many peptides are modified (when implementing a search engine, the vast majority are modified). While I like this design a lot (e.g. because it is pretty clean), it would come with quite a lot of changes, as in OpenMS we don't track unmodified residues and modifications separately but use a single Residue* to track both.

hroest commented 8 years ago

Looking at how efficient encoding works in other places (e.g. UTF-8), one efficient solution could also be to store an 8-bit char of which the first 127 values are mapped to the most common AA and their modifications, while a value larger than 127 means we have a residue not currently encoded by those 127 positions. This is somewhat similar in spirit to what many file formats do (or used to do): using lowercase letters for modified residues or adding extra qualifiers (e.g. writing PEPT*IDE or PEPtIDE for phospho T).
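In code, such an escape-style encoding could look roughly like the following sketch; the single escape marker with a two-byte payload is one possible choice, not a worked-out design:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical variable-length residue encoding: indices 0..127 name
// the most common residues directly in one byte; anything larger is
// written as an escape marker followed by the full 16-bit index.
std::vector<uint8_t> encode(const std::vector<uint16_t>& indices)
{
  std::vector<uint8_t> out;
  for (uint16_t idx : indices)
  {
    if (idx < 128)
    {
      out.push_back(static_cast<uint8_t>(idx)); // common case: 1 byte
    }
    else
    {
      out.push_back(0x80);                           // escape marker
      out.push_back(static_cast<uint8_t>(idx >> 8)); // high byte
      out.push_back(static_cast<uint8_t>(idx & 0xFF)); // low byte
    }
  }
  return out;
}
```

Typical peptides with only common residues would stay at one byte per position, while exotic residues cost three bytes each instead of forcing the whole sequence into a wider representation.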

Of course, all these ideas make the implementation and usage much more complicated than a simple vector<Residue*>, but also more efficient.

Another question I raised in the beginning is whether it makes sense to have a one-size-fits-all solution or whether different applications may need specialized implementations (as Timo has done for his search engine). I think the current solution is architecturally quite "nice", but it is inefficient in memory; however, only a few tools may ever find this to be the limiting factor.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.