GRIFFINCollaboration / GRSISort

A lean, mean, sorting machine.

Memory usage during fragment tree creation #652

Closed jhenderson88 closed 8 years ago

jhenderson88 commented 8 years ago

Working with the latest version of the master. GRIFFIN data (only clovers, no auxiliary detectors). Run number 5759 (grsmid00/data3/griffin/90Sr_2016).

If I convert a 4GB midas file to a fragment tree, GRSISort maxes out on my machine's (Belodon) RAM (32GB). Extrapolating to absolute RAM usage implies that to convert a 4GB .mid file to a fragment tree requires almost 40GB of memory. The final fragment tree size is also approximately 4GB.

Based on the relative sizes of GRIFFIN events (calculated by @SmithJK) in .mid format and FragmentTree format, we expect that the uncompressed FragmentTree file should be about twice the size of the original midas file, which is still a long way short of what is required to use up 32GB on a 4GB file.

Do we understand why this is? It sounds like an issue in the unpacking to me (rather than just inefficient use of data types).

For large data sets the .mid to FragmentTree conversion seems to be the bottleneck in terms of data processing (see, e.g. Adam's ELog: https://elog.triumf.ca/Tigress/GRIFFIN/4752).

The same thing happens when sorting the files on grifuser/grsmid01 (although the memory doesn't max out due to there being more of it).

r3dunlop commented 8 years ago

I think a lot of this depends on how ROOT works. I would not be shocked if the following is happening: we load the entire Midas file into memory to process, we create fragments out of the midas events, and we tell ROOT to write each fragment; it makes a copy and writes that copy while holding the ROOT file in memory until it is told to close. It is also compressing, which definitely requires holding a copy of the information. You can take a smaller fragment file and print the uncompressed size of a tree using FragmentTree->Print(). For one of my runs I get an uncompressed tree of 853 MB from a 478 MB run, which is eventually compressed to 158 MB. So 2:1 is a pretty good estimate.
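
For reference, a quick way to look at these numbers yourself from a ROOT macro (the file name below is just a placeholder): TTree::Print() gives the per-branch summary, and GetTotBytes()/GetZipBytes() give the uncompressed and on-disk totals.

```cpp
#include <iostream>
#include "TFile.h"
#include "TTree.h"

void CheckTreeSize() {
   TFile file("fragment_run.root");                             // placeholder file name
   TTree* tree = static_cast<TTree*>(file.Get("FragmentTree"));
   if(tree == nullptr) return;
   tree->Print();                                               // per-branch basket/compression summary
   std::cout << "uncompressed: " << tree->GetTotBytes() << " bytes, "
             << "on disk: "      << tree->GetZipBytes() << " bytes" << std::endl;
}
```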

Perhaps we have memory fragmentation problems? @VinzenzBildstein Is our current container sequential, and do we allocate on the fly or do we allocate once at the beginning? I haven't looked at the code in a while so I don't know. It could also be that our buffers are too big for branches, etc.

There might be a "cat pipe" in TMidasFile right now, which is not ideal, but was a way to overcome super slow I/O from midas files when it was only one fragment per event. This was a terribly uncomfortable hack. That problem might be gone now, so the pipe could be removed. That might save loading the Midas file into memory.

I personally think it’s going to be very hard to playback data at the same rate the DAQ writes at, which the Elog says is our goal. The compression just takes a while to do and we can’t parallel process that unless we do not care about conserving the order of events. One could consider going Midas->Analysis and just making our analysis trees a bit bigger, but I’m not sure if that’s a good idea either. Either way, I don’t think memory is slowing us down, I think compressing and writing to disk is. I’m really not sure how to get around that right now. I can’t really look into this for a few weeks. But if it’s still a problem I’ll look then. For now can you just lower the size of the subruns?


jhenderson88 commented 8 years ago

We are lowering the size of the subruns back to 2GB now having realized this so it isn't a hugely urgent thing to fix. The memory usage seems unnecessarily high though and prevents you from doing things like running multiple iterations on multiple cores to speed up the data processing.

The compression factor for these fragment trees is 5.9, so that's ~24GB for an uncompressed file, which adds up to the memory usage we're seeing.

Of that, a relatively significant amount of space is being used for things like wavebuffers, despite the fact that these aren't in the data stream. The complete output is attached below:

fragtree_print.txt

I think you're right that the stated playback goal is going to be tough to hit while maintaining the flexibility that we want in the analysis.

r3dunlop commented 8 years ago

What were you determining the size of a TFragment to be? Without filling vectors etc, they should be 304 bytes each. One of my uncompressed trees is actually only 62% of the expected number (fragments * 304). However, it says that my 500 MB midas should hold 4.4 GB worth of data in it before any sort of compression. This is right on par with what you are seeing. How did you calculate 2x?

r3dunlop commented 8 years ago

It's quite easy to see them explode over the size of a Midas file. A midas "fragment" is 32 bits * roughly 10-14 data words, so let's say 40 bytes (this doesn't include midas headers). However, the midas header alone is written once to a midas event, but then shared among many fragments.

Midas time stamp = 8 bytes, MidasId = 4 bytes: that's 12 bytes extra added onto almost every single fragment compared to a midas file.

The first word of a fragment alone (4 bytes) gets put into: number of filter patterns (2 bytes), data type (2 bytes), pile-up type (2 bytes), address (4 bytes), and detector type (2 bytes) = 12 bytes compared to 4!

The K-Value of 10 bits, gets expanded to 16 bits per hit.

The vectors, however, seem to be the real problem. The charge vector which should hold 4 bytes per hit, is on average 18 bytes per fragment in my data. Even the empty vectors push into the 12 bytes per hit range. This would be the first place to look to reduce memory IMO (they also don't get compressed well).
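
As a rough illustration of that per-vector cost (a standalone check, not GRSISort code): on a typical 64-bit implementation an std::vector object is three pointers, i.e. 24 bytes, even when empty, and any elements live in a separate heap allocation on top of that.

```cpp
#include <iostream>
#include <vector>

int main() {
   std::vector<int>   charge;  // empty vector, no elements yet
   std::vector<short> kValue;  // element type does not change the object size
   std::cout << "sizeof(std::vector<int>)   = " << sizeof(charge) << " bytes\n";
   std::cout << "sizeof(std::vector<short>) = " << sizeof(kValue) << " bytes\n";
   // Typically prints 24 for both; pushing back elements adds a separate heap
   // allocation (often over-allocated), which is the per-fragment cost seen above.
   return 0;
}
```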

SmithJK commented 8 years ago

Here's my calculation of a MIDAS event vs. a TFragment. It's based on the newest event format, where the KValue, for example, is now allotted 14 bits in MIDAS.

Jenna's notes on MIDAS vs. TFragment size

r3dunlop commented 8 years ago

Hmmm, I'm not sure why it's so different. All of these things are in memory whether you write them to disk or not.

MidasTimeStamp 0 -> Timestamp of the MIDAS event
MidasId 0 -> MIDAS ID
TriggerId 0 -> MasterFilterID in Griffin DAQ
FragmentId 0 -> Channel Trigger ID
TriggerBitPattern 0 -> MasterFilterPattern in Griffin DAQ
NetworkPacketNumber 0 -> Network packet number
ChannelNumber -26215 -> Channel Number //ryan is going to delete this.
ChannelAddress 4294967295 -> Address of the channel
Cfd ->31c0f20 -> CFD of each pileup hit
Zc ->31c0f38 -> ZC of each pileup hit
ccShort ->31c0f50 -> Integration over the waveform rise (Descant only?)
ccLong ->31c0f68 -> Integration over the waveform fall (Descant only?)
Led ->31c0f80 -> LED of each pileup hit
Charge ->31c0f98 -> The Integrated Charge
TimeStampLow 0 -> Timestamp low bits
TimeStampHigh 0 -> Timestamp high bits
PPG 0 -> Programmable pattern generator value
DeadTime 0 -> Deadtime from trigger
NumberOfFilters 0 -> Number of filter patterns passed
NumberOfPileups 0 -> Number of piled up hits 1-3
DataType 0 ->
DetectorType 0 -> Detector Type (PACES, HPGe, etc)
ChannelId 0 -> Threshold crossing counter for a channel
KValue ->31c0fd0 -> KValue for each pileup hit
wavebuffer ->31c0fe8 -> waveform words
*fPPG ->0 !<!
NumberOfHits -1717986919 !<! transient member to count the number of pile-up hits in the original fragment
HitIndex -1717986919 !<! transient member indicating which pile-up hit this is in the original fragment
fUniqueID 0 object unique identifier
fBits 0x03000000 bit field status word

You are missing the 8-byte PPG pointer, the hit index and number of hits, and the TObject unique ID and fBits.

r3dunlop commented 8 years ago

Just doing a sizeof(TFragment) shows they are 304 bytes. A TObject gives you 16 bytes right off the top.

r3dunlop commented 8 years ago

Data members don't determine the entire size of an object by themselves.

http://www.cprogramming.com/tutorial/size_of_class_object.html

We should consider optimizing (all classes) in the near future, though. I'm not sure if the compiler does this for us or not.
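
A standalone toy (not the actual TFragment members) showing what that link describes: the compiler pads each member up to its alignment, so simply reordering members can change sizeof.

```cpp
#include <cstdint>
#include <iostream>

struct Unordered {      // 4 + 4 (padding) + 8 + 2 + 6 (padding) = 24 bytes on most 64-bit ABIs
   int32_t address;
   int64_t timeStamp;
   int16_t detectorType;
};

struct Ordered {        // 8 + 4 + 2 + 2 (padding) = 16 bytes, same members largest-to-smallest
   int64_t timeStamp;
   int32_t address;
   int16_t detectorType;
};

int main() {
   std::cout << sizeof(Unordered) << " vs " << sizeof(Ordered) << " bytes\n";
   return 0;
}
```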

VinzenzBildstein commented 8 years ago

@r3dunlop, we're only using the multiset during the "analysis" step (it's really more of an event building), and sizeof doesn't count the vector contents, so 304 bytes is without any charge, zc, ccShort, ccLong, Led, Cfd, or kvalues. But I think it is reasonable that a TFragment is bigger than the midas data.

One big question for me is why a vector without any entries takes so much space in ROOT. And maybe someone should use valgrind to see where the memory usage is.

r3dunlop commented 8 years ago

@VinzenzBildstein I know it doesn't count vectors. So the 304 is still much bigger than the above calculations. I really think that we should try to align our data members. We might win a significant amount of memory back just with that simple task.

SmithJK commented 8 years ago

I absolutely agree that it is reasonable for a TFragment to be larger than a MIDAS event, but I think we can definitely pare this down. TFragment only inherits from TObject, which is 16 bytes.

The next GRIF-16 event format will not have multiple interactions per MIDAS event. Will the GRIF-4Gs output multiple interactions per event? If not, these vectors could potentially be changed to single values. Though this would be disastrous for backward compatibility.

r3dunlop commented 8 years ago

Vinzenz and I were just talking and we think this is what is happening. All of which will break backwards compatibility unless someone writes some streamers.

Some of these changes reduce memory, while others reduce the size on disk. Vinzenz pointed out that we can actually get pretty close to our design goal by sorting 3 or 4 midas files at once, which requires a reduction in memory. The reduction in the size on disk helps us out linearly from what I've seen in the past: writing half as much to disk takes half the time. This was the motivation for lean grsi.

SmithJK commented 8 years ago

There are some other data members that can be "crushed down" as well.

"All of which will break backwards compatibility". Here, is backwards compatibility being defined as "able to read/user ROOT files generated with older versions" or "able to sort MIDAS files with older data formats"?

VinzenzBildstein commented 8 years ago

I think in most cases it means "able to read/use ROOT files generated with older versions", unless we make a mistake.

r3dunlop commented 8 years ago

You have to be careful with reducing the size of some data members. If you want to use the same Fragment for Tigress, you must have a 32 bit address for example.


VinzenzBildstein commented 8 years ago

I've tested replacing the vectors for Charge, KValue, Cfd, Led, Zc, ccShort, and ccLong with variable sized arrays (all the same size, which is NumberOfPileups). Doing this reduces the size of the tree for a 2GB midas file from 7.77 GB to 6.23 GB, which is about 20% less. The memory usage of grsisort peaked at 12GB and the program took 6m6s to finish. I didn't check those numbers for the version using vectors. The size per entry is 18 bytes for int-vectors with entries (Cfd and Charge), 14 bytes for int-vectors without entries (Zc, Led, ccShort, and ccLong), 16 bytes for short-vectors with entries (KValue), and 9.4 bytes for the arrays. We might be able to reduce the size further if we use different sizes for the arrays, e.g. one size for Charge, KValue, and Cfd, one size for Zc, ccShort, and ccLong, and a third size for Led.
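
For reference, a minimal sketch of what that change looks like in ROOT terms (the member names are illustrative, not the actual TFragment fields): a pointer member annotated with //[counter] is streamed by ROOT as a variable-length array of that length.

```cpp
#include "TObject.h"

class TFragmentSketch : public TObject {
public:
   Int_t    NumberOfPileups;   // length of the arrays below
   Int_t*   Charge;            //[NumberOfPileups] integrated charge per pile-up hit
   Int_t*   Cfd;               //[NumberOfPileups] CFD per pile-up hit
   Short_t* KValue;            //[NumberOfPileups] K-value per pile-up hit

   TFragmentSketch() : NumberOfPileups(0), Charge(nullptr), Cfd(nullptr), KValue(nullptr) {}
   virtual ~TFragmentSketch() {   // the class owns these arrays, so they must be freed here
      delete[] Charge;
      delete[] Cfd;
      delete[] KValue;
   }

   ClassDef(TFragmentSketch, 1);
};
```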

VinzenzBildstein commented 8 years ago

BTW, the sizes for members in the tree are as expected, 2 bytes for shorts, 4 bytes for ints (except for the fBits which are 8 bytes?), and 8 bytes for longs.

VinzenzBildstein commented 8 years ago

Since Zc, ccShort, and ccLong are only used in the 4Gs, which produce one hit per fragment, I've tested changing these from arrays to simple variables. I've also re-ordered the members in the TFragment to go from largest to smallest (to reduce the amount of alignment padding), but I haven't checked the effect of this yet. With all the above changes the memory consumption processing a 2GB midas file (compressed with bzip) is 8.8 GB and the program takes 5m32s to finish. I've also tested the original TFragment (using vectors), and that one uses 10 GB for the same file and finishes in 7m9s. So I went ahead and checked what happens if we use vectors for the charge and so on, but simple variables for the psd-values (Zc, ccShort, and ccLong). In that case the program used only 8.2 GB and finished in 6m23s. The different running times are most likely influenced by the resulting file size (using just vectors gives a 1.6GB file, just arrays 1.7 GB, vectors and int 1.3GB, and arrays and int 1.4GB).

Based on these numbers it seems as though the size of the tree as reported using the TTree::Print() doesn't really influence the memory usage of the program, as the tree size reduces from 7.77 GB to 6.23 GB when replacing the vectors with arrays, but the memory usage increases from 10 GB to 12 GB.

It also seems that vectors aren't very efficient when they are empty (according to http://info.prelert.com/blog/stl-container-memory-usage they use 24 bytes plus the actual heap).

WARNING: I just noticed that the TFragment destructor hasn't been dealing properly with the allocated arrays, i.e. there might have been a memory leak in all tests with arrays! I'll correct this and post updated results!

VinzenzBildstein commented 8 years ago

Okay, I've fixed the TFragment destructor, and these are the numbers I get:

TFragment Members        RAM     Run Time   File Size   Tree Size
just vectors (original)  10 GB   7m09s      1.6 GB      7.77 GB
just arrays              9.2 GB  6m28s      1.7 GB      6.23 GB
vectors and int          8.2 GB  6m23s      1.3 GB      6.70 GB
arrays and int           6.4 GB  5m36s      1.4 GB      5.66 GB

So it seems that replacing the vectors with arrays reduces the size of the tree and saves RAM. Replacing the Zc, ccShort, and ccLong vectors with simple ints improves things even further. This is definitely something we should do for ccShort and ccLong, but I've noticed that Zc is also used in Tigress now.

What does the Tigress data look like? Do we expect multiple Zc, Charge, and so on per fragment?

jhenderson88 commented 8 years ago

We shouldn't have multiple charge, etc. per fragment in TIGRESS.

pcbend commented 8 years ago

In Tigress data one will only ever get one of everything per fragment.

As far as memory usage is concerned between arrays and vectors in the ROOT tree, I know this has been a topic for some time that we have gone back and forth on. It seems pretty clear that the memory gains in storing and the time gains in compressing/uncompressing are worth it. I however like the idea of having vectors in the classes, as they make life easier for analysis. My first gut feeling for this is to make custom streamers to get the best of both worlds. Thoughts?


VinzenzBildstein commented 8 years ago

I don't think that custom streamers help with this; as far as I understand it ROOT writes the vectors as arrays anyway. That is the reason that empty vectors take 12 bytes in the tree: the array (which is a pointer) needs 8 bytes (64-bit addressing), and the integer that determines the length of the array needs 4 bytes. The only change we could make in a custom streamer is to use a single integer for multiple arrays, which is essentially what I did when using arrays in the TFragment.

But using arrays also reduces the memory usage of grsisort which is important if we want to run multiple instances (to increase the sorting speed). I think this is due to the fact that vectors have an overhead of 24 bytes compared to the 12 bytes of the array plus integer. So I think that at least for the TFragment we should think about switching to using arrays. We can still keep vectors in the analysis classes, because I agree with you, they make life much easier in the analysis.

If we decide to go this way, changing the TFragment would most likely require a re-sorting of all old data as well, unless we write custom streamers that can handle reading the older TFragment data as well as the new. Of course at that point the question whether we want to make all TFragment members private (and use our normal format of starting members with a lower case f) becomes relevant again. As far as I remember we decided against that when the question was originally raised, because it would require re-sorting, but if we need to do that anyway ...

SmithJK commented 8 years ago

If we are going to require re-sorting, we should stick the AcceptedChannelId back into the TFragment.

VinzenzBildstein commented 8 years ago

One note (in case we later want to compare these numbers), this was all done using ROOT 6 where at the moment we also write the fUniqueId and fBits from TObject.

SmithJK commented 8 years ago

The GRIFFIN event format allows for a single fragment to satisfy multiple filter conditions. When this happens, the fragment has multiple Filter Condition Counter Values (Trigger IDs or Master Filter IDs, depending on your preferred name). We currently allot the TriggerId as a Long_t. This will need to be changed to an array or a vector. Probably, based on this discussion, an array.

Regardless, in order to properly access the filter information, we will need to change a TFragment member.

r3dunlop commented 8 years ago

Is that how we are doing it? I feel like if you have a long (or some sufficiently long data type) you can set bits for various trigger conditions. If it isn’t like that in the data, perhaps we can consider doing that during sorting.


SmithJK commented 8 years ago

What you describe is exactly the Trigger Bit Pattern. Each of the filter conditions has a counter of how many fragments have passed that condition. The current value of that counter is included in any fragment that passes a filter. If multiple conditions are passed, then there are multiple counter values in the fragment.
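
A tiny sketch of that distinction (the names and types here are assumptions for illustration, not the actual data format): one word records which filter conditions passed, while a separate list carries the counter value for each passed condition.

```cpp
#include <cstdint>
#include <vector>

struct FilterInfo {
   uint32_t              triggerBitPattern = 0;   // bit i set => filter condition i passed
   std::vector<uint64_t> triggerIds;              // current counter value for each passed condition
};

// Record that 'condition' passed with its counter currently at 'counterValue'.
inline void AddPassedCondition(FilterInfo& info, unsigned condition, uint64_t counterValue) {
   info.triggerBitPattern |= (1u << condition);
   info.triggerIds.push_back(counterValue);
}
```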

VinzenzBildstein commented 8 years ago

I'm going to work on the streamer first (trying to get it to read old vector and new array data without problems). Once I've got that done I'll create a new branch and we can start discussing what else needs to be changed in the fragment.

VinzenzBildstein commented 8 years ago

I don't think that I'll manage to get the streamer to work with old and new fragments. It seems that when reading the old files, the streamer that was written to those files is being used (and fails), instead of the streamer that I wrote.

So the main question is whether we want to change the TFragment class to implement the arrays (and everything else that has been discussed so far)?

r3dunlop commented 8 years ago

I would go for yes, we change the fragment. It's not so bad to re-sort all of your fragments as long as you do it at a time when you don't need them.


SmithJK commented 8 years ago

I am also in favor of just going ahead and changing the fragments.

VinzenzBildstein commented 8 years ago

Okay, since no one has protested such a change, I think we should start and spend some time discussing everything that needs to be changed. Some things have already been mentioned above.

One thing I was thinking about (I haven't checked the numbers yet) is that there seem to be a number of variables that are just Tigress (e.g. Led) or just Griffin (ccShort, ccLong). The size of the fragment seems to be important to a) reduce the file size and thus the write speed, and b) reduce the memory usage, allowing us to run more instances in parallel. The latter is especially important if we ever want to reach the 300 MB/s sorting speed. So I was wondering what people think about creating different fragments (maybe all inheriting from a TFragment that has only a few virtual functions) for Tigress, 8pi, and Griffin? For the latter we might even consider having different fragments for old and new data, and for data with descant and without. This would be a bit of work to get working right, but it could possibly increase our sorting speed significantly.

VinzenzBildstein commented 8 years ago

After some emails with @pcbend about this, we've decided to just have an old fragment (which can be used for old griffin data) and a new fragment (which can be used with the new griffin data and with tigress data). The reason is that the new griffin data and the tigress data (old and new) don't differ much. The Led apparently isn't in the data anymore and the Zc is only calculated, so we can move that to the analysis step. So I'll go ahead and start working on getting the new fragment to work alongside the old fragment.

VinzenzBildstein commented 8 years ago

@r3dunlop , there is a comment next to the ChannelNumber member "ryan is going to delete this." Do you know what this is about? Can ChannelNumber be removed?

r3dunlop commented 8 years ago

I was going to “transient it”, or at least perform the calculation every time. It is redundant information which is the combination of the address and the cal file, so it does not need to be there.


SmithJK commented 8 years ago

@VinzenzBildstein , is the zero-crossing not going to be in the 4G data packet anymore?

VinzenzBildstein commented 8 years ago

The zero-crossing (and ccShort and ccLong) will still be in the 4G data packet, but my plan is to write these to a separate branch (either as a new class or simply as three ints). This branch (or branches) will only be written if descant is present according to the run info, otherwise it will not be there. That way we really have zero overhead without descant, and only some overhead if descant is being used. I think that is the best way to reduce the amount of data written (and stored in memory).
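
A hedged sketch of that idea (the branch and struct names are placeholders, not the final implementation): the PSD values get their own branch, created only when the run info says DESCANT is present.

```cpp
#include "TTree.h"

struct DescantPsd {   // hypothetical holder for the DESCANT PSD values
   Int_t Zc;
   Int_t ccShort;
   Int_t ccLong;
};

// Create the PSD branch only if DESCANT is in the run; otherwise the tree carries
// no trace of these members and no per-fragment overhead for them.
void SetupPsdBranch(TTree* tree, DescantPsd* psd, bool descantPresent) {
   if(descantPresent) {
      tree->Branch("DescantPsd", &psd->Zc, "Zc/I:ccShort/I:ccLong/I");
   }
}
```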

VinzenzBildstein commented 8 years ago

@jhenderson88 , I've tested sorting run05759_000.mid with the new fragment scheme (using TOldFragment), and the maximum memory usage was 26 GB. This is definitely an improvement over the 40 GB you saw. The size of the uncompressed fragment tree is 13.1 GB, so it's just using twice that memory, which seems reasonable to me, but to really know that we would need to use valgrind to find out where exactly the memory is being used. With data in the new format this might even be less.

jhenderson88 commented 8 years ago

OK, excellent. That should be much more manageable. I think when we worked it out twice the memory seemed like a reasonable guesstimate, so that sounds about right.

VinzenzBildstein commented 8 years ago

BTW, sorting one of the new format files (run06125_000.mid, 2GB) uses 3 GB of RAM with a tree size of 2GB.

r3dunlop commented 8 years ago

That’s fantastic.


VinzenzBildstein commented 8 years ago

Yes, what worries me is that the ratio of memory to tree size seems to differ. For the old data we need memory twice the size of the tree, for the new data we only need memory 1.5 times the size of the tree. The new data is very simple (either simple variables, or vectors), the old data uses pointers to variable size arrays. This means it's much easier to have a memory leak with the old data. I am investigating this, unfortunately valgrind's memcheck doesn't seem to be a big help.

r3dunlop commented 8 years ago

I believe this is fixed. Can someone confirm and close?