cms-gem-daq-project / gem-light-dqm

GEM light DQM code

Excessive memory consumption #32

Closed mexanick closed 5 years ago

mexanick commented 5 years ago

Brief summary of issue

The current version of gem-light-dqm consumes all available RAM plus a very large swap space, totaling around 13 GB. The initial suspicion that we have a memory leak somewhere appears to be incorrect. In fact, such large memory consumption is not a bug, but a "feature".

According to R. Brun, the size of a histogram in memory is: sizeof(TObject) + sizeof(Title) + sizeof(Name) + sizeof(Type_t)*(n_bins+2), where TObject is the histogram type (e.g. TH1D) and Type_t is its data type (e.g. Double_t for TH1D). I have calculated the memory size for the VFAT histograms and got the following:

root [5] s = sizeof(TH1D) + (4096+2)*sizeof(Double_t)+sizeof("BC")+sizeof("Bunch Crossing Number")
(int) 33809
root [8] int vfatH_size=0
(int) 0
root [10] vfatH_size +=s
(int) 33809
root [11] s = sizeof(TH1F)+sizeof("n_hits_per_event")+sizeof("n_hits_per_event")+sizeof(Float_t)*131
(int) 1558
root [12] vfatH_size +=s
(int) 35367
root [13] s = sizeof(TH1F)+sizeof("EC")+sizeof("Event Counter")+sizeof(Float_t)*257
(int) 2045
root [14] vfatH_size +=s
(int) 37412
root [15] s = sizeof(TH1F)+sizeof("Header")+sizeof("Header")+sizeof(Float_t)*34
(int) 1150
root [16] vfatH_size +=s
(int) 38562
root [17] s = sizeof(TH1F)+sizeof("SlotN")+sizeof("Slot Number")+sizeof(Float_t)*26
(int) 1122
root [18] vfatH_size +=s
(int) 39684
root [19] s = sizeof(TH1F)+sizeof("FiredChannels")+sizeof("FiredChannels")+sizeof(Float_t)*130
(int) 1548
root [20] vfatH_size +=s
(int) 41232
root [21] s = sizeof(TH1F)+sizeof("FiredStrips")+sizeof("FiredStrips")+sizeof(Float_t)*130
(int) 1544
root [22] vfatH_size +=s
(int) 42776
root [23] s = sizeof(TH1F)+sizeof("crc")+sizeof("check sum value")+sizeof(Float_t)*65537
(int) 263168
root [24] vfatH_size +=s
(int) 305944
root [25] s = sizeof(TH1F)+sizeof("crc_calc")+sizeof("check sum value recalculated")+sizeof(Float_t)*65537
(int) 263186
root [26] vfatH_size +=s
(int) 569130
root [27] s = sizeof(TH1F)+sizeof("crc_difference")+sizeof("difference between crc and crc_calc")+sizeof(Float_t)*65537
(int) 263199
root [28] vfatH_size +=s
(int) 832329
root [29] s = sizeof(TH1D)+sizeof("latencyScan")+sizeof("Latency Scan")+sizeof(Double_t)*258
(int) 3089
root [30] vfatH_size +=s
(int) 835418
root [31] s = sizeof(TH1D)+sizeof("latencyBXdiffScan")+sizeof("Latency Scan BX subtracted")+sizeof(Double_t)*4354
(int) 35877
root [32] vfatH_size +=s
(int) 871295
root [33] s = sizeof(TH2F)+sizeof("latencyScan2D")+sizeof("Latency Scan: Chan Vs Latency")+sizeof(Float_t)*1026*130
(int) 534596
root [34] vfatH_size +=s
(int) 1405891
root [35] s = sizeof(TH2D)+sizeof("latencyScanBX2D")+sizeof("Latency Scan vs BX")+sizeof(Double_t)*258*4098
(int) 8459339
root [36] vfatH_size +=s
(int) 9865230
root [37] s = sizeof(TH2F)+sizeof("latencyScanBX2D_extraHighOcc")+sizeof("Latency Scan vs BX when number of fired channels is greater than 100")+sizeof(Float_t)*258*4098
(int) 4230266
root [38] vfatH_size +=s
(int) 14095496
root [39] s = sizeof(TH1F)+sizeof("thresholdScan")+sizeof("Threshold Scan")+sizeof(Float_t)*258
(int) 2061
root [40] vfatH_size +=s
(int) 14097557
root [41] s = sizeof(TH2F)+sizeof("thresholdScan2D")+sizeof("Threshold Scan")+sizeof(Float_t)*258*130
(int) 135223
root [42] vfatH_size +=s
(int) 14232780
root [43] s = sizeof(TH1F)+sizeof("thresholdScanXXX")+sizeof("Threshold Scan XXX")+sizeof(Float_t)*258
(int) 2068
root [44] vfatH_size += s*128
(int) 14497484

which gives ~14.5 MB per single VFAT. For the current system of 3 AMCs, assuming all links are active, this gives ~14.5 MB * 24 * 12 * 3 ≈ 12.5 GB of memory required, and this does not account for the GEB, AMC and AMC13 histograms.
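For cross-checking, here is a minimal stand-alone sketch of the same estimate (not part of gem-light-dqm). The histSize helper and the scaling of 24 VFATs per optohybrid, 12 optohybrid links per AMC and 3 AMCs are assumptions chosen to reproduce the numbers above; the bin counts are taken from the session.

// Sketch of the estimate as a stand-alone ROOT macro (not gem-light-dqm code).
// Formula: sizeof(hist class) + strlen(name)+1 + strlen(title)+1 + element size * n_cells,
// where n_cells includes the under/overflow bins.
#include <cstddef>
#include <cstring>
#include <cstdio>
#include "TH1.h"
#include "TH2.h"

// Estimated footprint of one histogram; nBinsY == 0 means a 1D histogram.
std::size_t histSize(std::size_t classSize, const char* name, const char* title,
                     std::size_t elemSize, std::size_t nBinsX, std::size_t nBinsY = 0)
{
  std::size_t cells = (nBinsY == 0) ? (nBinsX + 2) : (nBinsX + 2) * (nBinsY + 2);
  return classSize + std::strlen(name) + 1 + std::strlen(title) + 1 + elemSize * cells;
}

void estimateVFATMemory()
{
  std::size_t perVFAT = 0;
  // Two examples from the list above; the remaining VFAT histograms add up the same way.
  perVFAT += histSize(sizeof(TH1D), "BC", "Bunch Crossing Number", sizeof(Double_t), 4096);
  perVFAT += histSize(sizeof(TH2D), "latencyScanBX2D", "Latency Scan vs BX", sizeof(Double_t), 256, 4096);

  // Assumed system size: 24 VFATs per optohybrid, 12 optohybrid links per AMC, 3 AMCs.
  std::size_t total = perVFAT * 24 * 12 * 3;
  std::printf("per VFAT: %.2f MB, system total: %.2f GB\n",
              perVFAT / 1.e6, total / 1.e9);
}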

Types of issue

Expected Behavior

The application should not consume all available RAM

Current Behavior

Consumes about 7GB of RAM + 6GB of swap.

Steps to Reproduce (for bugs)

Just run the tool and look at the output of the top command.

Possible Solution

Such huge memory consumption pushes the limits of our QC8 PC and leads to crashes in the dqm code. It is caused by improper use of the light dqm tool, which was originally developed as an expert tool for debugging small systems. As a short-term solution I propose to strip out a number of histograms, mainly VFAT ones: the two-dimensional latency scans and the per-channel threshold scan distributions, plus some others. This will allow the tool to run with fairly modest RAM consumption (3-4% of total RAM) and make the application much more stable. Of course, a central online DQM has to be implemented in the near future.
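As an illustration of the proposed stripping, a minimal sketch with hypothetical booking code (not the actual gem-light-dqm classes): the heavy 2D histograms are only booked behind an opt-in expert flag. The bin counts follow the estimate above; the axis ranges and the struct layout are assumptions.

#include "TH1.h"
#include "TH2.h"

// Hypothetical per-VFAT booking: the default (QC8) mode books only the light
// 1D latency histogram; the memory-hungry 2D versions are booked on request.
struct VFATHistos {
  TH1D* latencyScan     = nullptr;  // ~3 kB
  TH2F* latencyScan2D   = nullptr;  // ~0.5 MB per VFAT
  TH2F* latencyScanBX2D = nullptr;  // ~4 MB per VFAT when stored as Float_t

  void book(bool expertMode)
  {
    latencyScan = new TH1D("latencyScan", "Latency Scan", 256, -0.5, 255.5);
    if (expertMode) {
      latencyScan2D   = new TH2F("latencyScan2D", "Latency Scan: Chan Vs Latency",
                                 1024, -0.5, 1023.5, 128, -0.5, 127.5);
      latencyScanBX2D = new TH2F("latencyScanBX2D", "Latency Scan vs BX",
                                 256, -0.5, 255.5, 4096, -0.5, 4095.5);
    }
  }
};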

Your Environment

bdorney commented 5 years ago

Ah "oops" I guess is all I can say. Is this resolved by #33?

bdorney commented 5 years ago

@mexanick and @jsturdy I was on holiday when #33 was merged in, but now that latencyScan2D is gone this causes problems for the latency scan tool anaXDAQLatency.py; it was a useful histogram, see:

Is there any opposition to adding this histogram back? Is there perhaps a commit that could be reverted to restore just this histogram?

I think we have the memory for this histogram...? I agree we should remove the threshold histograms, since right now there's no mechanism that could even fill them (no scan application). Also, the other latency histograms were not being used.

mexanick commented 5 years ago

@bdorney this can be retrofitted. It is not possible to add it back via a cherry-pick, I think, but it is easy to do manually. To do it in one go: could you please take a good look at the current histogramming content and let me know which ones you want to add back? Also consider reducing data types where appropriate (e.g. float instead of double, int instead of float, etc.), as in the sketch below.
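For example (a sketch; the bin counts are those of latencyScanBX2D from the estimate above), the bin-content array dominates the footprint, so switching the stored type scales the histogram size roughly with the element size:

#include <cstddef>
#include <cstdio>
#include "TH2.h"  // defines TH2D, TH2F, TH2S and the Double_t/Float_t/Short_t typedefs

void binStorageComparison()
{
  // Bin-content array of a 256 x 4096 histogram, under/overflow included.
  std::size_t cells = (256 + 2) * (4096 + 2);
  std::printf("TH2D (Double_t): %.1f MB\n", cells * sizeof(Double_t) / 1.e6);
  std::printf("TH2F (Float_t) : %.1f MB\n", cells * sizeof(Float_t)  / 1.e6);
  std::printf("TH2S (Short_t) : %.1f MB\n", cells * sizeof(Short_t)  / 1.e6);
}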