cms-gem-daq-project / reg_utils


Python tools killed on the CTP7 with the address table for 12 OH's #41

Open lpetre-ulb opened 5 years ago

lpetre-ulb commented 5 years ago

When trying to launch the gbt.py tool on the CTP7 with the address table for 12 OH's in order to perform a phase scan, the process is killed.

Brief summary of issue

During the tests of the new CTP7 release (version 3.7.0) with 12 OH's, the Python tools on the CTP7 stopped working. Some of the tools can be used from the DAQ machine, but others, such as gbt.py, must currently be called from the CTP7.

Each tool using the address table is killed because of an out-of-memory issue during pickle file loading: https://github.com/cms-gem-daq-project/reg_utils/blob/7b9cf050c63dae5f3a9b5803389130cff2f039fa/python/reg_interface/common/reg_xml_parser.py#L106-L113

The precise error is the following:

eagle63:~$ gbt.py 0 0 v3b-phase-scan /mnt/persistent/gemdaq/gbt/OHv3b/20180314/GBTX_OHv3b_GBT_0__2018-03-14_FINAL.txt
Open pickled address table if available  /mnt/persistent/gemdaq/xml/gem_amc_top.pickle...
Killed
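For context, the failing step is a plain unpickling of the full address table. A minimal sketch of such a loader is shown below; the function name, path handling, and the GC toggling are illustrative assumptions, not the actual reg_xml_parser code. Disabling the cyclic GC is a common unpickling speed-up, but it does not reduce the peak memory needed to hold the deserialized tree, which is presumably why the process gets OOM-killed:

```python
import gc
import pickle

def load_address_table(path):
    """Sketch of the failing step: unpickle the whole address table.

    The entire node tree is materialized in memory at once, so the
    peak RSS scales with the number of nodes (12 OH's vs. 4 OH's).
    """
    gc.disable()  # common speed-up; does NOT lower peak memory usage
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        gc.enable()
```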

Types of issue

Expected Behavior

I expected the gbt.py tool to perform a phase scan without any error.

Current Behavior

The gbt.py tool is currently killed.

eagle63:~$ gbt.py 0 0 v3b-phase-scan /mnt/persistent/gemdaq/gbt/OHv3b/20180314/GBTX_OHv3b_GBT_0__2018-03-14_FINAL.txt
Open pickled address table if available  /mnt/persistent/gemdaq/xml/gem_amc_top.pickle...
Killed

Steps to Reproduce (for bugs)

  1. Connect to a CTP7 with an address table for 12 OH's, e.g. ssh gemuser@eagle63
  2. Launch the phase scan command: gbt.py 0 0 v3b-phase-scan /mnt/persistent/gemdaq/gbt/OHv3b/20180314/GBTX_OHv3b_GBT_0__2018-03-14_FINAL.txt
  3. The process is killed.

Possible Solution (for bugs)

Enabling the GC did not help. The gbt.py tool could be refactored to run from the DAQ machine.

Your Environment

Default environment:

TERM=xterm-256color
SHELL=/bin/sh
USER=gemuser
LD_LIBRARY_PATH=:/mnt/persistent/gemdaq/lib:/mnt/persistent/rpcmodules
PATH=/mnt/persistent/gemuser/bin:/mnt/persistent/gemdaq/python/reg_interface:/usr/local/bin:/usr/bin:/bin:/mnt/persistent/gemdaq/scripts:/mnt/persistent/gemdaq/bin
PWD=/mnt/persistent/gemuser
EDITOR=vi
LANG=en_US.UTF-8
TZ=UTC
PS1=\h:\w\$
SHLVL=1
HOME=/mnt/persistent/gemuser
LANGUAGE=en_US.UTF-8
GREP_OPTIONS=--color=auto
LS_OPTIONS=--color=auto
LOGNAME=gemuser
GEM_PATH=/mnt/persistent/gemdaq
_=/usr/bin/env
bdorney commented 5 years ago

Can you try the following:

From the DAQ machine, try to call confChamber.py with --run and --vt1=X for some X not equal to 100. This will use the LMDB. Does the configuration succeed, i.e. does the out-of-memory error also occur when trying to read the LMDB?

bdorney commented 5 years ago

If there's no issue when using the LMDB then this means we need to either:

  1. Understand memory limits and if it's possible to reduce the pickle file size, or
  2. Migrate the Python tools on the CTP7 to dedicated rpcmodules.

We cannot increase the memory of the card.

lpetre-ulb commented 5 years ago

So, I tried to revert the address table, pickle file, and LMDB to the 4 OH's case. In that case, everything works as expected: the Python tools on the CTP7 are not killed and the chamber can be configured from the DAQ machine with the confChamber.py script.

When coming back to the 12 OH's case, the issue is back. The Python tools on the CTP7 are killed, but confChamber.py succeeds. The LMDB does not create an out-of-memory issue.

I'll investigate the memory limits more carefully, but here are a few observations:

I think it is possible to reduce the size of the pickle file sent to the CTP7, but that would require creating a lightweight Node Python class. That might not be the best solution... Migrating to dedicated rpcmodules looks more future-proof.
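As an illustration of what such a lightweight class could look like, the sketch below compares a plain class with a hypothetical slimmed-down variant using `__slots__` (class names and fields are illustrative, not the actual reg_utils `Node`). Slotted instances drop the per-instance `__dict__`, which is where most of the per-node overhead lives when hundreds of thousands of nodes are unpickled:

```python
import sys

class Node(object):
    """Plain class: every instance carries its own __dict__."""
    def __init__(self, name, address):
        self.name = name
        self.address = address

class SlimNode(object):
    """Hypothetical lightweight variant: __slots__ removes __dict__."""
    __slots__ = ("name", "address")
    def __init__(self, name, address):
        self.name = name
        self.address = address

plain = Node("GEM_AMC", 0x0)
slim = SlimNode("GEM_AMC", 0x0)
# The slotted instance has no per-instance dict at all:
assert not hasattr(slim, "__dict__")
```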

Anyway, I'll try to understand what limits the node count/pickle file size on the CTP7.

lpetre-ulb commented 5 years ago

The size of the nodes OrderedDict, as reported by the pympler asizeof module, is ~429 MiB: too much to fit on the CTP7.
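For readers without pympler at hand, a rough stdlib-only approximation of such a deep-size measurement can be written with a recursive `sys.getsizeof` walk (this is a simplified sketch of the idea behind asizeof, not its actual algorithm; it undercounts some object internals):

```python
import sys

def deep_size(obj, seen=None):
    """Rough recursive size estimate of an object graph in bytes."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, seen) for item in obj)
    elif hasattr(obj, "__dict__"):  # follow instance attributes
        size += deep_size(obj.__dict__, seen)
    return size
```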

Reducing the size of the `Node` class seems a waste of time and would lead to maintaining two address tables. It would be better to move to RPC modules; see this issue for the follow-up.

mexanick commented 4 years ago

I guess we want this eventually https://lmdb.readthedocs.io/en/release/

lpetre-ulb commented 4 years ago

Instead of packaging an external Python package for the CTP7 and redeveloping the register parsing code, I would rather write a small Python wrapper (boost::python or pybind11) around the few useful functions in our code (readReg/writeReg).