R3BRootGroup / R3BRoot

Framework for Simulations and Data Analysis of R3B Experiment
https://github.com/R3BRootGroup/R3BRoot/wiki
GNU General Public License v3.0

[BIG UPDATE] Package manager, build.py, data IO using STL and Neuland time calibration rewrite #890

Closed YanzhaoW closed 3 months ago

YanzhaoW commented 1 year ago

This PR contains lots of new features. Here is the list:

  1. Implementing the Conan package manager for R3BRoot. Just like pip or Conda for Python, Conan is the most popular package manager for C++ projects. It integrates seamlessly with CMake without the need to set up any environment variables. Its main difference from Spack (which is used by FairRoot) is that Conan is a project-wise package manager: different projects can have different packages without any interference with each other. To add a library to the project, just add its name and version to the conanfile.txt in the root directory. Currently four libraries are added:
    • ms-gsl/4.0.0
    • range-v3/0.12.0
    • gtest/cci.20210126
    • nlohmann_json/3.11.2
  2. Adding a build script build.py to manage all the build processes for R3BRoot. To install the conan packages, use ./build.py -p. To configure cmake, use ./build.py -c. And to build the project, use ./build.py -b. To do everything: ./build.py -a. For more options, check ./build.py -h.
  3. Completely getting rid of TClonesArray for data input and output. All data read or written with a ROOT TTree now uses STL types. For example, Neuland mapped data is stored as std::vector<R3B::PaddleTamexMappedData> and trigger mapped data is stored as std::map<unsigned int, R3B::PaddleTamexTrigMappedData>.
  4. Adding an algorithm to automatically detect the Neuland TrigIDMap, which can later be saved into a JSON file. If the algorithm is disabled, the TrigIDMap needs to be read from a JSON file. JSON reading and writing is implemented using nlohmann's JSON library.
  5. No index arithmetic is used for data analysis. Instead, the range-v3 library is used to construct data binding and data filtering.
  6. The following classes are rewritten as new classes (with type aliases):
    • R3BPaddleTamexMappedData -> R3B::PaddleTamexMappedData(R3BPaddleTamexMappedData2)
    • R3BTCalPar -> R3B::TCalPar (R3BTCalPar2)
    • R3BNeulandTamexReader -> R3BNeulandTamexReader2
    • R3BNeulandMapped2CalPar -> R3B::Neuland::Mapped2CalPar(R3BNeulandMapped2CalPar2)
    • R3BNeulandMapped2Cal -> R3B::Neuland::Mapped2Cal (R3BNeulandMapped2Cal2)
  7. All the new implementation is placed in a new folder: neuland/calibration/calLevel/.
  8. Nearly all old code stays untouched.
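For reference, a conanfile.txt covering the four packages listed in item 1 would look roughly like this (the four requires lines are from the list above; the [generators] section is an assumption about which conan/CMake integration is used):

```
[requires]
ms-gsl/4.0.0
range-v3/0.12.0
gtest/cci.20210126
nlohmann_json/3.11.2

[generators]
CMakeDeps
CMakeToolchain
```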

TODO:

There is no hurry to merge this PR into the dev branch. Feel free to leave comments below if there are any suggestions. In the meantime, I will try to finish the energy calibration rewrite in a few weeks.



YanzhaoW commented 1 year ago

Hi @kresan, could you review this PR, in particular build.py (a build script for R3BRoot) and the use of parameters without parameter factories?

klenze commented 1 year ago

Package Manager

From my understanding, the conan dependencies would not be automatically installed if people continued to run cmake, which is why you added the build.py.

I would be reluctant to wrap the top level of the build process. cmake and make both take a zillion optional flags, so it is worthwhile for users to know how to run them manually so that they have the option to pass some of these flags.

I am also doubtful that our lead developers will be easily sold on adding an extra step in the build process. Fetching external packages from cmake, while not particularly elegant, seems more likely to succeed.

External dependencies

One of the main benefits of adding external dependencies in cmake is that it is painful, which dissuades people from adding too many. I don't know much about C++ json libraries (I am more of a yaml fan), but it seems to me that we already get a json lib with boost. I have not used boost json, it could be an utter horror to work with for all I know. But if it is reasonable, I would much prefer a boost library over an external dependency, even if it is less shiny. The reason is that someone might have to compile this code in a decade, and I am not sure if nlohmann_json or conan will still be maintained by then. (If boost is not maintained by then, that is FairSoft's problem.)

With gtest, my understanding is that FairSoft ships with a version, but you would like a more recent version?

Class Design

I had a look at MappedDataReader. I appreciate your trying to separate the ugly FairRoot glue code from the code which does the actual useful work, but I am not convinced your model can scale into a general solution to the problem of handling FairRootManager. (Of course, you never claimed that, so this is a bit like "I just bought some groceries" - "You fool, that will never scale to solve world hunger".)

In theory, that class could be templated over MappedData to generate a reader for any r3b data container. However, this won't work too well. The main problem is that as soon as you need a task which reads from two different data containers, you would need multiple inheritance. Of course, an instance of Reader<T> cannot provide a method Get##T, because only the preprocessor can concatenate tokens, so it would at most provide a method T GetContainer(T* dummy). This in turn will fail if a class needs two instances (with different runtime string names in FRM) of the same data type, say the WRTS from both the MainDAQ and Neuland. I guess if one found a way to smuggle strings into template parameters (lambda expressions do work), one could use template packs to do something like

class MainNeulandWRTSDifference: 
public TemplateReader<DataInstance<WRTSData, "MainWRTS">, DataInstance<WRTSData, "NeulandWRTS">> 
{
void Exec(...)
{
   auto mainWrts=GetContainer<DataInstance<WRTSData, "MainWRTS">>();
   auto neulandWrts=GetContainer<DataInstance<WRTSData, "NeulandWRTS">>();
   ...
}
};

What I would instead prefer is the following:

class MainNeulandWRTSDifference : public BetterTask
{
    auto fWrtsMain=FetchData<WRTSData>("MainWrts");
    auto fWrtsNL=FetchData<WRTSData>("NeulandWrts");
    auto fDiff=RegisterOutput<WrtsDifferenceData>("MainNLWrtsDiff");
    auto fSomePar=FetchParameter<WrtsClock>("WrtsClock");
...
};

Because the objects cannot actually be fetched when the class members are initialized, accessing them will involve an additional level of indirection. (The alternative is the ugly syntax of auto fWrtsMain=FetchData<WRTSData>(fWrtsMain, "MainWrts");, which would allow BetterTask to store a reference to fWrtsMain and assign it at an appropriate time.)

The implementation of BetterTask::FetchData will of course be a bit tricky. I would use a

std::vector<std::function<void(FairRootManager*)>> fOnInit;

template<class T>
T* FetchData(T*& target, std::string_view name)
{
    fOnInit.push_back(
       [name, &target](FairRootManager* frm)
       {
           target = frm->InitObjectAs<T>(name);
           // error handling
       });
    return nullptr;
}

(Ugly fast version without indirect reference, the other version would require more memory storage. Of course, it is probably possible to have a simple pointer-wrapping class which overloads operator= and copy constructor and tells BetterTask where the pointer it needs to update is stored. This would be actually neatly byzantine.)

Perhaps I will get around to writing a proof of concept for that at some point.

YanzhaoW commented 1 year ago

From my understanding, the conan dependencies would not be automatically installed if people continued to run cmake, which is why you added the build.py.

Not really. The conan dependencies only need to be installed the first time, with:

conan install . --output-folder=build --build=missing

And then when you run cmake, you need to add a cmake flag to let cmake know where the packages are:

cmake ..  -DCMAKE_TOOLCHAIN_FILE=Dir/conan_toolchain.cmake

For the second or third run, you can just run cmake .. since the package path is already cached. This is what is really nice about conan: it's not intrusive. You don't need to add or change anything in your cmake script to use conan. If conan is not supported anymore (which is highly unlikely), we could just switch to another package manager without changing the cmake script. The reason there is a build.py is that I want to simplify the process when the project is built for the first time. Actually, it's also very simple after the first build.

But you are right about this point. Using ./build.py -a is different from the traditional workflow mkdir build && cd build && cmake .. && make. However, a build script is indeed very common for a large C++ project. Some people choose a make script, some a bash script and some a python script. It's all just to make the build process user friendly.

That being said, there is ongoing development for conan2 (see this), which allows cmake to automatically install all conan packages and cache the paths. It's already done for conan1 and now they are trying to make it compatible with conan2.

I am also doubtful that our lead developers will be easily sold on adding an extra step in the build process. Fetching external packages from cmake, while not particularly elegant, seems more likely to succeed.

I wouldn't say they are fetched from cmake. There is a FetchContent function in cmake, which everyone in the C++ community complains about. Conan, on the other hand, is similar to the Spack they are using for FairSoft. The only difference is that Conan is much, much nicer to integrate with cmake than Spack. With Spack, you basically need to set the PATH env variable and tell cmake to find the Config.cmake files via those paths. If someone sets PATH differently, then it's screwed. But with Conan, cmake just needs to use conan_toolchain.cmake, which specifies the package paths and the cmake targets directly. This is hardly surprising, since Spack is more like dnf or apt, which can install packages for many languages systemwide, whereas Conan is a package manager only for C++.
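To illustrate the non-intrusive part: once the conan-generated toolchain file is passed via -DCMAKE_TOOLCHAIN_FILE, a CMakeLists.txt consumes the packages through plain find_package calls. A rough sketch (Microsoft.GSL and range-v3 are the standard exported package names for ms-gsl and range-v3; the R3BNeulandCal target name is made up for illustration):

```cmake
# Nothing conan-specific appears in the CMake code itself;
# the toolchain file makes these packages discoverable.
find_package(Microsoft.GSL REQUIRED)   # ms-gsl
find_package(range-v3 REQUIRED)

# Hypothetical target name, for illustration only.
target_link_libraries(R3BNeulandCal PRIVATE Microsoft.GSL::GSL range-v3::range-v3)
```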

I don't know much about C++ json libraries (I am more of a yaml fan), but it seems to me that we already get a json lib with boost. I have not used boost json, it could be an utter horror to work with for all I know.

Yes, I have heard about the boost JSON library. There is also another JSON library called simdjson. Among these three libraries, I personally think nlohmann_json is probably the easiest to use. It has a super nice API and its json data structure works much like any STL container. Maybe I will also try boost.json; with a package manager, switching to a different library is very easy.

I wouldn't say yaml is very suitable for storing data. It's more for configuration than for data storage. Actually, I tried to use yaml to configure the neuland executable using yaml-cpp (see here). But since then I keep thinking: maybe python or lua is better for configuring a C++ project? Anyway, C++ JSON libraries are much more mature than yaml libraries and, as this reddit post suggests, most people use JSON.

YanzhaoW commented 1 year ago

I appreciate you trying to separate the ugly FairRoot glue code from code which does actual useful work, but I don't think I am convinced your model can scale into a general solution to the problem of handling FairRootManager.

Well, there was no MappedDataReader in the beginning. Later I found there was a lot of duplicated code between R3BNeulandMapped2Cal and R3BNeulandMapped2CalPar. Both need to read the mapped data, one to calculate parameters and the other to calculate cal level data. Trying to follow best practice, where duplication is "evil", I created a class to eliminate all the duplication. This class is definitely not meant to solve the boilerplate problem of FairRootManager.

Because the objects cannot actually be fetched when the class members are initialized, accessing them will involve an additional level of indirection.

I don't know how data is read from ucesb. But with a TTree, all it does in each loop is change the value of a variable that has already been initialised. I have no idea why it's so complicated with FairRootManager (I still need some time to figure out what's going on there).

class MainNeulandWRTSDifference : public BetterTask
{
    auto fWrtsMain=FetchData<WRTSData>("MainWrts");
    auto fWrtsNL=FetchData<WRTSData>("NeulandWrts");
    auto fDiff=RegisterOutput<WrtsDifferenceData>("MainNLWrtsDiff");
    auto fSomePar=FetchParameter<WrtsClock>("WrtsClock");
    ...
};

I think this way is the best. The only change I would make is to assign the fetched data to a member variable rather than a local variable. And I also think it's best to let tasks own their fetched data, instead of registering all the data somewhere else and deleting it at the end.

klenze commented 1 year ago

RE neatly byzantine: https://gist.github.com/klenze/34907b55b9bcf1b1401a9af1005a5064 (I would consider this moderately evil. Real evilness would be to rely on return value optimization). Also, I have decided that FairerTask is a nicer name than BetterTask :-)

kresan commented 1 year ago

I understand what the Conan package manager can do, but why do you need it for R3BRoot? To handle 4 dependencies?

Concerning the container factories, can you boil down your 3,400 lines of changes to a short description of how you access parameter containers without factories?

YanzhaoW commented 1 year ago

I understand what the Conan package manager can do, but why do you need it for R3BRoot? To handle 4 dependencies?

I would say: to handle dependencies more easily and flexibly. It's necessary to have a package manager for the project itself rather than waiting for library updates from upstream (in our case, FairSoft). If we have a package manager locally, using a new library becomes super easy: just add a line to conanfile.txt and it's done immediately. The other option is to wait for FairSoft to add the library, which could take quite a long time.

Concerning the container factories, can you boil down your 3,400 lines of changes to a short description of how you access parameter containers without factories?

Sure, please look at the function MappedDataReader::SetParContainers. Here I just use addContainer from FairRuntimeDb (sorry, I actually meant FairRuntimeDb when I said FairRootManager previously):

if (auto* rtdb = FairRuntimeDb::instance(); rtdb != nullptr)
{
    calibrationPar_ = std::make_unique<TCalPar>(calibrationParName_.c_str()).release();
    calibrationTrigPar_ = std::make_unique<TCalPar>(calibrationTrigParName_.c_str()).release();
    rtdb->addContainer(calibrationPar_);
    rtdb->addContainer(calibrationTrigPar_);
    if (calibrationPar_ == nullptr or calibrationTrigPar_ == nullptr)
    {
        throw R3B::runtime_error("calibration parameters are still nullptr!");
    }
    SetExtraPar(rtdb);
}

Here is the function definition of addContainer:

Bool_t FairRuntimeDb::addContainer(FairParSet* container)
{
    auto name = container->GetName();

    if (!containerList->FindObject(name)) {
        containerList->Add(container);
        TIter next(runs);
        FairRtdbRun* run;
        FairParVersion* vers;
        while ((run = static_cast<FairRtdbRun*>(next()))) {
            if (!run->getParVersion(name)) {
                vers = new FairParVersion(name);
                run->addParVersion(vers);
            }
        }
        // cout << "-I- RTDB entries in list# " <<  containerList->GetEntries() <<"\n" ;

        return kTRUE;
    }

    Warning("addContainer(FairParSet*)", "Container %s already exists!", name);
    return kFALSE;

}

Here you can see it has nothing to do with a container factory. All it does is add the container to the containerList.

And what we usually do with a parameter (a container) is use another function called getContainer(), which internally checks the name against the container factories, creates the container and calls addContainer(). Of course, in this case, the real name of the container is not the string we give to getContainer, but that string concatenated with another string called the "context". I certainly don't think this is completely useless. But in many cases, when people just need to create a single instance of a parameter, using a container (aka parameter) factory is unnecessary.

klenze commented 1 year ago

Re FairerTask, it turns out that C++ disallows auto for non-static member variables. I have no idea why. The best I can do is

Ptr<const R3BCalifaCrystalCalData::container_t>  fCalifaMappedData=AddInput(fInputName);

Even that is tricky. You cannot template-match on return types. However, you are allowed to have a class which defines a templated type conversion operator. Through that operator, one can learn the desired return type.

This is of course a hack. I don't think there is a very neat solution: Either you replace the variable declaration by a preprocessor call, or duplicate the type declaration, or abuse type conversion operators. What C++ generally lacks is a high level metalanguage. The preprocessor (which works on strings) is terrible. I think Rust is better in this regard.

YanzhaoW commented 1 year ago

Ptr<const R3BCalifaCrystalCalData::container_t> fCalifaMappedData=AddInput(fInputName);

Maybe you could default-initialise the variable first and then AddInput to FairRootManager in the Init() function:

private:
    Ptr<const R3BCalifaCrystalCalData::container_t>  fCalifaMappedData;
auto Init()
{
    AddInput(fInputName, fCalifaMappedData);
}

With this, the type read from the tree can be deduced from the input parameter type.

Or a better way could be:

private:
    Ptr<const R3BCalifaCrystalCalData::container_t>  fCalifaMappedData = "default_input_string";
auto Init()
{
    fCalifaMappedData.Init();
}
kresan commented 1 year ago

@YanzhaoW Sure, if you create parameter containers and add them to the list yourself, then you do not need container factory.

If you decide to handle external software within R3BRoot and to (partially) replace FairSoft with a package manager, you (R3B) will need to maintain it.

YanzhaoW commented 3 months ago

This PR will be divided into multiple PRs with smaller sizes.