JeffersonLab / halld_recon

Reconstruction for the GlueX Detector
7 stars 9 forks source link

Crash at the end of REST file processing with Jana2 #834

Open aaust opened 3 months ago

aaust commented 3 months ago

Reported by @zihlmann

I tried to use JANA2 on the ifarm using version_5.19.1_jana2.xml and I looked at CPP/NPP data REST files. To me it seems to always crash at the end or close to the end. I tried 2 files and in both cases the crash is around event # 2million. So it looks like at the end of the file.

The command I used was: hd_root -PLUGINS monitoring_hists /cache/halld/RunPeriod-2022-05/recon/ver01/REST/101534/dana_rest_101534_000.hddm

aaust commented 3 months ago

I just built a new version set version_5.19.2_jana2.xml with Raiqa's fix. The hd_root process seems to end correctly at the end of the file, but the output from monitoring_hists looks very different from the Jana1 output. Somehow, I also get all these error messages:

DL1MCTrigger_factory:  This program MUST be used with an HDDM file as input!
   Default seeds will be used for the random generator

Beni @zihlmann , could you please run the test again?

RaiqaRasool commented 3 months ago

The difference in results between JANA1 and JANA2 in monitoring_hists is expected due to the changes in how the Get function behaves in JANA2 compared to JANA1.

Screenshot

If the level of discrepancy you've observed is somewhat the same as it is shown in the image above then it is indeed due to the updated Get function. To achieve consistent results with what produced in JANA1, you'll need to build halld_recon against the rasool_jana2_mimicking_jana1 branch rather than the rasool_jana2 branch.

Regarding the DL1MCTrigger_factory error you mentioned, I am not sure why you are getting that. I ran the monitoring_hists plugin using the following command and it ran without any errors for me:

hd_root /cache/halld/RunPeriod-2023-01/rawdata/Run121120/hd_rawdata_121120_000.evio -PPLUGINS=monitoring_hists

Could you please share the exact command you used? This would help me reproduce the issue and investigate further.

zihlmann commented 2 months ago

using the new xml file (jana2 15.19.2) the command: hd_root -PLUGINS=monitoring_hists /cache/halld/RunPeriod-2022-05/recon/ver01/REST/101534/dana_rest_101534_000.hddm

does run through without error. However, the hd_root.root output file is empty.

aaust commented 2 months ago

Sorry, there was one P missing: hd_root -PPLUGINS=monitoring_hists /cache/halld/RunPeriod-2022-05/recon/ver01/REST/101534/dana_rest_101534_000.hddm

nsjarvis commented 2 months ago

Referencing this: https://github.com/JeffersonLab/halld_recon/issues/435

zihlmann commented 2 months ago

ups, i should have seen that myself. now I get the error similar to you:

DL1MCTrigger_factory: This program MUST be used with an HDDM file as input! Default seeds will be used for the random generator

on another note: I still get the same error when trying to compile a new plugin under JANA2 with scons [zihlmann@ifarm2401 TESTJANA2]$ scons scons: Reading SConscript files ... scons: done reading SConscript files. scons: Building targets ... Compiling [JEventProcessor_TESTJANA2.cc] In file included from JEventProcessor_TESTJANA2.cc:8: JEventProcessor_TESTJANA2.h:13:61: error: expected class-name before ‘{’ token 13 | class JEventProcessor_TESTJANA2:public jana::JEventProcessor{ | ^ JEventProcessor_TESTJANA2.h:20:17: error: ‘jerror_t’ does not name a type; did you mean ‘error_t’? 20 | jerror_t init(void); ///< Called once at program start. | ^~~~ | error_t JEventProcessor_TESTJANA2.h:21:17: error: ‘jerror_t’ does not name a type; did you mean ‘error_t’? 21 | jerror_t brun(jana::JEventLoop *eventLoop, int32_t runnumber); ///< Called everytime a new run number is detected. | ^~~~ | error_t

RaiqaRasool commented 2 months ago

The tests I performed before were using .evio files as input to monitoring_hists and I never got this error during those tests. However, I encountered the error mentioned above when running it with an .hddm file, so I’ll need to investigate this further. Thanks for sharing the command.

As for the error you're seeing when trying to compile a new plugin under JANA2 with scons, this is happening because the mkplugin scripts haven’t been updated yet for JANA2. We're in the final stages of updating them, but we’re holding off on completing this until we finalize the documentation on the syntax changes from JANA1 to JANA2. This way, we can ensure any additional updates required for the code generated by these scripts are included. I’ll do my best to have this ready before the next GlueX meeting. I apologize for the inconvenience.

aaust commented 2 months ago

Thank you! I repeated your test with hd_rawdata_121120_000.evio. I processed the file with 12 threads with JANA1 (-PNTHREADS=12) which resulted in 120Hz. On the other hand, I was not able to use JANA2 in multithreaded mode so I only got a 14Hz average.

aaust commented 2 months ago

Here is a link to a pdf of all monitoring histograms for Jana1 and Jana2 for this raw data file: https://halldweb.jlab.org/talks/2024/jana2_comparison_independent_121120_000.pdf Besides minor differences in the pulls (pages 96-113), I see a few other issues:

Are these differences that are expected due to the change of the Get() function? Some of the look quite drastic.

RaiqaRasool commented 2 months ago

Thank you very much for sharing the results in PDF format. Would it be possible for you to build a version of JANA2 that mimics JANA1 in terms of Get() function behavior by using rasool_jana2_mimicking_jana1 branch instead of rasool_jana2? If feasible, could you also generate the same set of results you provided in the PDF using that version? This would help us quickly identify whether the observed differences are solely due to the Get() function.

I can certainly handle this myself, but I don't have the script you used to generate these results. If it wouldn’t be too much trouble, could you assist with this?

In the meantime, I’m actively investigating the issue with:

DL1MCTrigger_factory: This program MUST be used with an HDDM file as input!
Default seeds will be used for the random generator

and I'm close to pinpointing the cause. Additionally, I'm also looking into why multithreading isn't functioning properly for .evio files in the context of monitoring histograms.

sdobbs commented 2 months ago

For what it’s worth, the DL1MCTrigger error message was addressed in the main branch with a couple different PRs over the last few months. I’m not sure when you last brought those changes over, but that could be something to look at as well.

On Fri, Sep 6, 2024 at 8:05 AM Raiqa Rasool @.***> wrote:

Thank you very much for sharing the results in PDF format. Would it be possible for you to build a version of JANA2 that mimics JANA1 in terms of Get() function behavior by using rasool_jana2_mimicking_jana1 branch instead of rasool_jana2? If feasible, could you also generate the same set of results you provided in the PDF using that version? This would help us quickly identify whether the observed differences are solely due to the Get() function.

I can certainly handle this myself, but I don't have the script you used to generate these results. If it wouldn’t be too much trouble, could you assist with this?

In the meantime, I’m actively investigating the issue with:

DL1MCTrigger_factory: This program MUST be used with an HDDM file as input! Default seeds will be used for the random generator

and I'm close to pinpointing the cause. Additionally, I'm also looking into why multithreading isn't functioning properly for .evio files in the context of monitoring histograms.

— Reply to this email directly, view it on GitHub https://github.com/JeffersonLab/halld_recon/issues/834#issuecomment-2333907243, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJAS2XBFDFKF3MESVCUKX3ZVGK7XAVCNFSM6AAAAABNM5IEY2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZTHEYDOMRUGM . You are receiving this because you are subscribed to this thread.Message ID: @.***>

RaiqaRasool commented 2 months ago

Thanks for pointing that out, Sean. I searched for DL1MCTrigger in the PRs and commits on the master branch but couldn't find anything. Could you share the specific PRs where this was addressed?

I last rebased the rasool_jana2 branch with the master branch on August 12th.

sdobbs commented 2 months ago

OK, I think the relevant changes are to DTrigger_factory.cc, since that should be the only place we get the DL1MCTrigger objects for raw data.

I replaced this logic with a different factory in this PR - https://github.com/JeffersonLab/halld_recon/pull/828#issuecomment-2289411561 so if that's not in your branch, you might want to merge this in first.

---Sean

On Fri, Sep 6, 2024 at 8:42 AM Raiqa Rasool @.***> wrote:

Thanks for pointing that out, Sean. I searched for DL1MCTrigger in the PRs and commits on the master branch but couldn't find anything. Could you share the specific PRs where this was addressed?

I last rebased the rasool_jana2 branch with the master branch on August 12th.

— Reply to this email directly, view it on GitHub https://github.com/JeffersonLab/halld_recon/issues/834#issuecomment-2333968150, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJAS2SSQLHWPXDJFGWEU4TZVGPKXAVCNFSM6AAAAABNM5IEY2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZTHE3DQMJVGA . You are receiving this because you commented.Message ID: @.***>

RaiqaRasool commented 2 months ago

Thanks, Sean. I reviewed the PR and found the changes made in DTrigger_factory.cc.

aaust commented 2 months ago

@RaiqaRasool I repeated the comparison with the branch rasool_jana2_mimicking_jana1 (version_5.19.3_jana2.xml). Indeed, most differences disappeared or became very small. Only the difference in the dead zone of the 2-dimensional TOF matching histograms (pages 161, 163, 230, 231) remain striking. It could still be that your halld_recon branch is not 100% aligned with what I was using for comparison (https://github.com/jeffersonlab/halld_recon/releases/tag/4.49.0), so I don't worry too much about it. Here is a link to the pdf file: https://halldweb.jlab.org/talks/2024/jana2_comparison_independent_121120_000_mimicking_jana1.pdf