JeffersonLab / sim-recon

Simulation and Reconstruction for GlueX

Tracking error under CentOS7 #591

Open sdobbs opened 8 years ago

sdobbs commented 8 years ago

I ran monitoring_hists on one file on the new CentOS 7 machine, ifarm1402, and it printed one error to the screen that does not appear when running on CentOS 6.

File: /cache/halld/RunPeriod-2016-02/rawdata/Run011529/hd_rawdata_011529_057.evio

Error: libraries/TRACKING/DTrackWireBased_factory.cc:307 Invalid seed data for event 54793166...

sdobbs commented 8 years ago

Attached is the output from some of the monitoring_hists histograms, where the black line is CentOS 6 results and the blue dashed is CentOS 7. Differences between the two are also shown. It looks like we are losing some tracks...

centos67_comp.pdf

pmattjlab commented 8 years ago

Hey Sean, I recently added a new histogram: NumFDCPseudoHits. It's in your doc for CentOS 6 (page 63), but not CentOS 7 (nor is there the difference). Can you produce those plots? It's really important because:

1) # CDC hits and CDC track candidates are identical.
2) # FDC wire & cathode hits are the same.
3) # FDC track candidates are different.

So it'd be interesting to see if the issue is with the pseudos. If so, are you sure you're using the same constants for each? (Probably are, just checking though).

pmattjlab commented 8 years ago

Oh, and no, we are actually GAINING tracks on CentOS 7. There are fewer events with zero tracks, and more events with more tracks (e.g. page 8).

sdobbs commented 8 years ago

The calibrations are the same, but let me redo this comparison with the latest sim-recon to be sure...

sdobbs commented 8 years ago

I redid this check with the latest sim-recon, didn't see the tracking error under CentOS 7, and am getting comparable results for both OS versions. Looks like things got out of sync before. So I think this looks good?

centos67_comp_v2.pdf

pmattjlab commented 8 years ago

So, this looks an order of magnitude better, but still not identical. What's weird is that for CDC track candidates, they were identical in your last study (slide 26 & 28), but they are different in your new study.

But, not only are they different, but it looks like there's a different # of events between centos 6 & 7. They agree everywhere except at the low-#-tracks bins, where centos 6 is showing fewer tracks. Since these histograms are filled once per event, that means that there must be a different # entries for the two tests.
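The argument above can be illustrated with a toy sketch. The per-event track counts here are invented for illustration and the helper names are hypothetical; nothing is taken from the actual monitoring_hists output. The only point is that a histogram filled once per event has a total entry count equal to the number of events processed.

```python
from collections import Counter

def fill_once_per_event(tracks_per_event):
    """Return {n_tracks: n_events}; each event contributes exactly one entry."""
    return Counter(tracks_per_event)

def n_entries(hist):
    """Total entries, i.e. the sum of all bin contents."""
    return sum(hist.values())

# Made-up inputs: run_b sees one extra event, and it has zero tracks.
run_a = fill_once_per_event([0, 1, 1, 2, 3])
run_b = fill_once_per_event([0, 0, 1, 1, 2, 3])

# Bins whose contents differ between the two runs.
differing = sorted(k for k in set(run_a) | set(run_b) if run_a[k] != run_b[k])
entry_gap = n_entries(run_b) - n_entries(run_a)
```

In this toy case only the zero-track bin differs, and the entry totals differ by exactly the number of extra events, which is the pattern described above.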

Can you confirm this?

sdobbs commented 8 years ago

Well, this is odd. Both jobs report the same number of EVIO blocks read, but slightly different number of events processed in JANA.

Number of events:

CentOS 6: command line: 960192, monitoring_hists/IsEvent: 908804
CentOS 7: command line: 960123, monitoring_hists/IsEvent: 908842

The difference between the number of events on the command line and in monitoring_hists is presumably the number of "special" events (e.g. EPICS).
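For reference, the arithmetic on the counts quoted above can be written out as a short sketch. The variable names are illustrative, and reading the gap between the two counters as "special" events is only the presumption stated above, not something the numbers confirm.

```python
# Event counts copied from the numbers quoted above.
counts = {
    "CentOS 6": {"command_line": 960192, "is_event": 908804},
    "CentOS 7": {"command_line": 960123, "is_event": 908842},
}

def non_physics(c):
    """Events counted on the command line but not in monitoring_hists/IsEvent."""
    return c["command_line"] - c["is_event"]

special = {os_name: non_physics(c) for os_name, c in counts.items()}

# OS-to-OS differences (CentOS 7 minus CentOS 6) for each counter.
delta_command_line = counts["CentOS 7"]["command_line"] - counts["CentOS 6"]["command_line"]
delta_is_event = counts["CentOS 7"]["is_event"] - counts["CentOS 6"]["is_event"]

for os_name, n in special.items():
    print(f"{os_name}: {n} events outside IsEvent")
print(f"command line delta: {delta_command_line}, IsEvent delta: {delta_is_event}")
```

So the CentOS 7 job reports 69 fewer events on the command line but 38 more in IsEvent, and the gap between the two counters is 51388 on CentOS 6 versus 51281 on CentOS 7.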

@faustus123, maybe we are losing some events in the parsing stage?

Some other diagnostics:

CentOS6

EVIO Processing rate = 92.6819 Hz
NDISPATCHER_STALLED = 9567087 (92.3%)
NPARSER_STALLED = 12061830 (58.2%)
NEVENTBUFF_STALLED = 2 ( 0.0%)

CentOS7

EVIO Processing rate = 362.049 Hz
NDISPATCHER_STALLED = 2482786 (93.6%)
NPARSER_STALLED = 159191 ( 3.0%)
NEVENTBUFF_STALLED = 407923 (15.4%)

faustus123 commented 8 years ago

I did a check on this earlier this evening using the latest master. I just ran the online_occupancy plugin and I saw some small differences between the centos6 and centos7 results. I then re-ran on centos6 and saw that was also different from the first run on centos6. The difference was smaller than that between centos6 and centos7, but still, it wasn't zero.

This is almost certainly either a race condition or an ordering issue similar to those I tracked down last summer while working on the new parser. The different compilers used on the two OSes are probably the reason the centos6-centos7 discrepancy is larger than the one between the first and second centos6 runs. These are tedious and difficult problems to track down, so unless someone has a great idea where to look, it's going to take some serious digging.
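One possible way to quantify this kind of run-to-run difference is a bin-by-bin comparison of the same histogram from two jobs. The sketch below is an assumption about how that could be done, not the check that was actually run: the function name is hypothetical, the bin contents are invented, and extracting bin contents from the ROOT output files is left out.

```python
def differing_bins(hist_a, hist_b):
    """Return [(bin_index, content_a, content_b)] for every bin that differs."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same binning")
    return [(i, a, b) for i, (a, b) in enumerate(zip(hist_a, hist_b)) if a != b]

# Hypothetical bin contents, not taken from the actual jobs.
centos6_run1 = [120, 340, 560, 210]
centos6_run2 = [120, 341, 560, 209]
centos7_run1 = [118, 345, 560, 204]

same_os = differing_bins(centos6_run1, centos6_run2)
cross_os = differing_bins(centos6_run1, centos7_run1)

print(f"centos6 vs centos6: {len(same_os)} differing bins")
print(f"centos6 vs centos7: {len(cross_os)} differing bins")
```

Running the same comparison on two jobs from the same OS gives a baseline for how much of the cross-OS difference is plain non-reproducibility.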

Any volunteers?

faustus123 commented 8 years ago

In case it’s unclear, all of the “_STALLED” values just help indicate which set of threads tends to be idle. This helps tell whether we’re I/O bound or CPU bound. I don’t expect them to carry relevant information about the exact number of events. (Both of these jobs are CPU bound.)

Which file were you using here?
