cms-L1TK / cmssw

Fork of CMSSW where improvements to L1 tracking code are developed.
http://cms-sw.github.io/
Apache License 2.0
4 stars 5 forks source link

Segmentation violation when writing tables #203

Closed aehart closed 1 year ago

aehart commented 1 year ago

A segmentation violation now occurs when following the instructions for generating test-vectors here: https://github.com/cms-L1TK/firmware-hls/tree/504b9711c7b620ffeb6d10fbe9715c113fb67527#full-configuration

The GNU debugger reports the following line in trklet::TrackletLUT::writeTable as the culprit: https://github.com/cms-L1TK/cmssw/blob/23a8a8cfaf3086866b503032b56d93578de8fcc1/L1Trigger/TrackFindingTracklet/src/TrackletLUT.cc#L1425

There is nothing obviously wrong with the above line, and I suspect some object, like a TrackletLUT, is not being allocated properly, or has been prematurely deallocated. I have not done any further debugging though.

aehart commented 1 year ago

I've traced this issue to nbits_ not being set for many of the TrackletLUT objects in the emulation. This causes width to get a junk value, which in turn causes a segmentation violation when setw is called.

This seems to be an issue for the following methods in TrackletLUT, i.e., in each of these methods, TrackletLUT::writeTable is called without nbits_ first being assigned a proper value, as far as I can tell:

@skinnari @tomalin, you're more familiar with who is responsible for these various LUTs. Could you help assign people to fix these methods?

I would also recommend initializing nbits_ to some junk value, like zero perhaps, and then raising an exception at the beginning of TrackletLUT::writeTable if nbits_ is still equal to this junk value. This is to avoid this issue in the future. Either that or add a CI test where writeTable_ is true.

tomalin commented 1 year ago

Why is it crashing now, and not previously? You're now using fw_synch_220523 . Was it running successfully with an earlier branch? If so, which one?

aehart commented 1 year ago

Why is it crashing now, and not previously? You're now using fw_synch_220523 . Was it running successfully with an earlier branch? If so, which one?

The last time I ran with writeTable_ set, I was using the fw_synch220523 tag. With this tag, TrackletLUT::writeTable only uses `nbitswhenpositive_is true. There is no call tosetw` when writing the table: https://github.com/cms-L1TK/cmssw/blob/306e80db0e0a95eb87ed5f2cd59493da8011336f/L1Trigger/TrackFindingTracklet/src/TrackletLUT.cc#L897

Now, I am trying to use the HEAD of the L1TK-dev-12_6_0pre5 branch, in order to do another FW synchronization. Now, TrackletLUT::writeTable uses `nbitsto set a width that is then passed tosetw`: https://github.com/cms-L1TK/cmssw/blob/f1951c72551d09202b7586773bbe3769250c12fa/L1Trigger/TrackFindingTracklet/src/TrackletLUT.cc#L1415 https://github.com/cms-L1TK/cmssw/blob/f1951c72551d09202b7586773bbe3769250c12fa/L1Trigger/TrackFindingTracklet/src/TrackletLUT.cc#L1425

So if nbits_ is uninitialized, then this width often winds up being extremely large, which, for some reason, causes a segmentation violation when passed to setw.

tomalin commented 1 year ago

@aehart I've reproduced your crash, and confirm your helpful diagnosis. It seems that people forgot to set "nbits" in any function where positive=false. The bug was introduced in https://github.com/cms-L1TK/cmssw/pull/189 . Although I'm named as the author of this PR, all I actually did was include and resolve conflicts between two closed PRs from @aryd and @bryates https://github.com/cms-L1TK/cmssw/pull/182 and https://github.com/cms-L1TK/cmssw/pull/185 respectively, both of which changed TrackletLUT.cc . The fatal line "int width = (nbits_ + 3) / 4;" was introduced in @aryd 's PR, so he is best placed to comment.

tomalin commented 1 year ago

Unless Anders makes a better suggestion, then if, immediately after the line "int width = (nbits + 3) / 4;" one adds the line "if (not positive) width = 0;", this prevents the crash and I think restores the original setw() behavior for negative numbers.

aryd commented 1 year ago

I’m completely saturated with a deadline tonight for submitting material for a review. I should be able to take a look tomorrow.

I fixed one instance of this bug where bits_ was not set a month or so ago in a pull request so I’m a little confused about why we have this problem - but I have not had a chance to look.

-Anders

Anders Ryd @.**@.>

On Feb 27, 2023, at 6:38 PM, Ian Tomalin @.**@.>> wrote:

Unless Anders makes a better suggestion, then if, immediately after the line "int width = (nbits + 3) / 4;" one adds the line "if (not positive) width = 0;", this prevents the crash and I think restores the original setw() behavior for negative numbers.

— Reply to this email directly, view it on GitHubhttps://github.com/cms-L1TK/cmssw/issues/203#issuecomment-1447286549, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABI4LTJMJW57Z7I6JAPF3ELWZU3HDANCNFSM6AAAAAAULPPDJM. You are receiving this because you were mentioned.Message ID: @.***>

tomalin commented 1 year ago

This issue is addressed by https://github.com/cms-L1TK/cmssw/pull/211 . Any further comments related to it should be made in the review of that PR.