cms-L1TK / firmware-hls

HLS implementation of the tracklet pattern reco modules of the Hybrid tracking chain
15 stars 24 forks source link

Attempt to fix Future L1 Track SW + bugs in HLS memory classes. #347

Open tomalin opened 1 month ago

tomalin commented 1 month ago

The Future L1Trk CMSSW SW has not yet migrated to the 2FPGA solution. I'm therefore trying to bodge it so it continues to work (for now) with the 1FPGA solution.

1) The three functions in TrackBuilder_parameters.h with names getMPAR*() caused the CMSSW compiler to fail because of "duplicate function definitions". I fixed this by declaring them to be "constexpr" or "inline" in generate_TB.py. I also (on style grounds) changed their template parameter from type int to TF::seed. 2) MemoryTemplateBinnedCM::clear() used constant NCOPY, which caused array-out-of-bounds errors in CMSSW, where NCOPY can be zero. Replaced NCOPY with NCP. 3) CMSSW compiler errors about "duplicate code", solved by deleting the calculation of nentries withing CMSSW_GIT_HASH that was in MemoryTemplateBinnedCM & MemoryTemplateBinned & MemoryTemplatePROJ. 4) Added constants DEPTH_BX and DEPTH_ADDR to MemoryTemplate.h, MemoryTemplateBinnedCM.h and MemoryTemplateTPROJ.h, to simplify them. In MemoryTemplateTPROJ.h, replaced magic number 128 by DEPTH_ADDR. 5) Modified MemoryTemplateTPROJ to it instantiates only one page of memory when used inside CMSSW. 6) Replaced argument addr_index of MemoryTemplate::write_mem_new() by argument "ap_uint<1> overwrite", with the same number of bits used by MatchProcessor.h HERE when it calls write_mem_new(). Howver, this code inside write_mem_new(), which should protect the memory against having too many entries does nothing, since addrindex is a 1 bit number. This is a BUG. I replaced it by a check that nentries[ibx] is than than (1<<NBIT_ADDR). However, array access can be slow in FW, so one should check if this causes FW timing errors. 8) Besides the style issue that it's hard-wiring the number of pages to 4, https://github.com/cms-L1TK/firmware-hls/blob/master/TestBenches/FileReadUtility.h#L259 seems to incorrectly assume that MemoryTemplateTPROJ::getDepth() returns the total depth of the memory, whereas it can be seen from https://github.com/cms-L1TK/firmware-hls/blob/master/TrackletAlgorithm/MemoryTemplateTPROJ.h#L49 that it returns a number that is a factor NPAGE less than this. I believe this is a BUG and that compareMemWithMemPage() is only checking one quarter of the memory. Fixed. 9) SWLUTReader.h was writing by 1 unit beyond the array bounds of the input "lut" array, because it neglected that the input .tab files contain a final line with a "close bracket" but no number. This BUG had horrible consequences, causing all LUT contents to become zero in CMSSW. Now fixed. 10) Modified download.sh, so that when running the CombinedCM chain, the TP takes its LUT tables from LUTsSplit/ instead of LUTsCM/ . This reduces the rate of memory errors, as LUTsCM/ is out-of-date. All other modules in the chain already used LUTsSplit/.

THIS PR NEEDS DISCUSSION WITH @aehart and @aryd BEFORE MERGING.

tomalin commented 1 month ago

Strange. This fails git CI because of 2FA authentification issues https://github.com/cms-L1TK/firmware-hls/actions/runs/10853460640/job/30121761689?pr=347 . @aehart the L1Trk CMSSW code in our github cms-L1TK area runs CI in gitlab, and gitlab has a token (which I created) https://gitlab.cern.ch/cms-l1tk/cmssw_CI/-/settings/access_tokens so that it can access our github repo. Our github repo stores this at https://github.com/cms-L1TK/cmssw/settings/secrets/actions . The firmware-hls cms-L1TK github repo has no such tokens https://gitlab.cern.ch/cms-l1tk/firmware-hls/-/settings/access_tokens . But I'd have predicted that this would could 2FA failures for the HLS since Oct. 2023, not just now.

tomalin commented 1 month ago

QUESTIONS/ISSUES: 1) With this PR, the L1Trk Future SW runs in CMSSW14, but reports huge numbers of memory content inconsistencies in the TP, MP & TB. Although we use the CombinedCM wiring map, we use the memory content .txt files & LUT tables from the FPGASplit/ solution, as they are more up-to-date, and reduce the error count. The errors seen in the TP are due to expected the Tracklets all being found, but stored in a different order in the TPAR memories to in the reference files. 2) Anders's branch of the old CMSSW emulation no longer writes the 1-FPGA solution memories to the .txt files. This will need fixing if the Future SW is not to see more memory errors.