adamantine-sim / adamantine

Software to simulate heat transfer for additive manufacturing
https://adamantine-sim.github.io/adamantine/

Seg fault during covariance matrix construction for data assimilation only for particular numbers of MPI tasks #314

Open stvdwtt opened 1 month ago

stvdwtt commented 1 month ago

Summary: An IMTS gear simulation crashes after data assimilation, but only when it is run with 27 MPI processes. Runs with 9, 18, 30, and 60 processes all work.

Test case: https://github.com/adamantine-sim/demonstration-cases/tree/main/IMTS_Parts/GEAR-OP04

To run this, copy the input files out of simulation_template into its parent GEAR-OP04 directory and then run adamantine.

We get a seg fault at the line below, with an index that is greater than the number of degrees of freedom:

https://github.com/adamantine-sim/adamantine/blob/587a4ead79730ea7b30829346766f7e8bc781598/source/DataAssimilator.cc#L581

It's unclear whether the issue is in indices_ranks or if j has an invalid value.
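One way to tell the two cases apart would be a guard placed just before the failing access. The snippet below is only a sketch with hypothetical names and types: `IndexRank`, `diagnose_lookup`, `checked_dof_index`, and `n_dofs` are assumptions for illustration, and the real layout of `indices_ranks` in `DataAssimilator.cc` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of indices_ranks:
// a degree-of-freedom index and the MPI rank associated with it.
struct IndexRank
{
  std::size_t dof_index;
  int rank;
};

// Returns a message describing the first problem found,
// or std::nullopt if the lookup at position j is safe.
std::optional<std::string>
diagnose_lookup(std::vector<IndexRank> const &indices_ranks, std::size_t j,
                std::size_t n_dofs)
{
  // Case 1: j itself does not address a valid entry.
  if (j >= indices_ranks.size())
    return "j = " + std::to_string(j) + " is outside indices_ranks (size " +
           std::to_string(indices_ranks.size()) + ")";

  // Case 2: j is valid, but the stored index exceeds the number of DoFs.
  if (indices_ranks[j].dof_index >= n_dofs)
    return "indices_ranks[" + std::to_string(j) + "].dof_index = " +
           std::to_string(indices_ranks[j].dof_index) +
           " is not smaller than n_dofs = " + std::to_string(n_dofs);

  return std::nullopt;
}

// Debug-build guard: aborts if the lookup is unsafe, otherwise returns the index.
inline std::size_t
checked_dof_index(std::vector<IndexRank> const &indices_ranks, std::size_t j,
                  std::size_t n_dofs)
{
  assert(!diagnose_lookup(indices_ranks, j, n_dofs).has_value());
  return indices_ranks[j].dof_index;
}
```

Printing the returned message together with the MPI rank on the 27-process run should show whether `j` or the stored index is the bad value.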

@Rombur, could you take a look at this?

Rombur commented 1 month ago

Ughh it passed on my machine. My guess is that due to numerical error, ArborX is giving us results we cannot deal with. Have you tried running in debug mode?

stvdwtt commented 1 month ago

Yeah, debug mode is how I localized it. It fails with 27 ranks on the SCOPS-foundry-2 computer at the MDF.

