LIHPC-Computational-Geometry / metis-rs

Idiomatic wrapper for METIS, the serial graph partitioner and fill-reducing matrix orderer
https://lihpc-computational-geometry.github.io/metis-rs/metis/
Apache License 2.0

Unexpected Behaviour when run concurrently #4

Closed Janekdererste closed 1 year ago

Janekdererste commented 1 year ago

Hi,

I want to use METIS in my Rust application. It works great and your binding is very nice to use. Thank you for providing it.

For some time now, one of the tests I wrote while learning the interface has been failing occasionally (hence it is set to ignore).

thread 'experiments::metis_test::tests::test_convert_example' panicked at 'assertion failed: `(left == right)`
  left: `[0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]`,
 right: `[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0]`', src/experiments/metis_test.rs:81:9

Although the unit test uses the same data every time, it sometimes passes and sometimes fails. The algorithm seems to produce different results between runs.

I also noticed that when I run my test suite on our HPC, the two tests in this module cause SIGSEGV errors from time to time.

I don't even know whether this is the right place to ask, but I thought I'll start at the very top of my stack trace 😬

cedricchevalier19 commented 1 year ago

Hi @Janekdererste, thank you for using the wrapper and for reporting this issue.

Concerning the run to run change, I think we can probably do a bit better by using METIS_OPTION_SEED.

For the concurrent runs, it is not clear to me that they will give the same results as running them sequentially. It has been a long time since I checked the Metis source code, but I think there may be some global variables for the random seed and so on. It is something we should check with the wrapper.

Concerning the SEGV, does it happen inside Metis or in your application? Scotch used to be more robust than Metis, and we have also written a Rust wrapper for it, so it may be worth giving it a try.

hhirtz commented 1 year ago

I tried to run your tests with metis' assertions enabled and these warnings popped up:

Missing edge: (1 0)!
Missing edge: (2 1)!
Missing edge: (3 2)!
A total of 3 errors exist in the input file. Correct them, and run again!
***ASSERTION failed on line 90 of file /metis/libmetis/graph.c: CheckGraph(graph, ctrl->numflag, 1)

I think the problem comes from the conversion from Vec<Node>/Vec<Link> to metis' format here

https://github.com/Janekdererste/rust_q_sim/blob/1e01dd7d9173fe389c453570d459e2feb94e9347/src/experiments/metis_test.rs#L33-L37

You also need to add the "node.in_links". Metis wants links to go both ways or it will not work.
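To make the "both ways" requirement concrete, here is a minimal sketch of a symmetry check over the CSR arrays. It is not Metis's own CheckGraph code; the function name `missing_reverse_edges` is hypothetical, and `usize` indices are an assumption made for illustration (the wrapper uses its own index type).

```rust
/// Returns every stored edge (u, v) whose reverse (v, u) is absent.
/// Assumes the METIS CSR layout: the neighbours of vertex `u` are
/// `adjncy[xadj[u]..xadj[u + 1]]`, and every entry of `adjncy` is a
/// valid vertex index.
fn missing_reverse_edges(xadj: &[usize], adjncy: &[usize]) -> Vec<(usize, usize)> {
    let num_vertices = xadj.len().saturating_sub(1);
    let mut missing = Vec::new();
    for u in 0..num_vertices {
        for &v in &adjncy[xadj[u]..xadj[u + 1]] {
            // Look for `u` in the neighbour list of `v`.
            let has_reverse = adjncy[xadj[v]..xadj[v + 1]].contains(&u);
            if !has_reverse {
                missing.push((u, v));
            }
        }
    }
    missing
}
```

For a directed path 0 -> 1 -> 2 -> 3 stored as `xadj = [0, 1, 2, 3, 3]` and `adjncy = [1, 2, 3]`, the function reports all three edges as lacking a reverse. An empty result means the adjacency structure is symmetric.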

To enable metis assertions you need to recompile it:

make config shared=1 assert=1 assert2=1 && make -j8

Janekdererste commented 1 year ago

Thanks @hhirtz for this advice. It was actually the other test that was causing the error. That one only had a directed graph, and the partition method didn't add the reverse edges to the adjncy list. Adding them fixes the SIGSEGV errors.
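The fix described here, adding the reverse of every directed edge before building the adjncy list, could look roughly like the sketch below. It is not the code from rust_q_sim; `symmetric_csr` is a hypothetical helper, `usize` indices are an assumption (they would still need converting to the wrapper's index type), and dropping self-loops and duplicate edges is a choice made for this sketch.

```rust
use std::collections::BTreeSet;

/// Builds CSR arrays (`xadj`, `adjncy`) from a directed edge list,
/// storing every edge in both directions.
/// Self-loops and duplicate edges are dropped (a choice made in this sketch).
fn symmetric_csr(num_vertices: usize, edges: &[(usize, usize)]) -> (Vec<usize>, Vec<usize>) {
    // One sorted, duplicate-free neighbour set per vertex.
    let mut neighbours = vec![BTreeSet::new(); num_vertices];
    for &(u, v) in edges {
        if u != v {
            neighbours[u].insert(v);
            neighbours[v].insert(u); // the reverse edge
        }
    }

    // Flatten the sets: xadj[u]..xadj[u + 1] indexes u's neighbours in adjncy.
    let mut xadj = Vec::with_capacity(num_vertices + 1);
    let mut adjncy = Vec::new();
    xadj.push(0);
    for set in &neighbours {
        adjncy.extend(set.iter().copied());
        xadj.push(adjncy.len());
    }
    (xadj, adjncy)
}
```

Calling `symmetric_csr(4, &[(0, 1), (1, 2), (2, 3)])` on a directed path yields `xadj = [0, 1, 3, 5, 6]` and `adjncy = [1, 0, 2, 1, 3, 2]`, so each link now appears once per endpoint.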

Janekdererste commented 1 year ago

Concerning the SEGV, are they inside Metis or in your application ?

Though this is probably not a problem anymore, I'll put it here anyway. The error happened inside Metis but was probably caused by the directed edges in my graph.

*** Error in `/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe': corrupted size vs. prev_size: 0x00002b7350000d20 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x80c37)[0x2b7336190c37]
/lib64/libc.so.6(+0x8120e)[0x2b733619120e]
/sw/numerics/metis/5.1.0/skl/gcc.9.2.0/lib/libmetis.so(gk_free+0xa3)[0x2b733537f5f3]
/sw/numerics/metis/5.1.0/skl/gcc.9.2.0/lib/libmetis.so(gk_mcoreDestroy+0x5c)[0x2b733537a0ec]
/sw/numerics/metis/5.1.0/skl/gcc.9.2.0/lib/libmetis.so(libmetis__FreeWorkSpace+0x16)[0x2b73353b4c36]
/sw/numerics/metis/5.1.0/skl/gcc.9.2.0/lib/libmetis.so(libmetis__FreeCtrl+0x15)[0x2b73353b3745]
/sw/numerics/metis/5.1.0/skl/gcc.9.2.0/lib/libmetis.so(METIS_PartGraphKway+0x24d)[0x2b733539efdd]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x282ef5)[0x5644bbdfeef5]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x1e8f8c)[0x5644bbd64f8c]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x18b60b)[0x5644bbd0760b]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x1db84a)[0x5644bbd5784a]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x15e2ee)[0x5644bbcda2ee]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x242ea3)[0x5644bbdbeea3]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x241b80)[0x5644bbdbdb80]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x20d1c4)[0x5644bbd891c4]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x212a17)[0x5644bbd8ea17]
/home/bekjanek/rust_q_sim/target/debug/deps/rust_q_sim-c40d1e37f5fb7efe(+0x347713)[0x5644bbec3713]
/lib64/libpthread.so.0(+0x7ea5)[0x2b73359f5ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b733620eb0d]

Janekdererste commented 1 year ago

Concerning the run to run change, I think we can probably do a bit better by using METIS_OPTION_SEED.

How would it be possible to use this flag, @cedricchevalier19?

hhirtz commented 1 year ago

Sorry for the late response. For the sake of posterity, METIS_OPTION_SEED and other options are accessible from the metis::option module. You can set options on the graph structure:

use metis::Graph;

// Two vertices joined by one undirected edge, stored in both directions (CSR).
let xadj = &mut [0, 1, 2];
let adjncy = &mut [1, 0];
let random_number = 1234;
let mut part = [0, 0];

Graph::new(1, 2, xadj, adjncy)
    .set_option(metis::option::Seed(random_number))
    .part_recursive(&mut part)
    .unwrap();

// The two vertices are placed in different parts.
assert_ne!(part[0], part[1]);

If you have other questions, feel free to open an issue.