nakuldhotre closed this issue 1 year ago.
The issue is reproducible on master (1.11.4-dev+mainnet) and I don't think this is a regression.
@nakuldhotre Thanks for reporting this, we'll take a look, though it may take some time as it's not easily reproducible on our test hardware.
Thanks @cryptonemo. If it helps, you can provide a branch or patch with extra logging and I will be happy to try it out on my setup.
+1
@cryptonemo in the interim would it be possible to make a simple change to add one additional debug log here: https://github.com/filecoin-project/rust-fil-proofs/blob/master/filecoin-proofs/src/pieces.rs#L34
It'd be helpful for us to know what the commitment calculated from the pieces is versus the expected commitment. The issue we're seeing is that the compute-commitment function itself does not appear to be failing.
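As a sketch of the kind of logging meant here (a hypothetical helper, not the actual pieces.rs code), rendering both commitments as hex makes a mismatch much easier to eyeball than the raw byte arrays:

```rust
// Hypothetical helper: render a 32-byte commitment as lowercase hex so the
// calculated and expected comm_d can be compared at a glance in the logs.
fn comm_to_hex(comm: &[u8; 32]) -> String {
    comm.iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    // Placeholder values; in pieces.rs these would be the real commitments.
    let comm_d_calculated = [0x8bu8; 32];
    let comm_d_expected = [0xfeu8; 32];
    // In the real code this would be a debug!/info! call via the log crate;
    // println! stands in here to keep the sketch self-contained.
    println!("comm_d_calculated: {}", comm_to_hex(&comm_d_calculated));
    println!("comm_d_expected:   {}", comm_to_hex(&comm_d_expected));
}
```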
Sure, try it out locally and see if you can find a useful balance. Bonus points for a PR.
@cryptonemo we added extra logging around the failure and are seeing that the expected commitment and the commitment calculated from the pieces are completely different. Is there anything telling from the commitments about what the issue may be and where we should focus our debugging efforts? Since we are relying on the parent cache, is it possible it gets into a bad state due to network congestion and packet drops?
2021-11-20T01:03:34.052 INFO storage_proofs_porep::stacked::vanilla::proof > replicate_phase1
2021-11-20T01:03:34.052 INFO storage_proofs_porep::stacked::vanilla::graph > using parent_cache[2048 / 2147483648]
2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > verifying pieces
2021-11-20T01:03:35.374 INFO filecoin_proofs::pieces > verifying 1 pieces
2021-11-20T01:03:35.374 INFO filecoin_proofs::pieces > comm_d_calculated: [139, 165, 171, 137, 136, 221, 133, 105, 85, 38, 241, 64, 36, 150, 220, 153, 50, 210, 124, 15, 135, 226, 11, 159, 195, 125, 255, 116, 188, 108, 96, 35]
2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > Verification Information:
2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > comm_d: [254, 140, 12, 167, 200, 72, 54, 54, 45, 240, 46, 209, 170, 220, 63, 137, 94, 66, 113, 91, 157, 65, 51, 183, 36, 132, 202, 26, 128, 86, 100, 50]
2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > piece_infos: [
PieceInfo {
commitment: "8ba5ab8988dd85695526f1402496dc9932d27c0f87e20b9fc37dff74bc6c6023",
size: UnpaddedBytesAmount(
68182605824,
),
},
]
2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > sector_size: SectorSize(
68719476736,
)
2021-11-20T01:03:35.374 INFO filcrypto::proofs::api > seal_pre_commit_phase1: finish
2021-11-20T01:08:28.470-0800 WARN lotus-bench lotus-bench/main.go:115 failed to run seals:
main.glob..func3
/home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:274
- commit:
main.runSeals.func1.1
/home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:566
- presealing sector 1 (/mnt/md0/bench126024744/unsealed/s-t01000-1):
github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).SealPreCommit1
/home/localadmin/Projects/filecoin/lotus/extern/sector-storage/ffiwrapper/sealer_cgo.go:531
- pieces and comm_d do not match
github.com/filecoin-project/filecoin-ffi.SealPreCommitPhase1
/home/localadmin/Projects/filecoin/lotus/extern/filecoin-ffi/proofs.go:358
github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).SealPreCommit1
/home/localadmin/Projects/filecoin/lotus/extern/sector-storage/ffiwrapper/sealer_cgo.go:520
main.runSeals.func1.1
/home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:564
main.runSeals.func1
/home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:685
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1581
@cryptonemo wanted to bump this with some extra information.
Adding a retry around the seal phase of an individual sector resolves the issue. If sealing fails on the first attempt, the sector is sealed again and it works on the subsequent try.
Logs that we have added for details:
Writing a piece into Sector 5 to generate a data commitment
2021-12-04T13:09:01.496-0800 INFO lotus-bench lotus-bench/main.go:526 [5] Writing piece into sector...
2021-12-04T13:48:52.729 INFO filcrypto::proofs::api > generate_data_commitment: start
2021-12-04T13:48:53.094 INFO filcrypto::proofs::api > response from GenerateDataCommitment is fil_GenerateDataCommitmentResponse {
status_code: FCPNoError,
error_msg: 0x0000000000000000,
comm_d: [221, 26, 120, 173, 34, 85, 2, 204, 31, 92, 43, 35, 19, 30, 12, 216, 70, 33, 251, 164, 44, 60, 119, 159, 230, 165, 126, 205, 65, 219, 73, 39],
}
2021-12-04T13:48:53.094 INFO filcrypto::proofs::api > generate_data_commitment: finish
The 1st attempt fails to calculate the expected data commitment when building the merkle tree directly from the data in Sector 5.
2021-12-04T19:00:55.673 INFO storage_proofs_porep::stacked::vanilla::proof > replicate_phase1
2021-12-04T19:00:55.673 INFO storage_proofs_porep::stacked::vanilla::graph > using parent_cache[2048 / 2147483648]
2021-12-04T19:02:03.582 INFO filecoin_proofs::api::seal > the root commitment: Scalar(0x39bc2c9679dda9f5c22c3a0d41b948c496514c5b3da67968606ff0cb3e39c478)
2021-12-04T19:02:03.582 INFO filecoin_proofs::api::seal > the commitment from root: [120, 196, 57, 62, 203, 240, 111, 96, 104, 121, 166, 61, 91, 76, 81, 150, 196, 72, 185, 65, 13, 58, 44, 194, 245, 169, 221, 121, 150, 44, 188, 57]
2021-12-04T19:02:03.589 INFO filecoin_proofs::pieces > comm_d_calculated: [221, 26, 120, 173, 34, 85, 2, 204, 31, 92, 43, 35, 19, 30, 12, 216, 70, 33, 251, 164, 44, 60, 119, 159, 230, 165, 126, 205, 65, 219, 73, 39]
2021-12-04T19:02:03.589 INFO filecoin_proofs::api::seal > Verification Information:
2021-12-04T19:02:03.589 INFO filecoin_proofs::api::seal > comm_d: [120, 196, 57, 62, 203, 240, 111, 96, 104, 121, 166, 61, 91, 76, 81, 150, 196, 72, 185, 65, 13, 58, 44, 194, 245, 169, 221, 121, 150, 44, 188, 57]
2021-12-04T19:02:03.589 INFO filecoin_proofs::api::seal > piece_infos: [
PieceInfo {
commitment: "dd1a78ad225502cc1f5c2b23131e0cd84621fba42c3c779fe6a57ecd41db4927",
size: UnpaddedBytesAmount(
68182605824,
),
},
]
2021-12-04T19:02:03.590 INFO filecoin_proofs::api::seal > sector_size: SectorSize(
68719476736,
)
2021-12-04T19:02:03.590 INFO filcrypto::proofs::api > seal_pre_commit_phase1: finish
2021-12-04T19:02:03.623-0800 ERROR lotus-bench lotus-bench/main.go:579 [5] Error %!w(*xerrors.wrapError=&{presealing sector 5 (/mnt/md0/bench3191370624/unsealed/s-t01000-5) 0xc000630030 {[41116210 43448743 55682358]}})
Retry is kicked off to attempt to seal Sector 5 again.
2021-12-04T19:02:03.628-0800 INFO lotus-bench lotus-bench/main.go:573 [5] Running replication retry count 2 ...
2021-12-04T19:02:03.629-0800 WARN ffiwrapper ffiwrapper/sealer_cgo.go:492 existing cache in /mnt/md0/bench3191370624/cache/s-t01000-5; removing
2021-12-04T19:02:18.162 INFO filcrypto::proofs::api > seal_pre_commit_phase1: start
2021-12-04T19:02:18.162 INFO filecoin_proofs::api::seal > seal_pre_commit_phase1:start: SectorId(5)
2021-12-04T19:02:18.162 INFO filecoin_proofs::api::seal > copy from: /mnt/md0/bench3191370624/unsealed/s-t01000-5 to: /mnt/md0/bench3191370624/sealed/s-t01000-5
The 2nd attempt succeeds in calculating the data commitment by building the merkle tree directly from the data in Sector 5.
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > the root commitment: Scalar(0x2749db41cd7ea5e69f773c2ca4fb2146d80c1e13232b5c1fcc025522ad781add)
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > the commitment from root: [221, 26, 120, 173, 34, 85, 2, 204, 31, 92, 43, 35, 19, 30, 12, 216, 70, 33, 251, 164, 44, 60, 119, 159, 230, 165, 126, 205, 65, 219, 73, 39]
2021-12-04T20:15:28.244 INFO filecoin_proofs::pieces > comm_d_calculated: [221, 26, 120, 173, 34, 85, 2, 204, 31, 92, 43, 35, 19, 30, 12, 216, 70, 33, 251, 164, 44, 60, 119, 159, 230, 165, 126, 205, 65, 219, 73, 39]
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > Verification Information:
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > comm_d: [221, 26, 120, 173, 34, 85, 2, 204, 31, 92, 43, 35, 19, 30, 12, 216, 70, 33, 251, 164, 44, 60, 119, 159, 230, 165, 126, 205, 65, 219, 73, 39]
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > piece_infos: [
PieceInfo {
commitment: "dd1a78ad225502cc1f5c2b23131e0cd84621fba42c3c779fe6a57ecd41db4927",
size: UnpaddedBytesAmount(
68182605824,
),
},
]
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > sector_size: SectorSize(
68719476736,
)
It still seems like we can't rule out 'bad' hardware with flipped bits.
@cryptonemo can you elaborate further on that statement. If it were a hardware issue wouldn't it fail consistently?
With it failing intermittently, it seems more like an issue in the benchmark itself with the memory-mapped files, especially given that we see no errors on our hardware or in the dmesg logs on the host where the benchmark is running.
No, not necessarily. If you have 'bad' disk or RAM, it can fail one time and be ok the next. Repeated testing (e.g. memtest86/memtester) will usually surface the errors over time. Disk errors can usually be seen in system logs (i.e. dmesg/syslog).
Given that there had been past fixes to put in flushes around mmap to resolve similar issues, could we not be running into a similar problem here? This was the other fix put in: https://github.com/filecoin-project/rust-fil-proofs/pull/1493
The reason I am pessimistic about it being a hardware issue is that we only see this issue if we explicitly simulate network congestion via packet drops through iptables rules. Without setting the iptables rules we never run into this issue. We also do not see any issues in the dmesg logs, and memory is nowhere near max utilization throughout the benchmark.
Ok, I'm not saying it's certainly a hardware issue, but that it can't be ruled out just yet. I suspect a retry won't always work though.
@cryptonemo that's fair, in each of our runs where the benchmark has failed the retry has 100% of the times worked successfully on sealing the sector.
Based on the information provided, are there any areas in particular we could focus our efforts on with respect to the sealing code (the merkle tree generation)? We have already tried doing an mmap flush directly after the merkle tree generation and it did not have any effect, as we still reproduced the failure.
Not sure. In the most paranoid case, flushing after each update may help (I'm curious if that still sees the error, while not really practical). It depends on how your system is stressed and what's going on, I suppose.
@cryptonemo we are willing to try that out.
To make sure we apply the flush to the correct location, would you mind pointing me to the area where each update is being executed? I want to make sure we would set it in the right spot.
I don't know where you're seeing the error (which tree builder is causing it?)
This is where the error happens: https://github.com/filecoin-project/rust-fil-proofs/blob/master/filecoin-proofs/src/api/seal.rs#L153
After the merkle tree is generated the data commitment generated from the tree fails when verified against the commitment retrieved from the piece in the sector.
Ok, the method is located here: https://github.com/filecoin-project/rust-fil-proofs/blob/master/storage-proofs-core/src/merkle/builders.rs#L177
The place you probably want to update is here: https://github.com/filecoin-project/merkletree/blob/master/src/store/disk.rs#L391
Thanks @cryptonemo we will flush after each mmap write and report our findings.
The following changes are going to be tested:
~/Projects/filecoin/merkletree$ git diff
diff --git a/rust-toolchain b/rust-toolchain
index bb68f08..bf867e0 100644
--- a/rust-toolchain
+++ b/rust-toolchain
@@ -1 +1 @@
-nightly-2020-01-26
+nightly
diff --git a/src/store/disk.rs b/src/store/disk.rs
index b482d9b..27cbb5a 100644
--- a/src/store/disk.rs
+++ b/src/store/disk.rs
@@ -390,6 +390,9 @@ impl<E: Element> Store<E> for DiskStore<E> {
write_mmap[0..hashed_nodes_as_bytes_len].copy_from_slice(&hashed_nodes_as_bytes);
+ write_mmap.as_mut().flush()
+ .with_context(|| format!("could not flush memory write_mmap"))?;
+
Ok(())
})
}
@cryptonemo the benchmark still failed if we flushed on every mmap update.
A couple of observations from the benchmark run:
~/Projects/filecoin-log$ grep -i "try" job_run_log-2021-12-17-10-51-40
2021-12-17T14:54:05.528-0800 INFO lotus-bench lotus-bench/main.go:573 [5] Running replication try count 1 ...
2021-12-17T14:54:05.528-0800 INFO lotus-bench lotus-bench/main.go:573 [2] Running replication try count 1 ...
2021-12-17T14:54:05.528-0800 INFO lotus-bench lotus-bench/main.go:573 [3] Running replication try count 1 ...
2021-12-17T14:54:05.528-0800 INFO lotus-bench lotus-bench/main.go:573 [4] Running replication try count 1 ...
2021-12-17T14:54:05.528-0800 INFO lotus-bench lotus-bench/main.go:573 [1] Running replication try count 1 ...
2021-12-17T14:54:05.528-0800 INFO lotus-bench lotus-bench/main.go:573 [0] Running replication try count 1 ...
2021-12-17T20:56:54.919-0800 INFO lotus-bench lotus-bench/main.go:573 [4] Running replication try count 2 ...
2021-12-17T20:56:55.841-0800 INFO lotus-bench lotus-bench/main.go:573 [1] Running replication try count 2 ...
2021-12-17T20:56:55.852-0800 INFO lotus-bench lotus-bench/main.go:573 [2] Running replication try count 2 ...
2021-12-17T21:01:39.156-0800 INFO lotus-bench lotus-bench/main.go:573 [3] Running replication try count 2 ...
2021-12-18T02:18:58.588 INFO storage_proofs_porep::stacked::vanilla::cache > [open] parent cache: calculated consistency digest: "025b37777f4c8eb6df528397da44f2c719c6a8a7f0b310cda2315bc7f4b94d6c"
2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > [!!!] Parent cache digest mismatch detected. Regenerating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache
2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > parent cache: generating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache
We have not seen a benchmark fail if we have cleaned up the benchmark data from previous runs. However, if we do not perform any cleanup, we consistently run into the data commitment issue. Is it possible the cache data from previous benchmark runs could be causing issues with subsequent runs?
@cryptonemo Looking further at the parent cache code and our finding that the issue always occurs when the parent cache exists from a previous run: would it make sense to see if flushing its mmap as well helps? https://github.com/filecoin-project/rust-fil-proofs/blob/master/storage-proofs-porep/src/stacked/vanilla/cache.rs#L244
This check also does not seem right to me, as it is not guaranteed that parent cache generation has completed successfully just because the cache file exists. We have killed benchmarks in the middle of runs, before generation had completed, and the same path is used for the parent cache in the subsequent run: https://github.com/filecoin-project/rust-fil-proofs/blob/master/storage-proofs-porep/src/stacked/vanilla/cache.rs#L166
if path.exists() {
// If the cache file exists and we've got the lock, generation has previously been
// completed. Insert that it no longer needs generation at this point unconditionally.
if generated.get(&generation_key).is_none() {
generated.insert(generation_key);
}
Self::open(len, cache_entries, graph, &path)
}
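If partially generated cache files are indeed the problem, one conventional remedy (an assumption about a possible fix, not something the quoted code does) is to generate the cache into a temporary path and publish it with an atomic rename, so that `path.exists()` can never observe a half-written file. A minimal sketch with a hypothetical `generate` callback:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Sketch: write the cache to `<path>.tmp`, sync it, then atomically rename it
// into place. A crash mid-generation leaves only the .tmp file behind, so the
// existence check on `path` never trusts an incomplete cache.
// `generate` is a hypothetical stand-in for real parent-cache generation.
fn generate_cache_atomically(
    path: &Path,
    generate: impl FnOnce(&mut fs::File) -> std::io::Result<()>,
) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    generate(&mut f)?;
    f.sync_all()?; // flush data to disk before publishing the file
    fs::rename(&tmp, path)?; // atomic on the same filesystem
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("parent_cache_sketch.bin");
    generate_cache_atomically(&path, |f| f.write_all(b"cache bytes"))?;
    assert_eq!(fs::read(&path)?, b"cache bytes");
    fs::remove_file(&path)?;
    Ok(())
}
```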
The last question I have: is there a way to specify a unique parent cache key for each benchmark? Based on what we are observing, this sounds very similar to this issue, where the only way to proceed after the parent cache gets into a bad state is to remove it: https://github.com/filecoin-project/rust-fil-proofs/issues/1264#issuecomment-678428428
There is a cache verification setting that you should probably be running with enabled if you're seeing consistency issues.
Related work to address this: https://github.com/filecoin-project/rust-fil-proofs/pull/1265 https://github.com/filecoin-project/rust-fil-proofs/pull/1496
@cryptonemo we are running the benchmark with the following configurations, the last of which is doing the cache verification:
export FIL_PROOFS_USE_MULTICORE_SDR=0
export FIL_PROOFS_MAXIMIZE_CACHE=0
export FIL_PROOFS_MAXIMIZE_CACHING=0
export FIL_PROOFS_VERIFY_CACHE=1
This seems to be the cause of our benchmark failing consistently. If the parent cache file path exists but generation had not completed, we run into the pieces and comm_d do not match error in seal pre commit phase 1.
Hrm, ok. And in the logging, do you ever see a failure on the parent cache and it being regenerated during the run?
Yes, but only after the data commitments had failed and the retry we implemented had succeeded. Once the retry kicked in and succeeded on the data commitment verification for sectors that originally failed, we see the following messages:
2021-12-18T02:18:58.588 INFO storage_proofs_porep::stacked::vanilla::cache > [open] parent cache: calculated consistency digest: "025b37777f4c8eb6df528397da44f2c719c6a8a7f0b310cda2315bc7f4b94d6c"
2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > [!!!] Parent cache digest mismatch detected. Regenerating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache
2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > parent cache: generating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache
However, seeing these messages is not possible unless the retries are implemented.
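For illustration, the digest check behind that [!!!] log line can be sketched as hashing the cache bytes and comparing against a recorded digest, regenerating on mismatch. Note this is only a shape sketch: std's DefaultHasher stands in for the real digest function, which it does not attempt to reproduce.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hash the cache bytes and compare against the recorded digest. A mismatch
// means the on-disk cache is torn or corrupted and should be regenerated
// rather than sealed against.
fn cache_is_consistent(cache_bytes: &[u8], recorded_digest: u64) -> bool {
    let mut h = DefaultHasher::new();
    h.write(cache_bytes);
    h.finish() == recorded_digest
}

fn main() {
    let bytes: &[u8] = b"parent cache contents";
    let mut h = DefaultHasher::new();
    h.write(bytes);
    let digest = h.finish();

    assert!(cache_is_consistent(bytes, digest));
    // A single altered byte (e.g. a torn write) fails the check and would
    // trigger regeneration of the cache file.
    assert!(!cache_is_consistent(b"parent cache c0ntents", digest));
    println!("digest mismatch detected for altered bytes");
}
```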
Ok, so you think there's still a bug in the parent cache generation, or that there's a bug that doesn't allow recovering properly when a bad cache is detected? Would you mind posting an issue or PR with your findings and the code that you've got working? We can review it properly in the new year.
@cryptonemo I'm not definitively sure what is causing it, but there seems to be a correlation between the parent cache existing in a bad state and the seal pre commit 1 phase failing.
The retry logic actually lives in the lotus code and can be seen here. These are the main changes that circumvent the issue, as adding the mmap flushes in this repo did not appear to have any positive effect.
~/Projects/filecoin/lotus$ git diff --cached
diff --git a/cmd/lotus-bench/main.go b/cmd/lotus-bench/main.go
index 0b8ec6fe3..ed600a9f1 100644
--- a/cmd/lotus-bench/main.go
+++ b/cmd/lotus-bench/main.go
@@ -559,9 +559,25 @@ func runSeals(sb *ffiwrapper.Sealer, sbfs *basicfs.Provider, numSectors int, par
trand := blake2b.Sum256(ticketPreimage)
ticket := abi.SealRandomness(trand[:])
+ var c int
+ var pc1o storage.PreCommit1Out
+ var err error
+
log.Infof("[%d] Running replication(1)...", i)
piece := []abi.PieceInfo{pieces[i]}
- pc1o, err := sb.SealPreCommit1(context.TODO(), sid, ticket, piece)
+
+ for {
+ if c >= 3 {
+ break
+ }
+ log.Infof("[%d] Running replication try count %d ...", i, c+1)
+ pc1o, err = sb.SealPreCommit1(context.TODO(), sid, ticket, piece)
+ if err == nil {
+ break
+ }
+ c++
+ log.Errorf("[%d] Error %w", i, err)
+ }
if err != nil {
return xerrors.Errorf("commit: %w", err)
}
Interesting, ok thanks. We'll have to try to get to the bottom of this and find a way to reproduce this within proofs.
@cryptonemo if it helps here is the script we are executing to initiate benchmarks that run into the issue:
#!/bin/bash -xe
cd $(dirname $0)
PREFIX=${PREFIX:-/mnt}
MD_PATH=${MD_PATH:-${PREFIX}/md0}
export FIL_PROOFS_PARENT_CACHE=${PARENT_CACHE:-${PREFIX}/parent_cache/64gib/}
export FIL_PROOFS_PARAMETER_CACHE=${PARAM_CACHE:-${PREFIX}/param_cache/64gib/}
export FIL_PROOFS_USE_MULTICORE_SDR=0
export FIL_PROOFS_MAXIMIZE_CACHE=0
export FIL_PROOFS_MAXIMIZE_CACHING=0
export FIL_PROOFS_VERIFY_CACHE=1
#export RUST_LOG=TRACE
export RUST_LOG=DEBUG
export TMPDIR=$(dirname ${FIL_PROOFS_PARENT_CACHE})/tmp/
mkdir -p ${FIL_PROOFS_PARENT_CACHE}
mkdir -p ${FIL_PROOFS_PARAMETER_CACHE}
mkdir -p ${TMPDIR}
logPath="../$(basename $(pwd))-log"
mkdir -p $logPath
dt=$(date "+%F-%H-%M-%S")
echo Run WITH packet drop | tee ${logPath}/job_run_log-$dt
drop_pers="0.01 0.02 0.03 0.04 0.05"
for p in $drop_pers ;do
iptables -A INPUT -i enp129s0 -m statistic --mode random --probability $p -j DROP
iptables -A OUTPUT -o enp129s0 -m statistic --mode random --probability $p -j DROP
done
on_exit()
{
rc=$?
echo "$0: Shutting down.. cleaning up"
set +e
if [[ -z $drop_pers ]];then
return $rc
fi
for p in $drop_pers ;do
iptables -D INPUT -i enp129s0 -m statistic --mode random --probability $p -j DROP
iptables -D OUTPUT -o enp129s0 -m statistic --mode random --probability $p -j DROP
done
return $rc
}
trap on_exit 0 1 2 3 6 15
dmesg > $logPath/demsg-before-$dt
cd lotus
time cgexec \
-g memory:lotusbench numactl \
--interleave=all ./lotus-bench sealing \
--storage-dir ${MD_PATH} \
--sector-size 64GiB \
--no-gpu \
--skip-commit2 \
--skip-unseal \
--parallel 6 \
--num-sectors 6 2>&1 | tee -a ../${logPath}/job_run_log-$dt
cd -
dmesg > $logPath/demsg-after-$dt
@cryptonemo did you have any chance to investigate this further?
Unfortunately, not yet
@cryptonemo checking in again. Any updates?
@cryptonemo bumping this again. Any updates?
@mnachmia31 We still don't have a way to reproduce this -- thus no progress on the issue.
@cryptonemo it's been a bit of time. Have you been able to reproduce this on your end yet?
@cryptonemo it's been a bit of time. Have you been able to reproduce this on your end yet?
No, I haven't seen this consistently here at all. We regularly test on locally attached storage (primarily) as well as network attached storage (mostly when we run out of space locally).
There doesn't seem to be a way to reproduce this issue reliably, hence I'm closing this issue. If someone finds a way to reproduce, please re-open this issue.
Description
On an NVMe-over-fabrics based storage system, I have seen lotus-bench failing around the 50-60 minute mark. This happens more often when the network is congested. If the failure is not seen during that time frame, lotus-bench finishes without any other errors.
Environment
Failure
I suspect this is happening because a file is not flushed before it is being reopened by the next step.
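That suspicion, restated with plain files (mmap omitted so the sketch stays self-contained): the step that writes a file should sync it to disk before the next step reopens it. A minimal sketch of that hand-off discipline:

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::path::Path;

// Write a file, sync it to stable storage, and close it before the "next
// step" reopens it, so the reader is guaranteed to see the writer's bytes.
fn write_then_reopen(path: &Path, data: &[u8]) -> std::io::Result<Vec<u8>> {
    {
        let mut f = File::create(path)?;
        f.write_all(data)?;
        f.sync_all()?; // flush to disk before handing the file off
    } // file closed here, before the reopen below

    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("tree_handoff_sketch.bin");
    let out = write_then_reopen(&path, b"merkle tree bytes")?;
    assert_eq!(out, b"merkle tree bytes");
    std::fs::remove_file(&path)?;
    Ok(())
}
```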
Acceptance criteria
lotus-bench should not fail
Risks + pitfalls
Where to begin
64gib_raid0_6threads_ram818g_packetdrops_nic25g_4oct_run5.txt