filecoin-project / rust-fil-proofs

Proofs for Filecoin in Rust

lotus-bench fails with 'pieces and comm_d do not match' error #1521

Closed: nakuldhotre closed this issue 1 year ago

nakuldhotre commented 3 years ago

Description

On an NVMe over Fabrics based storage system, I have seen lotus-bench fail around the 50-60 minute mark. This happens more often when the network is congested. If the failure does not occur during that window, lotus-bench finishes without any errors.

Environment

export FIL_PROOFS_PARENT_CACHE=/mnt/parent_cache/64gib/; export FIL_PROOFS_PARAMETER_CACHE=/mnt/param_cache/64gib/; export FIL_PROOFS_USE_MULTICORE_SDR=1; export TMPDIR=/mnt/param_cache/tmp/; export RUST_LOG=debug

cgexec -g memory:lotusbench numactl --interleave=all ./lotus-bench sealing --storage-dir /mnt/md0 --sector-size 64GiB --no-gpu --skip-commit2 --skip-unseal --parallel 6 --num-sectors 6 >/home/localadmin/64gib_raid0_6threads_ram818g_packetdrops_nic25g_4oct_run5.txt 2>&1

Failure

    main.glob..func3
        /home/localadmin/lotus/cmd/lotus-bench/main.go:274
  - commit:
    main.runSeals.func1.1
        /home/localadmin/lotus/cmd/lotus-bench/main.go:566
  - presealing sector 1 (/mnt/md0/bench159807454/unsealed/s-t01000-1):
    github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).SealPreCommit1
        /home/localadmin/lotus/extern/sector-storage/ffiwrapper/sealer_cgo.go:531
  - pieces and comm_d do not match
    github.com/filecoin-project/filecoin-ffi.SealPreCommitPhase1
        /home/localadmin/lotus/extern/filecoin-ffi/proofs.go:358
    github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).SealPreCommit1
        /home/localadmin/lotus/extern/sector-storage/ffiwrapper/sealer_cgo.go:520
    main.runSeals.func1.1
        /home/localadmin/lotus/cmd/lotus-bench/main.go:564
    main.runSeals.func1
        /home/localadmin/lotus/cmd/lotus-bench/main.go:685
    runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581

I suspect this is happening because a file is not flushed before being reopened by the next step.
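
For illustration, a minimal sketch (not rust-fil-proofs code) of what flushing mmap'd writes before the file is reopened would look like, using the memmap crate; the function and path handling here are hypothetical:

use std::fs::OpenOptions;
use std::io;
use std::path::Path;

use memmap::MmapMut;

// Write data through an mmap and make sure it is durable before anything
// reopens the file. Illustrative sketch only.
fn write_then_flush(path: &Path, data: &[u8]) -> io::Result<()> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    file.set_len(data.len() as u64)?;
    let mut mmap = unsafe { MmapMut::map_mut(&file)? };
    mmap[..data.len()].copy_from_slice(data);
    mmap.flush()?; // push dirty mmap pages back to the file
    file.sync_all()?; // fsync so the next open sees the data on stable storage
    Ok(())
}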

Acceptance criteria

lotus-bench should not fail

Risks + pitfalls

Where to begin

64gib_raid0_6threads_ram818g_packetdrops_nic25g_4oct_run5.txt

nakuldhotre commented 3 years ago

The issue is reproducible on master (1.11.4-dev+mainnet) and I don't think this is a regression.

cryptonemo commented 3 years ago

@nakuldhotre Thanks for reporting this, we'll take a look, though it may take some time as it's not easily reproducible on our test hardware.

nakuldhotre commented 3 years ago

Thanks @cryptonemo. If it helps, you can provide a branch or patch with extra logging and I will be happy to try it out on my setup.

mnachmia31 commented 2 years ago

+1

@cryptonemo in the interim, would it be possible to make a simple change to add one additional debug log here: https://github.com/filecoin-project/rust-fil-proofs/blob/master/filecoin-proofs/src/pieces.rs#L34

It would be helpful for us to know what the commitment calculated from the pieces is versus the expected commitment. What makes this confusing is that the compute-commitment function itself does not appear to have any issues.
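
For reference, a minimal sketch of the kind of logging we have in mind, assuming the rough shape of verify_pieces in filecoin-proofs/src/pieces.rs (the exact signature and surrounding code may differ from the current source):

use anyhow::Result;
use log::info;

// Commitment, PieceInfo, SectorSize and compute_comm_d are assumed to come
// from the surrounding filecoin_proofs modules; the added lines are the two
// info! calls comparing the calculated and expected commitments.
pub fn verify_pieces(
    comm_d: &Commitment,
    piece_infos: &[PieceInfo],
    sector_size: SectorSize,
) -> Result<bool> {
    let comm_d_calculated = compute_comm_d(sector_size, piece_infos)?;

    info!("comm_d_calculated (from pieces): {:?}", comm_d_calculated);
    info!("comm_d expected (from caller): {:?}", comm_d);

    Ok(&comm_d_calculated == comm_d)
}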

cryptonemo commented 2 years ago

> +1
>
> @cryptonemo in the interim, would it be possible to make a simple change to add one additional debug log here: https://github.com/filecoin-project/rust-fil-proofs/blob/master/filecoin-proofs/src/pieces.rs#L34
>
> It would be helpful for us to know what the commitment calculated from the pieces is versus the expected commitment. What makes this confusing is that the compute-commitment function itself does not appear to have any issues.

Sure, try it out locally and see if you can find a useful balance. Bonus points for a PR.

mnachmia31 commented 2 years ago

@cryptonemo we added extra logging around the failure and are seeing that the expected commitment and the commitment calculated from the pieces are completely different. Is there anything telling in the commitments about what the issue may be, or where we should focus our debugging efforts? Since we rely on the parent cache, is it possible it gets into a bad state due to network congestion and packet drops?

2021-11-20T01:03:34.052 INFO storage_proofs_porep::stacked::vanilla::proof > replicate_phase1
2021-11-20T01:03:34.052 INFO storage_proofs_porep::stacked::vanilla::graph > using parent_cache[2048 / 2147483648]
2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > verifying pieces
2021-11-20T01:03:35.374 INFO filecoin_proofs::pieces > verifying 1 pieces
2021-11-20T01:03:35.374 INFO filecoin_proofs::pieces > comm_d_calculated: [
    139,
    165,
    171,
    137,
    136,
    221,
    133,
    105,
    85,
    38,
    241,
    64,
    36,
    150,
    220,
    153,
    50,
    210,
    124,
    15,
    135,
    226,
    11,
    159,
    195,
    125,
    255,
    116,
    188,
    108,
    96,
    35,
]

2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > Verification Information:
2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > comm_d: [
    254,
    140,
    12,
    167,
    200,
    72,
    54,
    54,
    45,
    240,
    46,
    209,
    170,
    220,
    63,
    137,
    94,
    66,
    113,
    91,
    157,
    65,
    51,
    183,
    36,
    132,
    202,
    26,
    128,
    86,
    100,
    50,
]

2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > piece_infos: [
    PieceInfo {
        commitment: "8ba5ab8988dd85695526f1402496dc9932d27c0f87e20b9fc37dff74bc6c6023",
        size: UnpaddedBytesAmount(
            68182605824,
        ),
    },
]

2021-11-20T01:03:35.374 INFO filecoin_proofs::api::seal > sector_size: SectorSize(
    68719476736,
)

2021-11-20T01:03:35.374 INFO filcrypto::proofs::api > seal_pre_commit_phase1: finish
2021-11-20T01:08:28.470-0800    WARN    lotus-bench lotus-bench/main.go:115 failed to run seals:
    main.glob..func3
        /home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:274
  - commit:
    main.runSeals.func1.1
        /home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:566
  - presealing sector 1 (/mnt/md0/bench126024744/unsealed/s-t01000-1):
    github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).SealPreCommit1
        /home/localadmin/Projects/filecoin/lotus/extern/sector-storage/ffiwrapper/sealer_cgo.go:531
  - pieces and comm_d do not match
    github.com/filecoin-project/filecoin-ffi.SealPreCommitPhase1
        /home/localadmin/Projects/filecoin/lotus/extern/filecoin-ffi/proofs.go:358
    github.com/filecoin-project/lotus/extern/sector-storage/ffiwrapper.(*Sealer).SealPreCommit1
        /home/localadmin/Projects/filecoin/lotus/extern/sector-storage/ffiwrapper/sealer_cgo.go:520
    main.runSeals.func1.1
        /home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:564
    main.runSeals.func1
        /home/localadmin/Projects/filecoin/lotus/cmd/lotus-bench/main.go:685
    runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581

mnachmia31 commented 2 years ago

@cryptonemo wanted to bump this with some extra information.

Adding a retry around the seal phase of an individual sector resolves the issue. If sealing fails on the first attempt, the sector is sealed again and it succeeds on the subsequent try.
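
For illustration, the bounded-retry idea as a sketch (the actual retry we used is the Go change to lotus-bench quoted later in this thread, not this function):

// Retry a fallible operation up to max_attempts times, logging each failure.
fn with_retries<T, E: std::fmt::Debug>(
    max_attempts: usize,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts => {
                eprintln!("attempt {} failed: {:?}; retrying", attempt, err);
            }
            Err(err) => return Err(err),
        }
    }
}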

Logs that we have added for details:

Writing a piece into Sector 5 to generate a data commitment

2021-12-04T13:09:01.496-0800    INFO    lotus-bench     lotus-bench/main.go:526 [5] Writing piece into sector...
2021-12-04T13:48:52.729 INFO filcrypto::proofs::api > generate_data_commitment: start
2021-12-04T13:48:53.094 INFO filcrypto::proofs::api > response from GenerateDataCommitment is fil_GenerateDataCommitmentResponse {
    status_code: FCPNoError,
    error_msg: 0x0000000000000000,
    comm_d: [
        221,
        26,
        120,
        173,
        34,
        85,
        2,
        204,
        31,
        92,
        43,
        35,
        19,
        30,
        12,
        216,
        70,
        33,
        251,
        164,
        44,
        60,
        119,
        159,
        230,
        165,
        126,
        205,
        65,
        219,
        73,
        39,
    ],
}
2021-12-04T13:48:53.094 INFO filcrypto::proofs::api > generate_data_commitment: finish

The 1st attempt fails to calculate a matching data commitment when building the merkle tree directly from the data in Sector 5.

2021-12-04T19:00:55.673 INFO storage_proofs_porep::stacked::vanilla::proof > replicate_phase1
2021-12-04T19:00:55.673 INFO storage_proofs_porep::stacked::vanilla::graph > using parent_cache[2048 / 2147483648]
2021-12-04T19:02:03.582 INFO filecoin_proofs::api::seal > the root commitment: Scalar(0x39bc2c9679dda9f5c22c3a0d41b948c496514c5b3da67968606ff0cb3e39c478)
2021-12-04T19:02:03.582 INFO filecoin_proofs::api::seal > the commitment from root: [
    120,
    196,
    57,
    62,
    203,
    240,
    111,
    96,
    104,
    121,
    166,
    61,
    91,
    76,
    81,
    150,
    196,
    72,
    185,
    65,
    13,
    58,
    44,
    194,
    245,
    169,
    221,
    121,
    150,
    44,
    188,
    57,
]
2021-12-04T19:02:03.589 INFO filecoin_proofs::pieces > comm_d_calculated: [
    221,
    26,
    120,
    173,
    34,
    85,
    2,
    204,
    31,
    92,
    43,
    35,
    19,
    30,
    12,
    216,
    70,
    33,
    251,
    164,
    44,
    60,
    119,
    159,
    230,
    165,
    126,
    205,
    65,
    219,
    73,
    39,
]

2021-12-04T19:02:03.589 INFO filecoin_proofs::api::seal > Verification Information:
2021-12-04T19:02:03.589 INFO filecoin_proofs::api::seal > comm_d: [
    120,
    196,
    57,
    62,
    203,
    240,
    111,
    96,
    104,
    121,
    166,
    61,
    91,
    76,
    81,
    150,
    196,
    72,
    185,
    65,
    13,
    58,
    44,
    194,
    245,
    169,
    221,
    121,
    150,
    44,
    188,
    57,
]

2021-12-04T19:02:03.589 INFO filecoin_proofs::api::seal > piece_infos: [
    PieceInfo {
        commitment: "dd1a78ad225502cc1f5c2b23131e0cd84621fba42c3c779fe6a57ecd41db4927",
        size: UnpaddedBytesAmount(
            68182605824,
        ),
    },
]

2021-12-04T19:02:03.590 INFO filecoin_proofs::api::seal > sector_size: SectorSize(
    68719476736,
)

2021-12-04T19:02:03.590 INFO filcrypto::proofs::api > seal_pre_commit_phase1: finish
2021-12-04T19:02:03.623-0800    ERROR   lotus-bench     lotus-bench/main.go:579 [5] Error %!w(*xerrors.wrapError=&{presealing sector 5 (/mnt/md0/bench3191370624/unsealed/s-t01000-5) 0xc000630030 {[41116210 43448743 55682358]}})

A retry is kicked off to seal Sector 5 again.

2021-12-04T19:02:03.628-0800    INFO    lotus-bench     lotus-bench/main.go:573 [5] Running replication retry count 2 ...
2021-12-04T19:02:03.629-0800    WARN    ffiwrapper      ffiwrapper/sealer_cgo.go:492    existing cache in /mnt/md0/bench3191370624/cache/s-t01000-5; removing
2021-12-04T19:02:18.162 INFO filcrypto::proofs::api > seal_pre_commit_phase1: start
2021-12-04T19:02:18.162 INFO filecoin_proofs::api::seal > seal_pre_commit_phase1:start: SectorId(5)
2021-12-04T19:02:18.162 INFO filecoin_proofs::api::seal > copy from: /mnt/md0/bench3191370624/unsealed/s-t01000-5 to: /mnt/md0/bench3191370624/sealed/s-t01000-5

The 2nd attempt succeeds in calculating a matching data commitment when building the merkle tree directly from the data in Sector 5.

2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > the root commitment: Scalar(0x2749db41cd7ea5e69f773c2ca4fb2146d80c1e13232b5c1fcc025522ad781add)
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > the commitment from root: [
    221,
    26,
    120,
    173,
    34,
    85,
    2,
    204,
    31,
    92,
    43,
    35,
    19,
    30,
    12,
    216,
    70,
    33,
    251,
    164,
    44,
    60,
    119,
    159,
    230,
    165,
    126,
    205,
    65,
    219,
    73,
    39,
]
2021-12-04T20:15:28.244 INFO filecoin_proofs::pieces > comm_d_calculated: [
    221,
    26,
    120,
    173,
    34,
    85,
    2,
    204,
    31,
    92,
    43,
    35,
    19,
    30,
    12,
    216,
    70,
    33,
    251,
    164,
    44,
    60,
    119,
    159,
    230,
    165,
    126,
    205,
    65,
    219,
    73,
    39,
]

2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > Verification Information:
2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > comm_d: [
    221,
    26,
    120,
    173,
    34,
    85,
    2,
    204,
    31,
    92,
    43,
    35,
    19,
    30,
    12,
    216,
    70,
    33,
    251,
    164,
    44,
    60,
    119,
    159,
    230,
    165,
    126,
    205,
    65,
    219,
    73,
    39,
]

2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > piece_infos: [
    PieceInfo {
        commitment: "dd1a78ad225502cc1f5c2b23131e0cd84621fba42c3c779fe6a57ecd41db4927",
        size: UnpaddedBytesAmount(
            68182605824,
        ),
    },
]

2021-12-04T20:15:28.244 INFO filecoin_proofs::api::seal > sector_size: SectorSize(
    68719476736,
)

cryptonemo commented 2 years ago

This still seems like a case where we can't rule out 'bad' hardware flipping bits.

mnachmia31 commented 2 years ago

@cryptonemo can you elaborate further on that statement? If it were a hardware issue, wouldn't it fail consistently?

With it failing intermittently, it seems more like an issue in the benchmark itself with the memory-mapped files, especially given that we see no errors on our hardware or in the dmesg logs on the host where the benchmark is running.

cryptonemo commented 2 years ago

> @cryptonemo can you elaborate further on that statement? If it were a hardware issue, wouldn't it fail consistently?
>
> With it failing intermittently, it seems more like an issue in the benchmark itself with the memory-mapped files, especially given that we see no errors on our hardware or in the dmesg logs on the host where the benchmark is running.

No, not necessarily. If you have 'bad' disk or RAM, it can fail one time and be ok the next. Repeated testing (e.g. memtest86/memtester) will usually show the errors over time. Disk errors can usually be seen in system logs (i.e. dmesg/syslog).

mnachmia31 commented 2 years ago

> No, not necessarily. If you have 'bad' disk or RAM, it can fail one time and be ok the next. Repeated testing (e.g. memtest86/memtester) will usually show the errors over time. Disk errors can usually be seen in system logs (i.e. dmesg/syslog).

Given that there have been past fixes to add flushes around mmap to resolve similar issues, could we not be running into a similar problem here? This was the other fix put in: https://github.com/filecoin-project/rust-fil-proofs/pull/1493

The reason I am pessimistic about it being a hardware issue is that we only see this problem if we explicitly simulate network congestion via packet drops through iptables rules. Without setting the iptables rules we never run into it. We also do not see any issues in the dmesg logs, and memory is nowhere near maximum utilization throughout the benchmark.

cryptonemo commented 2 years ago

Ok, I'm not saying it's certainly a hardware issue, just that it can't be ruled out yet. I suspect a retry won't always work, though.

mnachmia31 commented 2 years ago

@cryptonemo that's fair; in each of our runs where the benchmark has failed, the retry has succeeded in sealing the sector 100% of the time.

Based on the information provided, are there any areas in particular where we could focus our efforts with respect to the sealing code (the merkle tree generation)? We have already tried an mmap flush directly after the merkle tree generation, and it had no effect; we still reproduced the failure.

cryptonemo commented 2 years ago

Not sure. In the most paranoid case, flushing after each update may help (I'm curious whether that still sees the error, even though it's not really practical). It depends on how your system is stressed and what's going on, I suppose.

mnachmia31 commented 2 years ago

@cryptonemo we are willing to try that out.

To make sure we apply the flush in the correct location, would you mind pointing me to the area where each update is executed?

cryptonemo commented 2 years ago

I don't know where you're seeing the error (which tree builder is causing it?)

mnachmia31 commented 2 years ago

This is where the error happens: https://github.com/filecoin-project/rust-fil-proofs/blob/master/filecoin-proofs/src/api/seal.rs#L153

After the merkle tree is generated, the data commitment derived from the tree fails verification against the commitment computed from the piece in the sector.
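
For context, the piece-side commitment being compared against is just the piece commitment computed over the unsealed data. A hedged sketch of recomputing it with the public filecoin_proofs API, so it can be compared against the logged PieceInfo (the exact re-exports and signature are assumptions, and the path/size are example values):

use std::fs::File;

use anyhow::Result;
use filecoin_proofs::{generate_piece_commitment, Commitment, UnpaddedBytesAmount};

// Recompute the piece commitment straight from the unsealed sector file so it
// can be compared with the commitment in the logged PieceInfo.
fn recheck_piece(path: &str, unpadded_piece_size: u64) -> Result<Commitment> {
    let file = File::open(path)?;
    let piece_info = generate_piece_commitment(file, UnpaddedBytesAmount(unpadded_piece_size))?;
    Ok(piece_info.commitment)
}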

cryptonemo commented 2 years ago

Ok, the method is located here: https://github.com/filecoin-project/rust-fil-proofs/blob/master/storage-proofs-core/src/merkle/builders.rs#L177

The place you probably want to update is here: https://github.com/filecoin-project/merkletree/blob/master/src/store/disk.rs#L391

mnachmia31 commented 2 years ago

Thanks @cryptonemo we will flush after each mmap write and report our findings.

The following changes are going to be tested:

~/Projects/filecoin/merkletree$ git diff
diff --git a/rust-toolchain b/rust-toolchain
index bb68f08..bf867e0 100644
--- a/rust-toolchain
+++ b/rust-toolchain
@@ -1 +1 @@
-nightly-2020-01-26
+nightly
diff --git a/src/store/disk.rs b/src/store/disk.rs
index b482d9b..27cbb5a 100644
--- a/src/store/disk.rs
+++ b/src/store/disk.rs
@@ -390,6 +390,9 @@ impl<E: Element> Store<E> for DiskStore<E> {

                 write_mmap[0..hashed_nodes_as_bytes_len].copy_from_slice(&hashed_nodes_as_bytes);

+                write_mmap.as_mut().flush()
+                    .with_context(|| format!("could not flush memory write_mmap"))?;
+
                 Ok(())
             })
     }

mnachmia31 commented 2 years ago

@cryptonemo the benchmark still failed if we flushed on every mmap update.

A couple of observations from the benchmark run:

  1. The retry succeeded on the second attempt at sealing each of the sectors (4 out of 6) that had failed with the data commitment mismatch.
    ~/Projects/filecoin-log$ grep -i "try" job_run_log-2021-12-17-10-51-40
    2021-12-17T14:54:05.528-0800    INFO    lotus-bench lotus-bench/main.go:573 [5] Running replication try count 1 ...
    2021-12-17T14:54:05.528-0800    INFO    lotus-bench lotus-bench/main.go:573 [2] Running replication try count 1 ...
    2021-12-17T14:54:05.528-0800    INFO    lotus-bench lotus-bench/main.go:573 [3] Running replication try count 1 ...
    2021-12-17T14:54:05.528-0800    INFO    lotus-bench lotus-bench/main.go:573 [4] Running replication try count 1 ...
    2021-12-17T14:54:05.528-0800    INFO    lotus-bench lotus-bench/main.go:573 [1] Running replication try count 1 ...
    2021-12-17T14:54:05.528-0800    INFO    lotus-bench lotus-bench/main.go:573 [0] Running replication try count 1 ...
    2021-12-17T20:56:54.919-0800    INFO    lotus-bench lotus-bench/main.go:573 [4] Running replication try count 2 ...
    2021-12-17T20:56:55.841-0800    INFO    lotus-bench lotus-bench/main.go:573 [1] Running replication try count 2 ...
    2021-12-17T20:56:55.852-0800    INFO    lotus-bench lotus-bench/main.go:573 [2] Running replication try count 2 ...
    2021-12-17T21:01:39.156-0800    INFO    lotus-bench lotus-bench/main.go:573 [3] Running replication try count 2 ...
  2. After the retries succeeded and the generated data commitments matched for each of the 6 sectors, the logs indicate an inconsistency in the caches used in the benchmark:
    2021-12-18T02:18:58.588 INFO storage_proofs_porep::stacked::vanilla::cache > [open] parent cache: calculated consistency digest: "025b37777f4c8eb6df528397da44f2c719c6a8a7f0b310cda2315bc7f4b94d6c"
    2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > [!!!] Parent cache digest mismatch detected.  Regenerating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache
    2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > parent cache: generating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache

We have not seen a benchmark fail if we cleaned up the benchmark data from previous runs. However, if we do not perform any cleanup, we consistently run into the data commitment issue. Is it possible that cache data from previous benchmark runs could be causing issues with subsequent runs?

mnachmia31 commented 2 years ago

@cryptonemo looking further at the parent cache code, our finding is that the issue always occurs when the parent cache already exists from a previous run. Would it make sense to see if flushing the mmap for it as well helps? https://github.com/filecoin-project/rust-fil-proofs/blob/master/storage-proofs-porep/src/stacked/vanilla/cache.rs#L244

This logic also does not seem right to me, as it is not guaranteed that parent cache generation completed successfully just because the cache file exists. We have killed benchmarks in the middle of runs, before generation had completed, and the same path is used for the parent cache in the subsequent run: https://github.com/filecoin-project/rust-fil-proofs/blob/master/storage-proofs-porep/src/stacked/vanilla/cache.rs#L166

if path.exists() {
    // If the cache file exists and we've got the lock, generation has previously been
    // completed.  Insert that it no longer needs generation at this point unconditionally.
    if generated.get(&generation_key).is_none() {
        generated.insert(generation_key);
    }
    Self::open(len, cache_entries, graph, &path)
}
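
For illustration, a hedged sketch (not the actual cache code) of the kind of pattern that would make the existence check safe: generate into a temporary path and only rename into place once generation has fully completed.

use std::fs;
use std::io;
use std::path::Path;

// Generate a cache file atomically: write to a temporary path first and only
// rename it to the final path after the contents are complete and synced, so
// a killed run can never leave a half-written file at the final cache path.
fn generate_cache_atomically(
    final_path: &Path,
    generate: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<()> {
    if final_path.exists() {
        // With atomic renames, existence really does imply completed generation.
        return Ok(());
    }
    let tmp_path = final_path.with_extension("cache.tmp");
    generate(&tmp_path)?;
    fs::File::open(&tmp_path)?.sync_all()?;
    fs::rename(&tmp_path, final_path)?; // atomic on the same filesystem
    Ok(())
}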

The last question I have: is there a way to specify a unique parent cache key for each benchmark? Based on what we are observing, this sounds very similar to this issue, where the only way to proceed once the parent cache gets into a bad state is to remove it: https://github.com/filecoin-project/rust-fil-proofs/issues/1264#issuecomment-678428428

cryptonemo commented 2 years ago

There is a cache verification setting that you should probably have enabled if you're seeing consistency issues.

Related work to address this: https://github.com/filecoin-project/rust-fil-proofs/pull/1265 https://github.com/filecoin-project/rust-fil-proofs/pull/1496

mnachmia31 commented 2 years ago

@cryptonemo we are running the benchmark with the following configuration, the last setting of which enables the cache verification:

export FIL_PROOFS_USE_MULTICORE_SDR=0
export FIL_PROOFS_MAXIMIZE_CACHE=0
export FIL_PROOFS_MAXIMIZE_CACHING=0
export FIL_PROOFS_VERIFY_CACHE=1

This seems to be the cause of our benchmark failing consistently. If the parent cache file path exists but generation had not completed, we run into the 'pieces and comm_d do not match' error in seal pre-commit phase 1.

cryptonemo commented 2 years ago

Hrm, ok. And in the logging, do you ever see a failure on the parent cache and it being regenerated during the run?

mnachmia31 commented 2 years ago

Yes, but only after the data commitments had failed and the retry we implemented had succeeded. Once the retry kicked in and succeeded on the data commitment verification for the sectors that originally failed, we see the following messages:

2021-12-18T02:18:58.588 INFO storage_proofs_porep::stacked::vanilla::cache > [open] parent cache: calculated consistency digest: "025b37777f4c8eb6df528397da44f2c719c6a8a7f0b310cda2315bc7f4b94d6c"
2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > [!!!] Parent cache digest mismatch detected.  Regenerating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache
2021-12-18T02:18:58.593 INFO storage_proofs_porep::stacked::vanilla::cache > parent cache: generating /mnt/parent_cache/64gib/v28-sdr-parent-4905486b7af19558ac3649bc6261411858b6add534438878c4ee3b29d8b9de0b.cache

However, this will not be visible unless the retries are implemented.

cryptonemo commented 2 years ago

Ok, so you think there's still a bug in the parent cache generation, or a bug that prevents recovering properly when it's detected as bad? Would you mind posting an issue or PR with your findings and the code that you've got working? We can review it properly in the new year.

mnachmia31 commented 2 years ago

@cryptonemo we're not definitively sure what is causing it, but there seems to be a correlation between the parent cache existing in a bad state and seal pre-commit phase 1 failing.

The retry logic actually lives in the lotus code and can be seen below. These are the main changes that circumvent the issue, as adding the mmap flushes in this repo did not appear to have any positive effect.

~/Projects/filecoin/lotus$ git diff --cached
diff --git a/cmd/lotus-bench/main.go b/cmd/lotus-bench/main.go
index 0b8ec6fe3..ed600a9f1 100644
--- a/cmd/lotus-bench/main.go
+++ b/cmd/lotus-bench/main.go
@@ -559,9 +559,25 @@ func runSeals(sb *ffiwrapper.Sealer, sbfs *basicfs.Provider, numSectors int, par
                                        trand := blake2b.Sum256(ticketPreimage)
                                        ticket := abi.SealRandomness(trand[:])

+                                       var c int
+                                       var pc1o storage.PreCommit1Out
+                                       var err error
+
                                        log.Infof("[%d] Running replication(1)...", i)
                                        piece := []abi.PieceInfo{pieces[i]}
-                                       pc1o, err := sb.SealPreCommit1(context.TODO(), sid, ticket, piece)
+
+                                       for {
+                                               if c >= 3 {
+                                                       break
+                                               }
+                                               log.Infof("[%d] Running replication try count %d ...", i, c+1)
+                                               pc1o, err = sb.SealPreCommit1(context.TODO(), sid, ticket, piece)
+                                               if err == nil {
+                                                       break
+                                               }
+                                               c++
+                                               log.Errorf("[%d] Error %w", i, err)
+                                       }
                                        if err != nil {
                                                return xerrors.Errorf("commit: %w", err)
                                        }

cryptonemo commented 2 years ago

Interesting, ok thanks. We'll have to try to get to the bottom of this and find a way to reproduce this within proofs.

mnachmia31 commented 2 years ago

@cryptonemo if it helps, here is the script we are executing to initiate the benchmarks that run into the issue:

#!/bin/bash -xe
cd $(dirname $0)
PREFIX=${PREFIX:-/mnt}
MD_PATH=${MD_PATH:-${PREFIX}/md0}
export FIL_PROOFS_PARENT_CACHE=${PARENT_CACHE:-${PREFIX}/parent_cache/64gib/}
export FIL_PROOFS_PARAMETER_CACHE=${PARAM_CACHE:-${PREFIX}/param_cache/64gib/}
export FIL_PROOFS_USE_MULTICORE_SDR=0
export FIL_PROOFS_MAXIMIZE_CACHE=0
export FIL_PROOFS_MAXIMIZE_CACHING=0
export FIL_PROOFS_VERIFY_CACHE=1
#export RUST_LOG=TRACE
export RUST_LOG=DEBUG
export TMPDIR=$(dirname ${FIL_PROOFS_PARENT_CACHE})/tmp/

mkdir -p ${FIL_PROOFS_PARENT_CACHE}
mkdir -p ${FIL_PROOFS_PARAMETER_CACHE}

mkdir -p ${TMPDIR}

logPath="../$(basename $(pwd))-log"
mkdir -p $logPath
dt=$(date "+%F-%H-%M-%S")

echo Run WITH packet drop | tee ${logPath}/job_run_log-$dt
drop_pers="0.01 0.02 0.03 0.04 0.05"

for p in $drop_pers ;do
    iptables -A INPUT -i enp129s0 -m statistic --mode random --probability $p -j DROP
    iptables -A OUTPUT -o enp129s0 -m statistic --mode random --probability $p -j DROP
done

on_exit()
{
    rc=$?
    echo "$0: Shutting down.. cleaning up"

    set +e

    if [[ -z $drop_pers ]];then
        return $rc
    fi
    for p in $drop_pers ;do
       iptables -D INPUT -i enp129s0 -m statistic --mode random --probability $p -j DROP
       iptables -D OUTPUT -o enp129s0 -m statistic --mode random --probability $p -j DROP
    done

    return $rc
}

trap on_exit 0 1 2 3 6 15

dmesg > $logPath/demsg-before-$dt

cd lotus

time cgexec \
        -g memory:lotusbench numactl \
        --interleave=all ./lotus-bench sealing \
        --storage-dir ${MD_PATH} \
        --sector-size 64GiB \
        --no-gpu \
        --skip-commit2 \
        --skip-unseal \
        --parallel 6 \
        --num-sectors 6 2>&1 | tee -a ../${logPath}/job_run_log-$dt

cd -

dmesg > $logPath/demsg-after-$dt

mnachmia31 commented 2 years ago

@cryptonemo did you have any chance to investigate this further?

cryptonemo commented 2 years ago

Unfortunately, not yet

mnachmia31 commented 2 years ago

@cryptonemo checking in again. Any updates?

mnachmia31 commented 2 years ago

@cryptonemo bumping this again. Any updates?

cryptonemo commented 2 years ago

@mnachmia31 We still don't have a way to reproduce this -- thus no progress on the issue.

mnachmia31 commented 2 years ago

@cryptonemo it's been a bit of time. Have you been able to reproduce this on your end yet?

cryptonemo commented 2 years ago

> @cryptonemo it's been a bit of time. Have you been able to reproduce this on your end yet?

No, I haven't seen this consistently here at all. We regularly test on locally attached storage (primarily) as well as network attached storage (mostly when we run out of space locally).

vmx commented 1 year ago

There doesn't seem to be a way to reproduce this issue reliably, hence I'm closing this issue. If someone finds a way to reproduce, please re-open this issue.