codex-storage / nim-codex

Decentralized Durability Engine
https://codex.storage
Apache License 2.0

[BUG] Codex hangs when creating a storage request for a big file using the REST api #877

Open 2-towns opened 3 months ago

2-towns commented 3 months ago

Describe the bug: I uploaded a 583.7 MB MP4 file. When I try to create a storage request for this file using the REST API, the Codex client hangs and I cannot make any further requests.

To Reproduce: the steps below reproduce the behavior.

  1. Upload an MP4 file of about 500 MB
  2. Get the CID
  3. Create a storage request with these parameters:
    {
      "duration": "3600",
      "reward": "2",
      "proofProbability": "3",
      "nodes": "2",
      "tollerance": "0",
      "collateral": "10",
      "expiry": "900"
    }
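
For anyone scripting step 3, here is a minimal Python sketch of the request. It assumes the node's REST API is reachable at a hypothetical `BASE_URL` (host, port, and path prefix are placeholders, not taken from this report) and that the endpoint is `POST /storage/request/{cid}` as mentioned later in the thread. The parameters are copied verbatim from the report, key spelling included. `build_storage_request` and `send` are illustrative names.

```python
import json
import urllib.request

# Hypothetical base URL: adjust host, port, and path prefix to your own node.
BASE_URL = "http://localhost:8080/api/codex/v1"

# Parameters copied verbatim from the report (key spelling as reported).
PARAMS = {
    "duration": "3600",
    "reward": "2",
    "proofProbability": "3",
    "nodes": "2",
    "tollerance": "0",
    "collateral": "10",
    "expiry": "900",
}


def build_storage_request(cid: str, params: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that creates a storage request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/storage/request/{cid}",
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request; the timeout turns a hang into an error on the client side."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `send(build_storage_request(cid, PARAMS))` with the CID from step 2 issues the request. With a client-side timeout, the hang described here would surface as a timeout error instead of blocking the caller indefinitely.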

Expected behavior: The storage request is created and I receive the request ID. I do not expect the client to block while the request is being created.

Environment:

Additional context: Once the storage request expires, the client stops hanging.

When I kill the client (Ctrl + C) while it is hanging, it crashes with these logs:

Traceback (most recent call last, using override)
/home/arnaud/Work/codex/nim-codex/codex.nim(142) codex
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncloop.nim(263) _ZN9asyncloop4pollE
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/node.nim(451) _ZN12setupRequest12setupRequestE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(287) _ZN7builder13buildManifestE12SlotsBuilderI10MerkleTreeI2FrI6staticIN18curves_declaration5CurveEEEN9poseidon216PoseidonKeysEnumEE2FrI6staticIN18curves_declaration5CurveEEEE
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(288) _ZN13buildManifest13buildManifestE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(257) _ZN7builder10buildSlotsE12SlotsBuilderI10MerkleTreeI2FrI6staticIN18curves_declaration5CurveEEEN9poseidon216PoseidonKeysEnumEE2FrI6staticIN18curves_declaration5CurveEEEE
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(270) _ZN10buildSlots10buildSlotsE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(218) _ZN7builder9buildSlotE12SlotsBuilderI10MerkleTreeI2FrI6staticIN18curves_declaration5CurveEEEN9poseidon216PoseidonKeysEnumEE2FrI6staticIN18curves_declaration5CurveEEEE25range09223372036854775807
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(231) _ZN9buildSlot9buildSlotE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(206) _ZN7builder13buildSlotTreeE12SlotsBuilderI10MerkleTreeI2FrI6staticIN18curves_declaration5CurveEEEN9poseidon216PoseidonKeysEnumEE2FrI6staticIN18curves_declaration5CurveEEEE25range09223372036854775807
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(212) _ZN13buildSlotTree13buildSlotTreeE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(171) _ZN7builder13getCellHashesE12SlotsBuilderI10MerkleTreeI2FrI6staticIN18curves_declaration5CurveEEEN9poseidon216PoseidonKeysEnumEE2FrI6staticIN18curves_declaration5CurveEEEE25range09223372036854775807
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(196) _ZN13getCellHashes13getCellHashesE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(136) _ZN7builder14buildBlockTreeE12SlotsBuilderI10MerkleTreeI2FrI6staticIN18curves_declaration5CurveEEEN9poseidon216PoseidonKeysEnumEE2FrI6staticIN18curves_declaration5CurveEEEE25range0922337203685477580725range09223372036854775807
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/slots/builder/builder.nim(157) _ZN14buildBlockTree14buildBlockTreeE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/stores/repostore/store.nim(73) _ZN5store8getBlockE3refIN5types25RepoStorecolonObjectType_EEN3cid3CidE25range09223372036854775807
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/stores/repostore/store.nim(74) _ZN8getBlock44getBlock
/home/arnaud/Work/codex/nim-codex/codex/stores/repostore/operations.nim(55) _ZN10operations15getLeafMetadataE3refIN5types25RepoStorecolonObjectType_EEN3cid3CidE25range09223372036854775807
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/codex/stores/repostore/operations.nim(59) _ZN15getLeafMetadata15getLeafMetadataE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/vendor/nim-datastore/datastore/typedds.nim(98) _ZN7typedds3getE3refIN7typedds30TypedDatastorecolonObjectType_EEN3key3KeyE
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/vendor/nim-datastore/datastore/typedds.nim(101) _ZN3get45get
/home/arnaud/Work/codex/nim-codex/vendor/nim-datastore/datastore/leveldb/leveldbds.nim(43) _ZN9leveldbds3getE3refIN9leveldbds32LevelDbDatastorecolonObjectType_EEN3key3KeyE
/home/arnaud/Work/codex/nim-codex/vendor/nim-chronos/chronos/asyncfutures2.nim(318) _ZN9asyncloop14futureContinueE3refIN7futures26FutureBasecolonObjectType_EE
/home/arnaud/Work/codex/nim-codex/vendor/nim-datastore/datastore/leveldb/leveldbds.nim(45) _ZN3get59get
/home/arnaud/Work/codex/nim-codex/vendor/nim-leveldbstatic/leveldbstatic.nim(230) _ZN13leveldbstatic3getE3refIN13leveldbstatic23LevelDbcolonObjectType_EE6string
/home/arnaud/Work/codex/nim-codex/vendor/nim-leveldbstatic/vendor/db/c.cc(206) leveldb_get
/home/arnaud/Work/codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(631) signalHandler
/home/arnaud/Work/codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(314) _ZN6system18rawWriteStackTraceE3varI6stringE
/home/arnaud/Work/codex/nim-codex/vendor/nimbus-build-system/vendor/Nim/lib/system/stacktraces.nim(59) _ZN11stacktraces30auxWriteStackTraceWithOverrideE3varI6stringE
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
./start-client-node.sh: line 9: 360563 Segmentation fault      (core dumped)

benbierens commented 3 months ago

When you POST to /storage/request/{cid}, the following work is being done by the application:

The hashing and erasure coding are CPU-intensive operations, and none of this work is performed on a separate thread. If you have a large dataset, the process might take a while, and the app will be completely unresponsive during that time. This is a real problem, and not just for users: during this time the app also doesn't respond to network traffic. Other nodes may consider your node lost and drop you from their routing tables.
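
The effect can be reproduced in miniature with any single-threaded event loop. The sketch below uses Python's `asyncio` as an analogy only; it is not Codex code, and `cpu_bound_work`, `heartbeat`, `blocking_handler`, and `measure_max_gap` are made-up names. A heartbeat task asks to wake every 10 ms, while a handler runs CPU-bound work without ever yielding. The longest observed gap between heartbeats shows how long the loop was stalled.

```python
import asyncio
import hashlib
import time


def cpu_bound_work(seconds: float) -> str:
    """Stand-in for hashing/erasure coding: burns CPU and never yields."""
    digest = hashlib.sha256()
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        digest.update(b"\x00" * 4096)
    return digest.hexdigest()


async def heartbeat(gaps: list, interval: float = 0.01) -> None:
    """Record how long each nominal `interval` sleep really took."""
    while True:
        started = time.perf_counter()
        await asyncio.sleep(interval)
        gaps.append(time.perf_counter() - started)


async def blocking_handler(seconds: float) -> str:
    # No await inside: the whole computation runs on the event loop thread.
    return cpu_bound_work(seconds)


async def measure_max_gap(seconds: float) -> float:
    """Run the handler next to a heartbeat and return the longest stall seen."""
    gaps: list = []
    ticker = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.05)        # let the heartbeat tick a few times
    await blocking_handler(seconds)  # stalls every other task on the loop
    await asyncio.sleep(0.05)        # let the delayed heartbeat record its gap
    ticker.cancel()
    try:
        await ticker
    except asyncio.CancelledError:
        pass
    return max(gaps)
```

Running `asyncio.run(measure_max_gap(0.2))` should report a longest gap of about 0.2 s or more, even though the heartbeat asked to wake every 10 ms. A peer waiting on that heartbeat would see the node go silent for the whole computation.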

Needless to say, this is something the client team has to fix. An obvious approach would be to move the erasure coding and slot building to another thread. Ideally, the API would have a "work object" that represents the inputs, outputs, and status of this entire process, so that the user could get some feedback (and the POST itself wouldn't need to block until it's finished).
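
To make the "work object" idea concrete, here is a rough Python sketch of the shape such an API could take. It is an illustration under assumptions, not a proposal for the actual Codex design: `WorkObject`, `WorkRegistry`, and the status names are all hypothetical. The submit call returns an ID immediately, the heavy function runs on a worker thread, and callers poll the work object for status and output. In CPython, CPU-bound threads still contend for the interpreter lock, so the sketch shows the API shape and not a performance fix.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, Optional


@dataclass
class WorkObject:
    """Hypothetical record of one long-running job: inputs, status, output."""
    id: str
    inputs: dict
    status: str = "pending"  # pending -> running -> done | failed
    output: Any = None
    error: Optional[str] = None


class WorkRegistry:
    """Runs jobs on worker threads and tracks their work objects."""

    def __init__(self, max_workers: int = 1) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, WorkObject] = {}
        self._lock = threading.Lock()

    def submit(self, fn: Callable[..., Any], inputs: dict) -> str:
        """Register the job and return its ID without waiting for the result."""
        job = WorkObject(id=uuid.uuid4().hex, inputs=inputs)
        with self._lock:
            self._jobs[job.id] = job
        self._pool.submit(self._run, job, fn)
        return job.id

    def _run(self, job: WorkObject, fn: Callable[..., Any]) -> None:
        with self._lock:
            job.status = "running"
        try:
            result = fn(**job.inputs)
        except Exception as exc:  # record the failure instead of losing it
            with self._lock:
                job.status, job.error = "failed", str(exc)
        else:
            with self._lock:
                job.status, job.output = "done", result

    def get(self, job_id: str) -> WorkObject:
        """Return a snapshot of the job so callers can poll its status."""
        with self._lock:
            return replace(self._jobs[job_id])

    def shutdown(self) -> None:
        """Wait for outstanding jobs and release the worker threads."""
        self._pool.shutdown(wait=True)
```

In a REST setting, the POST handler would call `submit` and return the ID right away, and a separate status endpoint would call `get`. Both endpoints are assumptions here, since the thread does not define them.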

gmega commented 1 month ago

This will be addressed by: