ArcadeData / arcadedb

ArcadeDB Multi-Model Database, one DBMS that supports SQL, Cypher, Gremlin, HTTP/JSON, MongoDB and Redis. ArcadeDB is a conceptual fork of OrientDB, the first Multi-Model DBMS. ArcadeDB supports Vector Embeddings.
https://arcadedb.com
Apache License 2.0
494 stars 60 forks source link

Reuse RID of deleted records #1391

Open lvca opened 10 months ago

lvca commented 10 months ago

Discussed in https://github.com/ArcadeData/arcadedb/discussions/1203

Originally posted by **lvca** August 12, 2023 ArcadeDB does not reuse the assigned RIDs. Never. Even when the record is deleted. This has the advantage of determining if a link is broken (the RID points to a non-existent record), useful in case of unidirectional edges. In other cases reusing the RIDs allows reusing all the space in pages. In some use cases that are heavily based on deletes, this feature can save a lot of space and improve overall performance. I'd like to make this configurable, so the user can decide per bucket/type if the RIDs are reusable.

Phase 1 (scheduled for 24.1.1):

Phase 2 (it will be separated into a new issue):

lvca commented 10 months ago

This branch is already in good shape. The records are fully recycled, and I optimized the record management on the page. We have some statements in the documentation that we never recycle RIDs, so we should prepare the docs for this change in 24.1.1.

Still missing one point to make the recycle fully working and the last as optional in this phase to speed up things, but not necessary for the 1st release.

tolgaulas commented 10 months ago

@lvca, as I see you're started considering space economy, a very "most-wanted" missing feature of orientdb was to be able to compact files.. Can this be a case in ArcadeDb anytime soon?

lvca commented 10 months ago

@tolgaulas Correct, ArcadeDB will be able to reuse 100% of the space over time. This first issue will be able to reclaim all the space used and deleted. Indexes already supported compaction since the beginning. What is still missing (after this issue) is:

  1. To properly recover the space occupied by shrunk multi-page records (a record spans over multiple pages and then shrinks in size)
  2. Double-check if the space is 100% reclaimed in the linked list used by edges

(Updated the main issue with these new sub-tasks)

lvca commented 10 months ago

@tolgaulas You're right about OrientDB, some structures didn't allow to reuse the whole allocated space, especially with indexes and ridbags.

lvca commented 7 months ago

This PR is in good shape. The missing issue to start thinking about a possible merge (still to test how well works with old legacy databases) is how to manage deleted RIDs in index.

Our LSMTree is highly optimized for ArcadeDB. Any LSMTree has 2 parts (1) the last page which is mutable, and (2) all the other pages are immutable. In ArcadeDB the records that have been deleted have an entry on a more recent page with a negative RID to mark the record as deleted.

Since now record ids are reused, this causes a conflict if a record has been inserted, removed and inserted back.

lvca commented 6 months ago

After further testing the index is fine, it already supports multiple entries with the same RID. The problem is with the test case ConsoleAsyncInsertTest.testOrderByAfterDeleteInsert which uses the async API. For some reason, the new records inserted after a delete are not updating one index.

lvca commented 4 months ago

Some news. It's nothing related to Asynch, but seems more about how the index manages recycled RIDs when they end up on multiple pages. Still digging into the specific issue. This could be the last issue before doing some porting tests with old databases.