buckyos / CYFS

CYFS is the next-generation technology to build real Web3 by upgrading the basic protocol of Web (TCP/IP+DNS+HTTP),is short for CYberFileSystem. https://www.cyfs.com/, cyfs://cyfs/index_en.html.
https://www.cyfs.com/
BSD 2-Clause "Simplified" License
1.99k stars 276 forks source link

GC mechanism based on full scan #155

Open lurenpluto opened 1 year ago

lurenpluto commented 1 year ago

Consider adding a GC mechanism, which has been planned for a long time and is a relatively complicated mechanism, so first complete the first version based on a full scan

GC mainly includes two types of data:

The relevant points are as follows:

  1. If a path is marked as virtual, it will not enter the next level of scanning, so during the scanning process, it is necessary to apply rmeta to each level of path to determine whether the path is virtual

  2. The default non-objectmap leaf node object scans at most one-level reference objects, but this default configuration can be modified through object meta, so during the scanning process, object meta also needs to be applied to non-objectmap objects to determine the references that need to be scanned object depth

streetycat commented 1 year ago
  1. Is the GroupState included, or the GroupState is a GlobalState?

  2. I have learn the solutions for GC just a moment ago, did you mean it will use ZGC or similar.

My simple flow:

And the clients will access the CyfsStack without stop.

lurenpluto commented 1 year ago
  1. Is the GroupState included, or the GroupState is a GlobalState?
  2. I have learn the solutions for GC just a moment ago, did you mean it will use ZGC or similar.

My simple flow:

  • Increase the generation for every object when the GC start.
  • The GC thread will visit the objects mount in the state tree, and re-initialize the generation when it's visited. It's also occured when the client thread visit a object with old generation.
  • Remove the objects with great generation beyond the specified value.

And the clients will access the CyfsStack without stop.

  1. GroupState is a type of global-state, which is equivalent to the root-state in the Garbage Collection (GC) context. It needs to be independently scanned and marked.

  2. Similar to ZGC, the overall idea of GC is, "In summary, GC watches which objects can be reached from our application through a chain of references and frees up the ones we can't reach." The references here mainly include global-state and some key detached objects, but the GC strategies used may differ.

Your flow is almost right. The current approach under consideration is to scan, mark, and periodically collect. The marking occurs either when the object or chunk is scanned by the GC operation or when it is used. This process needs further refinement. For example, the current mechanism for updating the "last-access-time" when a chunk is used is missing, and this should be taken into account as well.

The GC (Garbage Collection) work should ideally be executed quietly during the idle time of the cyfs-stack, making it "unnoticeable" to external users.

streetycat commented 1 year ago

I see, it's great, I have used it to clear the timeout cache(SN, and Group), but it's simple.

I learned it theoretically today.

waterflier commented 1 year ago

I think some pseudocode is needed to show and help application developers understand this process.

And I believe, such as this process is related to the implementation of RootState. It should be because before there is no concept of Root State, we have no way to accurately distinguish between Object saving and caching.

Since it is strongly related to the operation of Root State, there are two types of GC

TYPE I : post-scan based

The post-scan is also similar to the current Root State backup, starting from the root path, traversing all paths and marking the corresponding objects as "valid". Objects not marked as valid can be deleted after the scan is complete.

During the scan, the system is read-only.

TYPE II: based on real-time operation

We have APIs for adding objects and deleting objects to the Root State, and in these two APIs, the reference counts of objects are operated (can be asynchronous). When the system has no unscanned operations to add objects, it can execute the GC function to delete objects whose reference count is less than 1.

Perhaps we can further illustrate the difference between these two directions through pseudocode.

The system itself has strong and consistent synchronization and backup requirements, and it already has the implementation of TYPE I.

The implementation of TYPE II is related to the transactional nature of the Root State API. Optimization can achieve "only scan the changed part". It seems that it can reduce the time of system frozen, but there may be some insurmountable difficulties?

streetycat commented 1 year ago

I learn a word free root in this issue.

does it mean as the global roots here? How they will behave on GC?

lurenpluto commented 1 year ago

I learn a word free root in this issue.

does it mean as the global roots here? How they will behave on GC?

roots in the GC mechanism is a very key role, you can refer to the following introduction document

java-gc-roots

  1. GC Root Definition Let's first define what GC roots are. GC root is a term used in the context of garbage collection in Java. As the name suggests, GC roots are starting points for the garbage collector processes. In general, all objects directly or indirectly referred from a GC root are not garbage collected.

In our GC system, roots include all global-state roots and a list of free objects (which may be outside of the global-state)

lurenpluto commented 1 year ago

I think some pseudocode is needed to show and help application developers understand this process.

And I believe, such as this process is related to the implementation of RootState. It should be because before there is no concept of Root State, we have no way to accurately distinguish between Object saving and caching.

Since it is strongly related to the operation of Root State, there are two types of GC

TYPE I : post-scan based

The post-scan is also similar to the current Root State backup, starting from the root path, traversing all paths and marking the corresponding objects as "valid". Objects not marked as valid can be deleted after the scan is complete.

During the scan, the system is read-only.

TYPE II: based on real-time operation

We have APIs for adding objects and deleting objects to the Root State, and in these two APIs, the reference counts of objects are operated (can be asynchronous). When the system has no unscanned operations to add objects, it can execute the GC function to delete objects whose reference count is less than 1.

Perhaps we can further illustrate the difference between these two directions through pseudocode.

The system itself has strong and consistent synchronization and backup requirements, and it already has the implementation of TYPE I.

The implementation of TYPE II is related to the transactional nature of the Root State API. Optimization can achieve "only scan the changed part". It seems that it can reduce the time of system frozen, but there may be some insurmountable difficulties?

The impact of rmeta and object meta

Another key concept in GC is rmeta and object meta, which will directly affect the result of GC.

So whether it is a full scan or an incremental scan based on global-state change, it can be done based on the above meta information, which is where the complexity of GC lies

GC mechanism based on generation

During GC full scan, it is not necessary to set the whole global-state to read-only mode: on the one hand, the full scan may take a long time, and setting it to read-only will have too much impact on the whole system; on the other hand, it is not necessary, because the full scan is based on the snapshot of global-state, and if changes occur during the scan, it should not affect the result of the GC. However, there is a point to note is that based on the result of one GC scan, all unrelated objects can not be cleared, but should be based on the garbage tagging and recycling of generations, there are several principles need to be observed as follows:

  1. Each full scan, set the generation marker for all the objects/chunks referenced in that scan
  2. Perform full scans periodically, so the generation markers of the same object/chunk may change and be recursive
  3. Cyfs-stack upper layer needs to update the corresponding last_access_time each time the object/chunk is read
  4. The real GC operation is to periodically clear all objects/chunks that meet the following conditions
    • last_access_time is less than the specified date, such as a day or a week
    • The generation marker is less than the specified generation, either by generation version (incremental) or by GC scan date

GC scan based on global-state real-time changes

From the reference-based GC principle mentioned above, the object level should not be based on reference count, but on whether the generation marker and last access time are scanned; incremental GC scan based on global-state real-time changes can be a supplement to the above full-scan GC, but only the new sub tree needs to be scanned, and there is no need to scan the removed sub tree and decrement the reference count.