deroproject / derohe

DERO Homomorphic Encryption Blockchain Protocol

Optimize storage of the blockchain data #87

Open Robyer opened 2 years ago

Robyer commented 2 years ago

I think there are 2 main problems with the blockchain data storage right now: 1) the enormous number of files, and 2) the extremely small size of these files.

A filesystem has a cluster size (the smallest unit of data that can be addressed on the disk), and if your files are smaller than that, the remaining space is wasted because it can't be used for anything else. For example, NTFS has a default cluster size of 4 KB. Most of the files in the Dero mainnet folder are only 101 bytes (I think that is a single transaction, or a single block with no transactions in it). That means every such file Dero saves wastes 3995 bytes of that 4 KB, which is 97.5 % wasted space for these files.
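As a quick sanity check on those numbers, here is a minimal Go snippet that reproduces the arithmetic. The 101-byte file size and 4 KB cluster size are the values quoted above; nothing here is derohe code:

```go
package main

import "fmt"

func main() {
	const clusterSize = 4096 // default NTFS cluster size in bytes
	const fileSize = 101     // typical size of a single-block file, per the observation above

	// A file always occupies whole clusters on disk, so the last
	// (here: only) cluster is mostly padding.
	clustersUsed := (fileSize + clusterSize - 1) / clusterSize
	onDisk := clustersUsed * clusterSize
	wasted := onDisk - fileSize

	fmt.Printf("on disk: %d B, wasted: %d B (%.1f%%)\n",
		onDisk, wasted, 100*float64(wasted)/float64(onDisk))
	// Output: on disk: 4096 B, wasted: 3995 B (97.5%)
}
```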

Not to mention fragmentation and slow access speed (for reading, writing, and deleting), as opposed to having one large file.

This high number of separate files is also a problem in itself, as the number of files a filesystem can hold is limited.

I see that the mainnet/balances folder contains larger files, each 2 GB. I don't know what these files are (wallet data?), but that is a much more effective approach - you don't have 1 file per wallet either.

Possible solution: Don't save each block/transaction as a single file, but combine multiple blocks into one file. The number of blocks to combine could even be defined by a constant in the code, so an advanced user could modify it according to their needs. Or it could be made dynamic, combining fewer blocks when there are more transactions.

E.g. combine every 1000 blocks into a single file. Right now Dero has 734252 blocks, so that would be 735 files. The transactions currently take about 16 GB, so each file would be around 22 MB, which is reasonable even for sending over the network. A file with 1000 empty blocks would only be about 100 kB.
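For illustration, here is a minimal Go sketch of how such a bundling constant could map a block's topoheight to its bundle file. The constant `BlocksPerFile` and the function `bundlePath` are hypothetical names, not existing derohe identifiers:

```go
package storage

import (
	"fmt"
	"path/filepath"
)

// BlocksPerFile is the hypothetical bundling constant discussed above;
// an advanced user could tune it before building the node.
const BlocksPerFile = 1000

// bundlePath maps a block's topoheight to the bundle file that would
// hold it, e.g. topoheight 734252 -> "bundles/000734000.bundle".
func bundlePath(dataDir string, topoheight uint64) string {
	start := (topoheight / BlocksPerFile) * BlocksPerFile
	return filepath.Join(dataDir, "bundles", fmt.Sprintf("%09d.bundle", start))
}
```

New blocks would simply be appended to the newest bundle, and a lookup only needs the topoheight to find the right file.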

Everything would run smoother with fewer problems, while keeping the ability to easily rsync the data, and copying the files would be much faster.

gab81 commented 1 year ago

yes, and in addition here's a scenario I recently encountered: I recently moved Dero's CLI to another drive for backup - can happen, right? - and it took FOREVER, 4.6 million files in total, no joke, with my 16 GB computer sucking up memory. I had to constantly free it up during the copy, but it worked. Then Windows started indexing all of them as well, creating a massive index database file that I had to delete later, fine. I totally agree with what Robyer suggested and hope something is done on this to make it more efficient :)

thanks, Gab

lcances commented 5 months ago

Adding my 2 cents to this (I know it is old; this is also a note for my to-do list).

I would like to propose Hierarchical Data Format version 5 (HDF5). It is commonly used to store large datasets, and each HDF5 file can be seen as a local key:value database.

Each HDF5 file could be a collection of 10 000 blocks, which would result in 368 files of 0.29 GB (as of today), and the advantages are multiple.

The implementation is not that hard, but testing and ensuring reliability could take some time. And during the entire process, both the old files and the new files would need to coexist (doubling the required storage space).

Steps for the implementation could be (there are most likely other ways to do it, but this is how I would do it):

  1. Refactor the current tx / block IO mechanism to use an interface
  2. Implement the HDF5 solution using the exact same interface
  3. Add a parameter to start the node using either HDF5 or single files
  4. Create a tool to copy the existing files into HDF5 files
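To make step 1 a bit more concrete, here is a hedged Go sketch of what such an interface could look like. All names (`BlockStore`, `SingleFileStore`, `BundledStore`) are hypothetical and not taken from the derohe codebase; the actual method set would follow whatever the current store code needs:

```go
package storage

// BlockStore is a hypothetical abstraction over block/transaction IO
// (step 1 above). The existing one-file-per-object code and a future
// bundled store would both satisfy it, so the node could be started
// with either backend (step 3).
type BlockStore interface {
	StoreBlock(blid [32]byte, serialized []byte) error
	LoadBlock(blid [32]byte) ([]byte, error)
	StoreTX(txid [32]byte, serialized []byte) error
	LoadTX(txid [32]byte) ([]byte, error)
	Close() error
}

// SingleFileStore would wrap the current behaviour: one small file per
// block or transaction on the filesystem.
type SingleFileStore struct{ dir string }

// BundledStore would group many blocks per file, keyed by id inside the
// file (an HDF5 container via a binding such as gonum.org/v1/hdf5, or a
// plain key/value file format, could sit behind this type).
type BundledStore struct{ path string }
```

With something like this in place, the migration tool from step 4 would simply read every block and transaction from one BlockStore and write it into the other.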