HugPuddle / Apertus

Apertus - Store data, browse and communicate on your favorite Blockchains.
http://Apertus.io
MIT License

compress duplicate transaction id's in Ledger file #23

Open embiimob opened 8 years ago

embiimob commented 8 years ago

After the ledger file is created, process the file to replace duplicate transaction IDs with a single transaction ID and a count of the duplicate rows, separated by a delimiter.

Example: 46a022b8d2b5a74946a56dfb4d35c6da8df391eace2288f30b136b9229583c97 46a022b8d2b5a74946a56dfb4d35c6da8df391eace2288f30b136b9229583c97 46a022b8d2b5a74946a56dfb4d35c6da8df391eace2288f30b136b9229583c97 46a022b8d2b5a74946a56dfb4d35c6da8df391eace2288f30b136b9229583c97

would be replaced with: 46a022b8d2b5a74946a56dfb4d35c6da8df391eace2288f30b136b9229583c97:4

Update the file-building process to interpret the :4 suffix as "repeat this transaction ID 4 times in a row".
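A minimal sketch of the run-length scheme described above, assuming the ledger is a sequence of hex transaction IDs and that `:` is the delimiter (as in the `:4` example). The function names `compress_ledger` and `expand_ledger` are hypothetical and are not part of Apertus.

```python
from itertools import groupby

DELIMITER = ":"  # assumed delimiter, taken from the ":4" example


def compress_ledger(txids):
    """Collapse each run of consecutive duplicate IDs into one 'txid:count' row."""
    rows = []
    for txid, run in groupby(txids):
        count = sum(1 for _ in run)
        # Rows that are not repeated stay unchanged, so they cost no extra bytes.
        rows.append(f"{txid}{DELIMITER}{count}" if count > 1 else txid)
    return rows


def expand_ledger(rows):
    """Reverse compress_ledger: repeat each 'txid:count' row count times."""
    txids = []
    for row in rows:
        txid, _, count = row.partition(DELIMITER)
        txids.extend([txid] * (int(count) if count else 1))
    return txids
```

Because hex transaction IDs never contain `:`, splitting on the delimiter is unambiguous in this sketch.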

baffo32 commented 8 years ago

More ideas: gzipping the whole file, storing transaction ids in binary rather than hex
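Both ideas can be sketched with the Python standard library. The helper names below are hypothetical, and the sketch assumes every transaction ID is a 64-character hex string (32 bytes once decoded).

```python
import gzip


def pack_txids(hex_txids):
    """Store each 64-character hex transaction ID as 32 raw bytes."""
    return b"".join(bytes.fromhex(txid) for txid in hex_txids)


def unpack_txids(blob, width=32):
    """Split a packed blob back into hex transaction IDs."""
    return [blob[i:i + width].hex() for i in range(0, len(blob), width)]


def gzip_ledger(text):
    """Gzip an ASCII ledger file held in memory."""
    return gzip.compress(text.encode("ascii"))
```

Hex spends two ASCII characters per byte, so the packed form is half the size before any further compression is applied.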

embiimob commented 8 years ago

I agree. In addition to the simple ledger compression algorithm above, I have considered gzipping both the ledger and main data files by default. It really wouldn't be much work and would help prevent bloat.

embiimob commented 8 years ago

Do you think keeping the ledger files human readable outweighs the compression benefits?

xloem commented 8 years ago

(edit: oops, I am baffo32 as well as xloem)

Unless you need it for debugging, I would ignore human readability for ledger files. But I would make their design very simple, so that others can make tools that work with them.

I don't think ASCII (which they are currently in) is the way to go when space is at such a premium. It would be smaller if binary data were written directly. A sample format might be [4 bytes magic token][16 byte block of data] [20 byte blocks of data ...]. Depending on the magic token, the data may be gzip compressed.
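A rough sketch of such a container, under stated assumptions: the two magic tokens `APT0` and `APTZ` are invented for illustration, and the 20-byte block size comes from the sample layout above (the first block holds the 4-byte token plus 16 bytes of data). None of this is an existing Apertus format.

```python
import gzip

MAGIC_RAW = b"APT0"   # hypothetical token: payload stored as-is
MAGIC_GZIP = b"APTZ"  # hypothetical token: payload is gzip-compressed


def encode_ledger(payload, compress=True):
    """Prefix the payload with a 4-byte magic token describing its encoding."""
    if compress:
        return MAGIC_GZIP + gzip.compress(payload)
    return MAGIC_RAW + payload


def decode_ledger(blob):
    """Read the magic token and undo whatever encoding it announces."""
    magic, body = blob[:4], blob[4:]
    if magic == MAGIC_GZIP:
        return gzip.decompress(body)
    if magic == MAGIC_RAW:
        return body
    raise ValueError(f"unknown magic token: {magic!r}")


def to_blocks(blob, size=20):
    """Split an encoded ledger into 20-byte blocks; the last may be shorter."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]
```

Keeping the token as the only header keeps the format simple enough for third-party tools to parse.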

Another idea: include the first 20 bytes of the txid (or a hash of the data!) as one of the outputs. This hash can then be looked up instead of the full 32-byte transaction ID, so each ledger file entry gets 12 bytes smaller. If a hash of the data is used, then ADD can be used as a content-addressable store, for which there are a lot of applications already.
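To illustrate the content-addressable variant, here is a small in-memory sketch. It assumes SHA-256 truncated to 20 bytes as the hash (the comment above does not name a hash function), and `ContentStore` is a hypothetical stand-in for the blockchain lookup.

```python
import hashlib

KEY_BYTES = 20  # 20-byte keys, versus 32 bytes for a full transaction ID


def content_key(data):
    """Derive a 20-byte key from the data itself (assumed: truncated SHA-256)."""
    return hashlib.sha256(data).digest()[:KEY_BYTES]


class ContentStore:
    """In-memory stand-in for looking data up by its content hash."""

    def __init__(self):
        self._items = {}

    def put(self, data):
        key = content_key(data)
        self._items[key] = data
        return key

    def get(self, key):
        data = self._items[key]
        # The key doubles as an integrity check on whatever comes back.
        if content_key(data) != key:
            raise ValueError("stored data does not match its key")
        return data
```

Identical data always maps to the same key, which is what makes deduplication and existing content-addressable tooling applicable.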

If you do rewrite the format, it would be great if you could produce some notes on what the new format is. Every couple of months I look into making some kind of Apertus library for languages other than C#. I'm still learning the formats myself.

xloem commented 8 years ago

Additionally, ledger files are currently stored in a hierarchy, such that the first ledger file is small and references subsequently larger and larger ledger files, to deal with the tight space limitations. This is a lot of metadata. Files could be stored in fewer transactions if the ledger files were stored as a linked list. The first ledger file lists the size of the file, optionally a table of contents, and the first few bytes of data. The last entry in the first ledger file is a reference to the second ledger file. The second ledger file stores just data, and its last entry is a reference to the third ledger file, and so on. This would require fewer total transactions.
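The linked-list layout can be sketched as follows. Everything here is an assumption made for illustration: a dict stands in for the blockchain, `fake_txid` (a SHA-256 of the file body) stands in for the transaction ID returned when a file is etched, an all-zero reference marks the last file, and the optional table of contents is omitted. Each file has to embed the ID of the next one, so this sketch writes the chain from the last file back to the first.

```python
import hashlib

REF_BYTES = 32            # size of a transaction-ID reference
END = bytes(REF_BYTES)    # assumed sentinel: an all-zero reference ends the chain


def fake_txid(body):
    """Stand-in for the transaction ID a real etching would return."""
    return hashlib.sha256(body).digest()


def write_chain(data, chunk_size, store):
    """Store data as linked files and return the ID of the first file."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    next_ref = END
    for index in range(len(chunks) - 1, -1, -1):
        body = chunks[index] + next_ref           # last entry points at the next file
        if index == 0:
            body = len(data).to_bytes(8, "big") + body  # first file lists the total size
        next_ref = fake_txid(body)
        store[next_ref] = body
    return next_ref


def read_chain(first_ref, store):
    """Follow the references from the first file and reassemble the data."""
    body = store[first_ref]
    total = int.from_bytes(body[:8], "big")
    body = body[8:]
    out = bytearray()
    while True:
        out += body[:-REF_BYTES]
        ref = body[-REF_BYTES:]
        if ref == END:
            break
        body = store[ref]
    if len(out) != total:
        raise ValueError("chain length does not match the recorded size")
    return bytes(out)
```

The only metadata per file is one trailing reference, plus the size field in the first file.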

embiimob commented 8 years ago

It's awesome to have another brain thinking about such things. We are hoping to eliminate the ledger file altogether, as a token (address) chain could be used to record all transactions instead of the ledger file tree. (This is how we are currently handling Keywords and Profile file changes.) We are really just waiting for the crypto community to catch up to what we are doing. Implementing this now would require a full re-scan of the blockchain for each file fetch, which would be painfully slow. When cryptocurrencies decide to add an optional address index, address transactions could be queried much faster without any ledger files! Once implemented, the URL to an etching or collection of etchings would simply be a wallet address! :D

embiimob commented 8 years ago

I added some images to the wiki that show the current file and message formatting. Notice the 0000 padding: it is in place to ensure that the data portions of files and messages always begin at byte 1 of an address. So if the file data is the same, it will generate the same addresses when etched, independent of the file name. This helps to locate duplicate copies of the same data. Also, the delimiters are currently picked at random from the subset listed. This was originally put in place to make it more difficult to filter out Apertus.io data from standard Bitcoin data. They are all human-readable characters that are usually not permitted in most file naming strategies.
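The effect of the padding can be illustrated with a simplified sketch. The real layout is the one shown in the wiki images; here the header contents, the `|` delimiter, the `0` padding character, and the 20-byte payload per address are all assumptions chosen only to show why aligned data produces the same blocks regardless of the file name.

```python
BLOCK = 20   # assumed bytes of payload carried per address
PAD = b"0"   # assumed padding character, after the "0000" padding described above


def build_payload(header, data):
    """Pad the header to a block boundary so the data starts a fresh block."""
    padding = PAD * (-len(header) % BLOCK)
    return header + padding + data


def data_blocks(header, data):
    """Return only the blocks that carry file data."""
    payload = build_payload(header, data)
    start = len(header) + (-len(header) % BLOCK)
    return [payload[i:i + BLOCK] for i in range(start, len(payload), BLOCK)]
```

With the header padded this way, two etchings of the same data under different file names yield identical data blocks, which is what allows duplicate copies to be located.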