farzadbekran opened this issue 2 years ago
Thanks for the detailed writeup. I'll try to reproduce this and pin down the leak.
This is profiling info for starting up the server with a db containing an arbitrary relation created with these tutd commands:
createarbitraryrelation a {a Int, b Text} 200000-200000
:commit
From what I understand so far, most of the time and memory are spent in ProjectM36.DatabaseContext.hashBytes while calculating the Merkle hash.
Allocation FlameGraph:
Time FlameGraph:
Heap Map:
Raw .prof file: gen-project-m36-server.prof.txt
Also useful:
10,090,770,848 bytes allocated in the heap
45,567,033,064 bytes copied during GC
481,265,440 bytes maximum residency (142 sample(s))
7,321,656 bytes maximum slop
458 MB total memory in use (0 MB lost due to fragmentation)
                     Tot time (elapsed)  Avg pause  Max pause
Gen  0      9520 colls,     0 par    0.842s   0.850s     0.0001s    0.0012s
Gen  1       142 colls,     0 par   53.143s  53.589s     0.3774s    0.6089s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.004s elapsed)
MUT time 19.728s (887.275s elapsed)
GC time 26.616s ( 26.975s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 27.370s ( 27.464s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 73.715s (941.720s elapsed)
Alloc rate 511,500,784 bytes per MUT second
Productivity 63.9% of total user, 97.1% of total elapsed
Thanks for the report! I've been meaning to play with BangPatterns. I suspect that we have a lot of unnecessary laziness in the Atom definitions. I'll start there and see how the profiles change.
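For illustration, here is a hedged sketch of the kind of strictness annotations meant above. The constructor names mimic, but are not, Project:M36's actual Atom type; the point is that strict fields and bang patterns prevent thunk chains from accumulating:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Illustrative sketch only, not Project:M36's real definitions.
-- Strict fields (the !s) evaluate the stored value at construction time
-- instead of storing a thunk inside the constructor.
data Atom
  = IntAtom !Int
  | TextAtom !String
  deriving (Eq, Show)

-- A bang pattern forces the accumulator on every step, avoiding a
-- left-nested chain of (+) thunks over a long tuple list.
sumInts :: [Atom] -> Int
sumInts = go 0
  where
    go !acc (IntAtom n : rest) = go (acc + n) rest
    go !acc (_ : rest)         = go acc rest
    go !acc []                 = acc
```

Whether this helps in practice depends on where the thunks actually build up, which is what the profiles should reveal.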
Just for the record, I think that the original bug opened here is resolved by f1fe611e1bc5d08f2e2b32695169f6aa826cc9f3 which fixes an issue whereby the disconnected transaction used to take a long tuple list forward with it.
We should still investigate how to reduce memory usage further.
Currently, Project:M36 clients validate the transaction graph via the merkle hashes at connection time. How would you feel about offering the option for clients to skip the validation for trusted Project:M36 servers?
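To make the proposal concrete, here is a hypothetical sketch of what a client-side opt-out could look like. `ConnectionInfo`, `MerkleValidation`, and all field names are illustrative, not the actual Project:M36 client API:

```haskell
-- Hypothetical types only; not the real Project:M36 connection API.
data MerkleValidation = ValidateOnConnect | SkipValidation
  deriving (Eq, Show)

data ConnectionInfo = ConnectionInfo
  { serverHost       :: String
  , serverPort       :: Int
  , merkleValidation :: MerkleValidation  -- SkipValidation for trusted servers
  } deriving (Show)

-- A client that trusts a local server could opt out of graph validation
-- at connection time.
trustedLocal :: ConnectionInfo
trustedLocal = ConnectionInfo "127.0.0.1" 6543 SkipValidation
```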
That would be nice, but doesn't the server calculate the hashes on every transaction, or is that not the case? Also, what would be the risk of skipping it, if it's not calculated on every transaction?
Indeed, it's a fair question. The goal of the Merkle hashing is the same as in other blockchain technologies: a verifiable trail. Many databases don't have such a feature, so it's natural to ask whether it's necessary. I have struggled with the requirements myself, so I am open to making Merkle hashing optional, with the caveat that doing so would close the door to features such as:
Basically, as in a blockchain, these are features that allow a third party to place less trust in the database server.
But I can certainly understand that someone would be willing to trade the merkle validation for some performance improvement.
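As a toy illustration of the verifiable-trail idea (using a stand-in FNV-1a-style fold, not Project:M36's actual hashBytes): each transaction hash folds in its parent's hash, so tampering with any ancestor changes every descendant hash a client recomputes.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64)

-- Stand-in hash (FNV-1a-style fold); a real implementation uses a
-- cryptographic hash. This only illustrates the chaining structure.
toyHash :: Word64 -> String -> Word64
toyHash = foldl' (\h c -> (h `xor` fromIntegral (ord c)) * 1099511628211)

-- Fold each transaction payload into the running hash, starting from a
-- fixed seed: the head hash depends on every payload before it.
chainHead :: [String] -> Word64
chainHead = foldl' toyHash 14695981039346656037
```

A client that remembers the expected head hash can detect a modification anywhere in the chain, e.g. chainHead ["a","b","c"] differs from chainHead ["a","X","c"].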
I get what you're saying, but there are a few things that come to mind regarding this:
In a nutshell, what I'm saying is that hashing alone is not enough to secure a publicly accessible database, and if it's not supposed to be public, then the clients are from the same organization, which means the clients have to trust the server anyway (it's not as if the clients can take over, unless we somehow make that a possibility).
It might help me get a better perspective if you could provide an example of the scenario you have in mind in which Merkle hashing is a clear benefit. I'm specifically interested in how the network would be configured server/client-wise (i.e., is there a central server, or does each client act as a server with access to the db files?).
I think this is an interesting debate. In "Out of the Tarpit", the authors make a strong case that the database and application server should be one entity. This removes hurdles, for example the type-impedance mismatch between database and app server. With Haskell server-side scripting, this unification should be possible today, and I wish to make further progress toward making databases public with security improvements, as you suggested.
You are correct that in today's architectural model, databases are centralized software, completely managed and trusted as a singular source of truth. However, that often makes them a single point of failure, too, even with high-availability features like replication and failover; for example, if the database is overwhelmed with requests.
With this in mind, Project:M36 has been making strides towards a distributed system more akin to git, whereby a database can contain a subset of another database or selectively choose to replicate branches, etc. Therefore, no singular database is necessarily authoritative, so we need some tools to determine what state "somewhat trustworthy" remote databases are in; for example, to determine common ancestors in the transaction stream.
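To make the common-ancestor point concrete, here is a minimal sketch under a simplifying assumption (acyclic, single-parent history; Project:M36's real transaction graph is a richer DAG):

```haskell
import Data.List (intersect)

type TxnId = Int

-- Ancestry of a transaction, newest first, given (child, parent) pairs.
-- Assumes an acyclic, single-parent history for illustration.
ancestry :: [(TxnId, TxnId)] -> TxnId -> [TxnId]
ancestry parents t = t : maybe [] (ancestry parents) (lookup t parents)

-- Nearest common ancestor of two branch heads: the first id on a's
-- ancestry walk that b's walk also contains.
commonAncestor :: [(TxnId, TxnId)] -> TxnId -> TxnId -> Maybe TxnId
commonAncestor parents a b =
  case ancestry parents a `intersect` ancestry parents b of
    (x:_) -> Just x
    []    -> Nothing
```

For example, with parents [(5,3),(4,3),(3,2),(2,1)], heads 5 and 4 share the ancestor 3; with disjoint histories there is no common ancestor at all, which is exactly the case a remote-sync tool must detect.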
I think it would still be fine to offer the option to disable hashing, both on server startup and client side connection.
In 54c822152c71efe71e9e9834a90129cf6ee46798, I fixed an unnecessary bytestring strictness when writing the database to disk.
In 0b008bab8ac13f000b41e4ad336a55228787535f, I added BangPatterns to the most common Atoms, which reduces data-structure indirection and thunks. In the 200,000-tuple arbitrary relation, this cuts memory usage by 10%.
There is probably other low-hanging fruit to find.
Given your additional notes, to implement something like git I think these are required:
* Branches should support public/private access levels, possibly with a set of authorized clients to make pull requests
* Each client should only be able to run transactions in their own branches
* Each client should have a set of public/private keys that are used to sign the transactions
* There should be a method to notify the clients about changes in the master branch they use (the current notification mechanism should suffice?)
* Some form of pull request mechanism for clients (for merging changes from their branch into master)
I think if we manage to get this to work in a smooth, high performance way, the results should be ground-breaking! I know these are easier said than done, but we can try! Let me know what you think, this is too interesting!
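As a toy sketch of the signing item above (a stand-in keyed hash; a real implementation would need actual public-key signatures such as Ed25519):

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64)

-- Toy "signature": an FNV-1a-style fold over key ++ payload. This is NOT
-- cryptographically secure; it only illustrates the accept/reject flow.
toySign :: String -> String -> Word64
toySign key payload =
  foldl' (\h c -> (h `xor` fromIntegral (ord c)) * 1099511628211)
         14695981039346656037
         (key ++ payload)

-- The server would accept a transaction only if the signature matches the
-- key registered for that client.
verify :: String -> String -> Word64 -> Bool
verify key payload sig = toySign key payload == sig
```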
Indeed! I am trying to execute an incremental plan towards this where I think the actual design and architecture aspects are most critical, so I have pushed the security components that you mentioned to the final phase. It should be easier to test the merging/pulling systems in a semi-trusted environment.
Keep in mind that some of the elements you mentioned above are actually features of a centralized GitHub/GitLab-like architecture rather than of git alone. For example, naked git does not support pull/merge requests. I think such features would be nice eventually, but I would like to experiment with the decentralized aspects first.
I imported a relation with 17 attributes and ~200,000 tuples. Memory usage jumps from ~2GB to ~4GB after committing, quitting, and re-entering, and typing in TutorialD becomes slow, which makes TutorialD hard to use. Meanwhile, the db takes only ~100MB of disk space.
So I think a disable option would be very helpful for people trying project-m36.
There is likely some low-hanging fruit in further defragmenting the heap. Also, we plan to implement finite relations' tuple stores using streamly (currently a linked list) to take advantage of more parallelism; that could make many operations that run off disk-stored tuples execute in constant memory.
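The constant-memory idea can be sketched even without streamly: when tuples are produced lazily and consumed by a strict fold, only the accumulator stays resident rather than the whole tuple list. This is a simplification of what a streamly-backed tuple store would enable, not its actual implementation:

```haskell
import Data.List (foldl')

-- Lazily generated "tuples": produced on demand; the whole list is never
-- in memory at once as long as nothing else retains its head.
tuples :: [(Int, String)]
tuples = [ (i, "row") | i <- [1 .. 200000] ]

-- Strict left fold: each step forces the accumulator, so the already
-- consumed prefix of the stream can be garbage-collected immediately.
sumFirstColumn :: [(Int, String)] -> Int
sumFirstColumn = foldl' (\acc (a, _) -> acc + a) 0
```

The same shape applies to tuples streamed from disk: a strict fold over a lazily produced stream runs in space proportional to the accumulator, not the relation.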
I use the code below to add 200,000 tuples to a test relvar. After execution, project-m36-server uses about 1GB of RAM. Once I quit the server and start it again, it uses about 2.4GB of RAM. (Sorry about the code wall; I moved everything relevant to the same file for ease of testing.)
Test repo is here
To reproduce, start ghci in the DB module, load it, and run:
After that, if you check, you'll see that project-m36-server is using about 1GB of RAM. If you close the server and start it again, it will be using about 2.4GB.