agentm / project-m36

Project: M36 Relational Algebra Engine
The Unlicense

Possible memory leak with test data #300

Open farzadbekran opened 2 years ago

farzadbekran commented 2 years ago

I use the code below to add 200,000 tuples to a test relvar. After execution, project-m36-server uses about 1GB of RAM. Once I quit the server and start it again, it uses about 2.4GB of RAM.

(Sorry about the code wall; I moved everything relevant to the same file for ease of testing.)

Test repo is here

To reproduce, start ghci in the DB module, load it and run:

> genTestData
Right ()
> getTestData
Right 100

After that, if you check, you'll see that project-m36-server is using about 1GB of RAM. If you close the server and start it again, it will be using about 2.4GB.

agentm commented 2 years ago

Thanks for the detailed writeup. I'll try to reproduce this and pin down the leak.

farzadbekran commented 2 years ago

This is profiling info for starting up the server with a db containing an arbitrary relation created with this tutd command:

createarbitraryrelation a {a Int, b Text} 200000-200000
:commit

From what I understand so far, most of the time and memory is spent in ProjectM36.DatabaseContext.hashBytes while calculating the merkle hash.
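To illustrate why the hashing dominates the profile, here is a toy sketch in Haskell. This is not the real ProjectM36.DatabaseContext.hashBytes (it uses FNV-1a over Strings purely for illustration), but it shows the cost model: a Merkle-style transaction hash covers the parent hash plus every tuple, so committing a 200,000-tuple relvar means hashing all 200,000 tuples.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64)

-- Toy stand-in for a hashBytes-style function (FNV-1a over a String);
-- the real implementation differs, this only models the cost.
toyHash :: String -> Word64
toyHash = foldl' step 14695981039346656037
  where
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- A Merkle-style transaction hash chains the parent hash with a hash
-- of every tuple, so commit cost grows linearly with relation size.
transactionHash :: Word64 -> [String] -> Word64
transactionHash parentHash tuples =
  toyHash (show parentHash ++ concatMap (show . toyHash) tuples)
```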

Allocation FlameGraph: 0-alloc-project-m36-server

Time FlameGraph: 0-time-project-m36-server

Heap Map: project-m36-server-heap

Raw .prof file: gen-project-m36-server.prof.txt

Also useful:

  10,090,770,848 bytes allocated in the heap
  45,567,033,064 bytes copied during GC
     481,265,440 bytes maximum residency (142 sample(s))
       7,321,656 bytes maximum slop
             458 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9520 colls,     0 par    0.842s   0.850s     0.0001s    0.0012s
  Gen  1       142 colls,     0 par   53.143s  53.589s     0.3774s    0.6089s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.004s elapsed)
  MUT     time   19.728s  (887.275s elapsed)
  GC      time   26.616s  ( 26.975s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time   27.370s  ( 27.464s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   73.715s  (941.720s elapsed)

  Alloc rate    511,500,784 bytes per MUT second

  Productivity  63.9% of total user, 97.1% of total elapsed

agentm commented 2 years ago

Thanks for the report! I've been meaning to play with BangPatterns. I suspect that we have a lot of unnecessary laziness in the Atom definitions. I'll start there and see how the profiles change.
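For reference, a minimal sketch of the kind of laziness in question (the constructors below are illustrative, not the real Atom definitions):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Illustrative only: not the real ProjectM36.Base.Atom constructors.
data LazyAtom   = LazyIntAtom Int     -- field may hold an unevaluated thunk
data StrictAtom = StrictIntAtom !Int  -- bang forces the Int at construction

-- Without the bangs, this fold would build a 200,000-deep chain of (+)
-- thunks before anything is demanded; the banged accumulator evaluates
-- as it goes, keeping residency constant.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs
```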

agentm commented 2 years ago

Just for the record, I think that the original bug opened here is resolved by f1fe611e1bc5d08f2e2b32695169f6aa826cc9f3, which fixes an issue whereby the disconnected transaction carried a long tuple list forward with it.

We should still investigate how to reduce memory usage further.

agentm commented 2 years ago

Currently, Project:M36 clients validate the transaction graph via the merkle hashes at connection time. How would you feel about offering the option for clients to skip the validation for trusted Project:M36 servers?

farzadbekran commented 2 years ago

That would be nice, but doesn't the server calculate the hashes on every transaction, or is that not the case? Also, what would be the risk of skipping it, if it's not calculated on every transaction?

agentm commented 2 years ago

That would be nice, but doesn't the server calculate the hashes on every transaction, or is that not the case? Also, what would be the risk of skipping it, if it's not calculated on every transaction?

Indeed, it's a fair question. The goal of the Merkle hashing is the same as in other blockchain technologies: a verifiable trail. Many databases don't have such a feature, so it's natural to ask whether it's necessary. I have struggled with the requirements myself, so I am open to making Merkle hashing optional, with the caveat that doing so would close the door to certain features.

Basically, as in a blockchain, these are features that allow a third party to place less trust in the database server.

But I can certainly understand that someone would be willing to trade the merkle validation for some performance improvement.

farzadbekran commented 2 years ago

I get what you're saying, but a few things come to mind regarding this.

In a nutshell, what I'm saying is: hashing alone is not enough to secure a publicly accessible database, and if the database is not meant to be public, then the clients belong to the same organization, which means they have to trust the server anyway (it's not as if the clients can take over, unless we deliberately make that possible).

It might help me get a better perspective if you provide an example of the scenario you have in mind in which having Merkle hashing is a positive. I'm specifically interested in how the network would be configured server/client-wise (i.e. is there a central server, or does each client act as a server with access to the db files?).

agentm commented 2 years ago

I think that this is an interesting debate. In "Out of the Tarpit", the authors make a strong case that the database and application server should be one entity. This removes hurdles, for example with regard to the type impedance mismatch between database and app server. With Haskell server-side scripting, this unification should be possible today, and I wish to make further progress in making databases public with security improvements, as you suggested.

You are correct that in today's architectural model, databases are centralized software, completely managed and trusted as a singular source of truth. However, that often makes them a single point of failure, too, even with high-availability features like replication and failover; for example, when the database is overwhelmed with requests.

With this in mind, Project:M36 has been making strides towards a distributed system more akin to git, whereby a database can contain a subset of another database or selectively choose to replicate branches, etc. Therefore, no singular database is necessarily authoritative, so we need some tools to determine what state "somewhat trustworthy" remote databases are in; for example, to determine common ancestors in the transaction stream.

I think it would still be fine to offer the option to disable hashing, both at server startup and at client connection time.
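A hypothetical sketch of what such an opt-out could look like on the client side. None of these names exist in ProjectM36.Client today; this is only a shape for the discussion.

```haskell
-- Hypothetical option, not part of the current ProjectM36.Client API.
data MerkleValidationPolicy
  = ValidateMerkleHashes  -- current behaviour: verify the graph on connect
  | TrustServer           -- skip verification for trusted servers
  deriving (Eq, Show)

-- The client would pick a policy at connection time; transaction-graph
-- verification would only run under ValidateMerkleHashes.
shouldValidate :: MerkleValidationPolicy -> Bool
shouldValidate ValidateMerkleHashes = True
shouldValidate TrustServer          = False
```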

agentm commented 2 years ago

In 54c822152c71efe71e9e9834a90129cf6ee46798, I fixed an unnecessary bytestring strictness when writing the database to disk.

In 0b008bab8ac13f000b41e4ad336a55228787535f, I added BangPatterns to the most common Atoms, which reduces data structure indirection and thunks. For the 200,000-tuple arbitrary relation, this cuts memory usage by 10%.
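As a general pattern (a sketch of the idea, not the actual diff in that commit): converting a lazy serialization to a strict ByteString before writing materializes the whole database image in memory at once, whereas writing the lazy bytestring directly streams it chunk by chunk.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import System.IO (Handle)

-- Forcing the lazy serialisation into one strict buffer allocates a
-- single contiguous copy of the entire image in RAM before writing.
writeAllAtOnce :: Handle -> BL.ByteString -> IO ()
writeAllAtOnce h = BS.hPut h . BL.toStrict

-- Writing the lazy bytestring directly streams its chunks, so memory
-- use stays bounded by the chunk size.
writeChunked :: Handle -> BL.ByteString -> IO ()
writeChunked = BL.hPut
```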

There is probably other low-hanging fruit to find.

farzadbekran commented 2 years ago

Given your additional notes, to implement something like git I think these are required:

* Branches should support public/private access levels, possibly with a set of authorized clients to make pull requests

* Each client should only be able to run transactions in their own branches

* Each client should have a set of public/private keys that are used to sign the transactions

* There should be a method to notify the clients about the changes in the master branch they use (current notification mechanism should suffice?)

* Some form of pull request mechanism for clients (for merging changes from their branch into master)

I think if we manage to get this to work in a smooth, high-performance way, the results should be ground-breaking! I know these are easier said than done, but we can try! Let me know what you think; this is too interesting!

YuMingLiao commented 2 years ago

I imported a relation with 17 attributes and ~200,000 tuples. Memory usage jumps from ~2GB to ~4GB after committing, quitting, and re-entering, and typing in TutorialD becomes slow, which makes TutorialD hard to use. Meanwhile, the db takes only ~100MB of disk space.

So, I think a disable option would be very helpful for trying out project-m36.

agentm commented 2 years ago

Given your additional notes, to implement something like git I think these are required:

* Branches should support public/private access levels, possibly with a set of authorized clients to make pull requests

* Each client should only be able to run transactions in their own branches

* Each client should have a set of public/private keys that are used to sign the transactions

* There should be a method to notify the clients about the changes in the master branch they use (current notification mechanism should suffice?)

* Some form of pull request mechanism for clients (for merging changes from their branch into master)

I think if we manage to get this to work in a smooth, high-performance way, the results should be ground-breaking! I know these are easier said than done, but we can try! Let me know what you think; this is too interesting!

Indeed! I am trying to execute an incremental plan towards this. I think the actual design and architecture aspects are most critical, so I have pushed the security components you mentioned to the final phase. It should be easier to test the merging/pulling systems in a semi-trusted environment.

Keep in mind that some of the elements you mentioned above are actually features of a centralized github/gitlab-like architecture rather than git alone. For example, naked git does not support pull/merge requests. I think such features would be nice eventually, but I would like to experiment with the decentralized aspects first.

agentm commented 2 years ago

I imported a relation with 17 attributes and ~200,000 tuples. Memory usage jumps from ~2GB to ~4GB after committing, quitting, and re-entering, and typing in TutorialD becomes slow, which makes TutorialD hard to use. Meanwhile, the db takes only ~100MB of disk space.

So, I think a disable option would be very helpful for trying out project-m36.

There is likely some low-hanging fruit in further defragmenting the heap. Also, we plan to implement finite relations' tuple stores using streamly (currently a linked list) to take advantage of more parallelism, and that could make many operations over disk-stored tuples run in constant memory.