bugs181 / TimeGraph

TimeGraph library for the Gun bounty
MIT License

Requirements #1

Open amark opened 5 years ago

amark commented 5 years ago

First of all - great work, excited to have you doing this.

Secondly, I'll be pretty nitpicky because I have very specific requirements/goals:

The docs API looks pretty cool / close to ideal.

Not using state may lead to bugs - it should be default, with option to replace.

Any data not in a timegraph (what you are calling timeplot) structure defeats the point/purpose of scale.

Unless I'm misunderstanding how you can achieve performance without that structure (please correct me if I'm wrong), this is why I noted that starting from lib/time's basis is important to begin with.

The lib/time timegraph supports resolution down to 1ms "buckets", not 1min; this is intentional.

Twitter does 8K tweets/sec. A 1min bucket would mean nearly half a million records that have to be loaded, potentially 24MB, which is untenable.

Timegraph is not designed to top out at Twitter-level scale; Twitter is only 1 site! The decentralized web will be all sites, people, IoT, and machines combined! Assume there may be at least 10K ~ 100K records in one bucket/resolution.
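To make the bucket arithmetic concrete, here is a minimal sketch (a hypothetical `timeKey` helper, not the lib/time API) of a fixed-resolution index key down to 1ms:

```javascript
// Sketch (hypothetical helper): split a UTC timestamp into fixed
// components down to 1ms resolution, so each record lands in a small,
// bounded bucket.
function timeKey(root, utcMs) {
  const d = new Date(utcMs);
  const pad = (n, w) => String(n).padStart(w, '0');
  return [
    root,
    d.getUTCFullYear(),
    pad(d.getUTCMonth() + 1, 2),
    pad(d.getUTCDate(), 2),
    pad(d.getUTCHours(), 2),
    pad(d.getUTCMinutes(), 2),
    pad(d.getUTCSeconds(), 2),
    pad(d.getUTCMilliseconds(), 3)
  ].join(':');
}

// At ~8K writes/sec a 1ms bucket holds ~8 records,
// while a 1min bucket would hold ~480,000 (8000 * 60).
console.log(timeKey('chatroom', Date.UTC(2018, 11, 27, 3, 52, 7, 351)));
// → "chatroom:2018:12:27:03:52:07:351"
```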

I like .time(start, end) with chaining. I also think some API for "scrolling" by amount (.first(X number)) is good. Do note that those "amounts" might also need to be a range like the time start/end: in a chatroom, for example, on load you probably want the "last 20 chats", then as you scroll the viewport might want the "last 10 ~ 30" or "last 20 ~ 30", so there is always a scrolling buffer. So the API would also need to handle a start/end or first/last range that updates dynamically. So far I like your thinking on this direction with chaining more than the original timegraph.
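The scrolling-range idea can be sketched like so (hypothetical `TimeIndex`/`last` names; the real chain API may differ):

```javascript
// Minimal sketch of the range idea: "last 20" on load, then
// "last 10 ~ 30" as the user scrolls, so there is always a buffer
// on either side of the viewport.
class TimeIndex {
  constructor(entries) {
    // entries: [{ soul, state }] sorted ascending by state (UTC ms)
    this.entries = entries;
  }
  // last(n) → newest n items; last(from, to) → items `from`..`to`
  // counting back from the newest, e.g. last(10, 30).
  last(from, to) {
    const e = this.entries;
    if (to === undefined) return e.slice(-from);
    return e.slice(e.length - to, e.length - from);
  }
}

const idx = new TimeIndex(
  Array.from({ length: 100 }, (_, i) => ({ soul: 'chat' + i, state: i }))
);
console.log(idx.last(20).length);     // → 20 (initial load)
console.log(idx.last(10, 30).length); // → 20 (scrolled window)
```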

Please look into what you call timeplotting as the more important thing though, at scale & scrolling.

Thank you! My ETA is still assuming this will take 1 month, just FYI so don't overpromise to yourself!!

bugs181 commented 5 years ago

Hey @amark sorry I missed this! Github didn't notify me. Hopefully I can clear some air here, as I think we're both on the same page.

The state thing was born out of the ideology that state should not be a dependency nor a requirement. While Gun uses UTC for its state with a lexicon, there may come a time when people want to implement their own deterministic state machine. Leaving the state open to further uses would be ideal, and I don't see much of an impact on performance going this direction. At most, you would see O*2 for .put without the master index. Currently the .put method fits into 40 lines of code and can probably be improved even further. Unless I'm simply being daft here, you would still need an O*n lookup to read the state, which would be the same performance as using a separate index.
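The "state as a default with the option to replace" idea might look like this minimal sketch (hypothetical `makeTimeGraph`/`state` option names, not Gun's actual internals):

```javascript
// Sketch: default to a UTC-ms state function, but let callers swap
// in their own deterministic clock.
function makeTimeGraph(opts = {}) {
  const state = opts.state || (() => Date.now()); // default: UTC ms
  return {
    put(soul, data) {
      // attach the state at write time
      return { soul, data, state: state() };
    }
  };
}

// Deterministic replacement clock, e.g. a Lamport-style counter:
let tick = 0;
const tg = makeTimeGraph({ state: () => ++tick });
console.log(tg.put('a', {}).state); // → 1
console.log(tg.put('b', {}).state); // → 2
```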

The TimePlotting structure was just a generalized idea so I could know if I was going in the right direction. It would be very easy to scale this to be as granular as we want. I've improved on the original algorithm for the TimeGraph inserts. As long as the .get and .put head don't affect performance, you'll be getting O*2 (two .puts without the master index).

If you'd like me to run some quick benchmarking, I'd need a good idea of what you want me to benchmark. Some sample data would be great, and I'll give a comparison of the original versus this algo.

Thanks for the feedback on the API chaining. For the buffer, would you just want an in-memory array? It would only need to store the souls and timestamps to achieve the scrolling buffer. Adding one more would .pop the oldest off the stack. I'm not sure if there are any perf hits with doing it this way.
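A minimal sketch of that in-memory scrolling buffer (hypothetical `ScrollBuffer` shape; only souls and timestamps are kept):

```javascript
// Sketch: keep only { soul, state } pairs, cap the size, and drop
// the oldest entry when a new one arrives.
class ScrollBuffer {
  constructor(limit) {
    this.limit = limit;
    this.items = []; // oldest first
  }
  push(soul, state) {
    this.items.push({ soul, state });
    if (this.items.length > this.limit) this.items.shift(); // drop oldest
  }
}

const buf = new ScrollBuffer(3);
['a', 'b', 'c', 'd'].forEach((s, i) => buf.push(s, i));
console.log(buf.items.map(e => e.soul)); // → ['b', 'c', 'd']
```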

1 month seems reasonable. I expect to have a working prototype implementation by New Year. Go big or go home, right? I try not to overpromise too much, but I do it to push myself harder.

Thanks for taking the time to write the response.

amark commented 5 years ago

Not using state can lead to bugs though; NTS can do clock sync where UTC would be wrong.

Could you clarify O*2 more? If O is items-in-timegraph/list, then this would be very slow. The original lib/time is O(1) on writes & reads, meaning performance scales whether there are 10 items in the list or millions or billions. Worst-case performance is relative to the month/day/hours ratio, which is constant, rather than to the data-set size, which is ever growing.
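The O(1) claim can be illustrated with a toy index (illustrative only, not the lib/time code): every write touches a constant number of index levels no matter how large the data set grows.

```javascript
// Sketch: each insert walks the same fixed-depth path
// (year → month → day → hour → minute → second → ms), so the work
// per write is constant regardless of total items stored.
const graph = {};
function indexWrite(utcMs, soul) {
  const d = new Date(utcMs);
  const parts = [
    d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
    d.getUTCHours(), d.getUTCMinutes(), d.getUTCSeconds(),
    d.getUTCMilliseconds()
  ];
  let key = 'time', touched = 0;
  for (const p of parts) {           // depth is constant: 7 levels
    key += ':' + p;
    graph[key] = graph[key] || {};
    touched++;
  }
  graph[key][soul] = true;           // finally record the soul
  return touched; // always 7, whether 10 items exist or 10 million
}

console.log(indexWrite(Date.now(), 'soul1')); // → 7
```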

That is why I ask that it be based off the original algorithm with implementation improvements; I don't want to waste time re-solving the problem, just making it more featureful and fixing bugs. I leave creative solutions to OSS/volunteer work, but I want specific prototypes of mine improved as paid work (if you read the science behind the bounty page, I'm trying to reward menial tasks like maintenance that most people don't want to do).

Again, please forgive me for rapid/brief behavior. Keep up the awesomeness! :) :)

bugs181 commented 5 years ago

It's interesting, because after looking at how lib/time is implemented, I was pretty much ending up with the same result as you - just not as performant, since I hadn't gotten around to that part yet.

The difference is that you're using what I assume is the root node's .get name for creating your index, whereas I'm using the soul.

Original:

  chatroom:2018:12:27:03:52:07:351: [object Object] {
    _: [object Object] { ... },
    jq62nuev01awMOvXd4wwTfH: [object Object] { ... }
  },
  jq62nh5fKk0FvvXXyPW8: [object Object] {
    _: [object Object] { ... },
    alias: "test",
    message: "hello"
  }

Improved:

  timepoint/jq62nh5fKk0FvvXXyPW8/State [object Object] {
    _: [object Object] { ... },
    jq62nh5fKk0FvvXXyPW8: State
  },
  jq62nh5fKk0FvvXXyPW8: [object Object] {
    _: [object Object] { ... },
    alias: "test",
    message: "hello"
  }
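The "Improved" key shape above could be built with a trivial helper (hypothetical name; the state value shown is illustrative):

```javascript
// Sketch: derive the index key from the node's soul plus its state,
// so the lookup does not depend on the root node's .get name.
function timepointKey(soul, state) {
  return 'timepoint/' + soul + '/' + state;
}

const soul = 'jq62nh5fKk0FvvXXyPW8';
console.log(timepointKey(soul, 1545882727351));
// → "timepoint/jq62nh5fKk0FvvXXyPW8/1545882727351"
```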

I'm still not convinced that State will work in this scenario, although if it allows O(1) lookups, it would be more performant than splitting/parsing the timestamps.

By O*2, I was referring to the number of internal .puts. It appears you're batching your .puts (using a single object being built off) to create the time indexes. I did a simple test and gained at least 70% performance when I grouped them into a single .put. It appears your .put internally would be O*8 (as you said, perf is related to granularity: year, month, etc.).
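The batching idea might be sketched like this (illustrative, not the actual Gun internals): all granularity levels are assembled on one object so a single .put covers them.

```javascript
// Sketch: build every time-index entry on one object and write it
// with a single put, instead of one put per granularity level.
function buildBatchedIndex(soul, utcMs) {
  const d = new Date(utcMs);
  return {                      // one object → one .put
    year: d.getUTCFullYear(),
    month: d.getUTCMonth() + 1,
    day: d.getUTCDate(),
    hour: d.getUTCHours(),
    minute: d.getUTCMinutes(),
    second: d.getUTCSeconds(),
    ms: d.getUTCMilliseconds(),
    [soul]: utcMs               // link back to the node itself
  };
}

const batch = buildBatchedIndex('soulA', Date.UTC(2018, 11, 27, 3, 52, 7, 351));
console.log(Object.keys(batch).length); // → 8 levels written in one put
```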

With that in mind, I'll be switching over to state to keep with the flow. It didn't take me long to 're-solve' the problem and basically re-invent the wheel; however, figuring it out on my own has several advantages for making the rest of the API work, rather than building on top of a black box.

I'll read the bounty resource you mentioned. I don't mind doing something that my boss tells me to ;) - it's how things get done.

No apologies necessary for the brevity of posts. I understand it's busy busy around the holidays. Just enjoy family time!

bugs181 commented 5 years ago

Update on my end - still active. Improving upon the original lib/time implementation, I was still able to create a separate index. The newest version uses state and is compatible with the .put and .set chain API. I still need to validate that the whole graph is being built, but prototype testing gives these new results (each measurement is 1000 inserts; see below).

put: 246.110ms
set: 339.539ms
time original: 2565.808ms
timegraph.put: 1190.448ms
timegraph.time: 2485.516ms

It should be noted that timegraph.time is a shim for the original API and creates new souls for every .put. What's interesting here is that it seems to be slightly faster than the original. Perhaps because the soul checks aren't needed in the nested function. It's possible this could be due to an incomplete graph. Will need to validate further but wanted to publish my results.

The benchmarking I'm using is just the internal console time API in node.

function measure(label) {
  console.time(label)
  let counter = testNum // number of acks expected (testNum from outer scope)

  // Passed as the callback to every .put; the timer stops only once
  // each ack has come back.
  return function callback() {
    counter--
    if (counter === 0)
      console.timeEnd(label)
  }
}

And used like so:

let people = app.get('put')
let i = 0
let testNum = 1000
let cb = measure('put')

for (; i < testNum; i++) {
  people.put({ name: 'Mark' + i }, cb)
}

It ensures that all .puts behind the scenes have had all their callbacks called (ack) for accurate measurement.

bugs181 commented 5 years ago

Newest performance measurements from last commit:

put: 205.153ms
set: 328.205ms
time original: 2289.214ms
timegraph.put: 985.777ms
timegraph.time: 1245.997ms

Everything is being indexed properly AFAICT. .set(data), .set(node), .put(data), .put(node). Started working on the streaming events. Would like some confirmation on the direction.

.on - Returns actual data (requires O(2) lookups: one for the timegraph/soul and the other for the node itself). Appears to be the same as your chat example.

.once - Same as above

.time(data) - Simply a proxy/shim for .set, to create new nodes for every insert.

.time(cb) - Uses O(1) lookup and provides the soul to the new item.

All of these will use the .range and other filters once available.
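The two-step lookup for .on/.once can be sketched with plain objects standing in for Gun graph reads (hypothetical key shape and data):

```javascript
// Sketch of the O(2) lookup: step 1 resolves the soul from the time
// index, step 2 resolves the node itself.
const timeIndex = { 'timepoint/1545882727351': 'jq62nh5fKk0FvvXXyPW8' };
const nodes = {
  jq62nh5fKk0FvvXXyPW8: { alias: 'test', message: 'hello' }
};

function readAt(state) {
  const soul = timeIndex['timepoint/' + state]; // lookup 1: index → soul
  return soul && nodes[soul];                   // lookup 2: soul → node
}

console.log(readAt(1545882727351)); // → { alias: 'test', message: 'hello' }
```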

amark commented 5 years ago

Anything you need from me?

bugs181 commented 5 years ago

Nope. All is good. Just rolling out a performance lib for Mocha tests. Thank you though!

bugs181 commented 5 years ago

Update on my end. Refactored the chaining API and I'm much happier with it. Will start adding the range code back piece by piece. This was a fairly large refactor.