Limitations of IndexedDB?

nolanlawson commented 8 years ago

Hi there, I just came across this repo due to a tweet from @jakearchibald. I noticed you're also the Lovefield author (hello from PouchDB :smiley:), so I'm wondering: what prompted this proposal? Clearly you have experience building a relational database on top of IndexedDB, so I'm wondering what in particular you found that IndexedDB was not capable of, and why it would require a new web API.

I did see your five points against IndexedDB listed in the current spec, but 3 of them seem potentially solvable in an updated IndexedDB v3 spec (schema update, quota management, and cross process). For the other 2 (memory management, indexing bottleneck), I'd be curious to know more about where exactly you hit these bottlenecks in IndexedDB, if it was limited to particular browsers, etc.

PouchDB is not very relational (closest we've gotten is relational-pouch), but looking over the spec of RDB, I don't personally see much that couldn't be implemented in IndexedDB as-is. If I were to take a stab at it myself, I would probably do:

one single objectStore
one single index
keys prefixed by "store" name, \u0000 as separator for lexicographic ordering (pretty easy to just disallow this character in store names)
secondary indexes managed in a similar way, with prefixes
relations managed in-document, with IDB transactions responsible for ensuring no orphans

Obviously this is a pretty brute-force approach to the problem, but my guess is that the benefits (schema migrations possible, getAll() can be used to fetch any number of records in one query, works in Safari and IE despite their broken implementations) would outweigh the detriments (single objectStore means IDB can't smartly allow concurrent access to unrelated "stores", key size will be quite large). There might be other drawbacks I haven't thought of.

Anyway, I just want to make sure we've fully exhausted the possibilities with IndexedDB, before we start inventing new web storage APIs. Every time I tell someone about browser storage, I feel like I have to preface my explanation with, "well, there was A, but it was deprecated for B, and now we have C, but it's not well-supported, and oh yeah, there's also D, E, and F..." If I have to go all the way to Z, then nobody's going to want to listen to me anymore. :stuck_out_tongue:

IndexedDB was designed to be low-level enough that you could write high-level APIs like the one proposed here. If it's failed in that regard, then we really need to have an honest conversation about it.

arthurhsu commented 8 years ago

This is a long thread and contains a lot of noises. If you want a quick summary, check FAQ.

Thanks for spending a lot of time writing the feedback/issue. My goal is to provide a small scale relational DB for web app. I created Lovefield to solve the issue believing Indexed DB can be used for that.

My work with Lovefield shows that it's possible to create something that works to a certain degree. If we want to talk about feature parity or performance benchmark with SQLite or Microsoft SQL Compact on traditional platforms, there are a bunch of things that need to be "fixed" in Indexed DB spec, JavaScript, here and there.

One thing always frustrates me is that how many years and working groups are we talking about here to get every fix needed so that we can shoehorn a relational DB into an object DB that was not designed nor created for that? Which major browser implementation does not have relational DB internally? Why create more structures on top of something that does not fit simply because IndexedDB standard is there? I believe this RDB proposal can be implemented more quickly and easily with either SQLite or Microsoft SQL compact instead of trying to "fix"/alternate existing IndexedDB and JavaScript. Existing small-scale SQL engines are mature and field proven, and the majority of the efforts to provide RDB capability are actually political (sorry for the term, but standardization is just another form of political procedure).

Oh yes JavaScript, how can I forget this beast. I need an easy way to measure the heap size of an object. I need an easy way to swap pages/objects from/to disk. Basically I need more control to do memory management. This is simply contradictory to what JavaScript is designed for and dent me hard when I try to make a relational query engine that uses least amount memory possible.

Indexed DB is at best a shoehorned back store for relational DB. Relational DB is designed around paged store with atomic writing, and that's why it uses complex B-Trees and other designs to make sure the I/O and memory can be most efficient. These are already researched and proved for decades. I don't care if "we've fully exhausted the possibilities with IndexedDB", because fundamentally it is not designed as a page store, and JavaScript is not designed for packing/unpacking structured data into/from blobs. IndexedDB is an object store, it has great potential for duplication / synchronization working with NoSQL technology. It's cool, sounds webby, but not what I'm looking for. I need efficient relational DB with the power of SQL or SQL-like queries, so that I can create applications that need the power of relational DB. Worst of all, everyone is FORCED to use IndexedDB because this is the only viable cross platform persistent storage. Even worse, it creates excuses for killing other competing standards like WebSQL or this one I'm proposing. (Don't get me wrong, WebSQL is also very flawed. I spare its corpse since it's dead.)

Think about it. IndexedDB needs a "v3". From developer's point of view, I don't see much value supporting such an unstable spec except that's my ONLY choice. From a browser vendor's point of view, WTF we have another version to implement and I still have beeping dozens of bugs outstanding? (Sorry Josh ... I back stabbed you a bit :D ) As you said, and I beg whoever has influence on the W3C standards committee, we really need to have an honest conversation.

Face it. Relational DB is a different kind. Simulating it using object DB just creates bad user/developer experiences. A divorce is needed. We should make IDB focused on what it is designed for, and create RDB separately.

This is the comment from Lovefield creator, a guy who built RDB over IndexedDB with his cube mate. Feel free to treat it as rant. It is. I used IndexedDB since 2012 and I'm a certified ranter.

nolanlawson commented 8 years ago

I enjoyed the rant! :) I've definitely seen what kind of performance gains can be had from working with SQLite directly (especially for highly relational data) vs IndexedDB. In my tests with PouchDB, I've found that WebSQL frequently outperforms IndexedDB in Chrome, even though PouchDB is nominally a NoSQL database (our underlying schema ended up being pretty relation-y; this may change with our next IndexedDB adapter, though).

Again, my main worries about RDB are the cognitive overhead for new web developers, and that we'll end up with yet another poorly-supported database API (caniuse.com/indexeddb always makes me sad). I also worry that ultimately the performance will not actually be much better, or that it will vary wildly across platforms.

It also seems a bit silly to me that RDB appears to basically be a sugar API for building up SQL queries to then pass to WebSQL. If we want to revive WebSQL, but get rid of the possibility of SQL injection, then why don't we just call a spade a spade and say we want WebSQL 2.0? ;)

Anyway, jokes aside, I still think it would be worthwhile to take a shot at implementing your proposal here in IndexedDB, if for no other reason than to demonstrate that IndexedDB is insufficiently performant to pull it off. It'd also be useful to have a polyfill if/when this proposal becomes a W3C standard. :)

arthurhsu commented 8 years ago

Lovefield is the polyfill of this proposal, or to be fair, 90% of it. So I know where the limit is and the chance of "shot". Again, it's doable, but we need to be honest as computer scientists. Shoehorn RDB on top of IDB is fundamentally flawed.

I'm not too worried about developers learning new things. You can check Microsoft LINQ, which has been there for a while. LINQ has been well perceived and this RDB API is actually very similar to LINQ. It's a sugared SQL and silently constraint the usage to simple transactions. I don't pitch RDB as "WebSQL 2.0" because WebSQL has not define what SQL capabilities should an engine support. This is not good. A W3C standard has to be crystal clear about what's expected and what algorithm will be used. Leaving these critical factors undefined makes it doom.

yathit commented 8 years ago

I does not see web developers ask for relational database. Most web developers want simple but powerful query database API (multiple AND, OR, ORDER, full text search). Currently only Lovefield library provide rich query API.

Column base schema definition with type is two steps backward from current IndexedDB spec.

arthurhsu commented 8 years ago

As a platform, if web is going to attract more people to use/develop on it, it cannot ignore the grand popularity of relational database. Think about it, why Android and iOS both have relational database built-in as part of the API?

I don't know how many developers yathit represents, but that's not what I see when I promote Lovefield. Web developers are using server-side solutions today (e.g. PHP/MySQL is one popular way) to solve their relational query needs, because there's nothing on the client side.

NOTHING.

"Most web developers want simple but powerful query database API", yep, because there's no way to do it. So developers want to humbly ask something. The fact that Lovefield is the only library providing rich query API is also reinforcing the issue. IndexedDB makes it really hard to have someone build the library five years after its commencement.

Also, Lovefield is column base schema definition with type. I'll be very interested to know if there is a way to implement a generic query engine on a mobile browser, featuring efficient and powerful data queries without type information. I've looked for that unicorn for many years and could not find it.

freshp86 commented 8 years ago

How is the database engine going to optimize the ANDs and the ORs and the predicates presented in a query, if it has no (or minimal) knowledge about the data itself?

I am detecting a contradiction here. A query API is meant to query structured data, where as schema-less databases favor unstructured data (storing blobs, serialized JSON objects, protobufs etc). The advantages of an object-oriented DB are well documented and so are the disadvantages. I am failing to see how the advantages are relevant within a web application running in a browser, where as I have been hit by the disadvantages multiple time as a web developer.

Shoehorning a query API for structured data on top of an object-oriented DB seems like a step in the wrong direction, forcing developers to rely on server solutions to perform their rich queries in SQL-esque frameworks they are already familiar with.

Last thing to note is that querying structured data does not necessarily imply persistence of data on disk. Lovefield provides that option (memory-only DB) and it has proved to be a way more popular feature than initially expected.

yathit commented 8 years ago

Data type is not necessary. Key can be serialized (as defined in indexeddb spec) and work with it.

Mongodb has rich query API. It does not require column base rigid schema definition. It is fine without relational aspect.

If possible, it is better to incremental to the indexeddb spec rather than complete rewrite proposal.

arthurhsu commented 8 years ago

There are at least two misconceptions from your reply

1) MongoDB can replace every relational database. 2) RDB proposal is going to kill IndexedDB

Answers to these misconceptions are no and no. MongoDB is a document-based database, which has its strength and weakness. Assuming one type of DB can be the silver bullet and solve everyone's problems is not the right thing that a platform should do. If we can't agree on this, there's no basis for conversation.

What I'm proposing here is to provide a second choice. RDB should live side-by-side with IndexedDB. A web standard is good only when it provides features people need. IndexedDB provides features that are great for object/document-based usage, but not relational. RDB proposed here is to define a clear (sub)set of APIs that handles relational data just like SQL.

inexorabletash commented 8 years ago

Sorry Josh ...

No worries. I still ❤ you guys.

BTW, re: IDB "v3" - there's not really even a "v2" yet; the rough model spec model we're following is to just add new stuff to a living doc, then every once in a while take a snapshot and edit it down as necessary to a new version. We just haven't actually done that yet; everything in there is aspirational on top of the published "v1" (TR); that we've slipped out "getAll()" an a few other things represents how spec stuff happens in practice - cooperative evolution between the spec and implementations.

Anyway, back to the rants!

yathit commented 8 years ago

1) MongoDB can replace every relational database.

MongoDB do not have cross table join. It only have self join. That is good enough for most application.

MongoDB cannot (and will not) replace relational database.

2) RDB proposal is going to kill IndexedDB

It is already out. Nobody can kill IndexedDB.

Few web application may found relational database useful. As you see relational database come with a lot of schema specification and complex query api. Will RDB spec will have good ROI? Browser vendor are looking for small API foot print with bigger impact. RDB spec will not.

I think JSON schema definition and Promise in query result is good enough in your proposal.

freshp86 commented 8 years ago

Few web application may found relational database useful.

Is there any data to support this statement? It is hard to measure how many people would use something that does not currently exist, so this is simply an opinion. The closest measurement I can think of is to measure how many native apps on mobile make use of similar APIs.

Also a comparison like this needs a point of reference. "Few" compared to current IDB usage? Or "few" compared to the total number of websites in the wild? We can't also measure how many people are not using IDB because they are actually dealing with relational type of data instead.

One could also say that very few websites use any kind of persistence (not just IDB, but localStorage too and friends). One thing to keep in mind is that RDB proposal though is not only referring to use cases where persistence is required, so the target group of potential use cases is much larger than both IDB and WebSQL and other persistence solutions.

yathit commented 8 years ago

Let us go with many web application need relational database. It still ROI or usefulness of API for RDB is not great.

There is many low hanging fruit in IndexedDB spec to make API easier to use. Promise, URI base query, JSON schema definition, etc.

arthurhsu commented 8 years ago

Again, IDB is not created for relational use. Why the developers have to shoehorn their need with it? Also I disagree with the statement regarding ROI or usefulness. Please use numbers/data, or list some obvious citations (which I failed to see) to back these statements.

I can quickly provide a counter example of ROI using Lovefield's demo. Pure IDB version: https://github.com/google/lovefield/blob/master/demos/moviedb/demo-pureidb.js Lovefield's version: https://github.com/google/lovefield/blob/master/demos/moviedb/demo-jquery.js

Readability, ease of change, ease of maintenance, these are all valuable to developers. Writing such a simple app using so many lines with IDB is not good for ROI. Keep in mind that the links above are just a simple toy application. Consider implementing apps handling heavily relational data like tax preparation, financial applications, enterprise productivity suites, ... forcing the shoehorning is simply destroying user's ROI for developing web apps.

That's my point: a platform shall not force user to use one paradigm over the other for no good reason. "IndexedDB can be used" is not a good reason. Academic studies already showed us the strength and weakness of object DB and relational DB. I don't see why forcing everyone to use object DB to simulate relational DB behavior is considered useful and has high ROI.

arthurhsu commented 8 years ago

Also for the browser vendor part, I had an honest conversation with Jake because this is actually the first question he popped. It won't be hard to implement, the problem is the spec looks gigantic (even though I haven't really started it, from the example code we knew the spec will be long).

A quick reasonable implementation is to use the RDB JS API to generate SQL statements, and leverage existing engine embedded in the browser (i.e. SQLite in Safari, Chrome, and Firefox. For Microsoft Edge, it's really whether they want to pull in some more DLLs/.Net assemblies or not.) The SQL statement generation is not complicated (see polyfill: https://github.com/google/lovefield/blob/master/lib/query/to_sql.js), and I expect most of the efforts will be quite boiler-plate.

If any vendor want to go full-fledge (i.e. create a real database from scratch), then yes that would be a significant investment decision to make. Even so there are quite a few existing C++ class libraries to start with, so it's not from zero. At least one don't really need to write B-Tree from scratch as I did for Lovefield.

mathiasrw commented 8 years ago

A quick reasonable implementation is to use the RDB JS API to generate SQL statements, and leverage existing engine embedded in the browser

:+1:

nolanlawson commented 8 years ago

Great discussion! :)

@arthurhsu If the plan is for browser vendors to very tightly link the RDB query generator to the underlying database, then what's to prevent the same problem we had with WebSQL, where the implementation was tied too deeply to SQLite, and so MS would be forced to reimplement the many quirks of SQLite from scratch?

For example, how will the engine handle string collation? Will SQLite run in UTF-8 mode? Will there be any guarantees about the performance of tricky cases like subclauses? How should it handle binary data?

Again, I still feel like IndexedDB provides enough of the low-level building blocks (let's ignore the fact that it's a document store for awhile – it can store string keys, and as @yathit points out, you can serialize anything into strings). I know it's old hat by how, but the extensible web manifesto is still relevant here IMO.

I think for this spec to make a really strong case, it would be nice to see two implementations: one on top of IndexedDB, and one on top of WebSQL, and then demonstrate that the IndexedDB version is unreasonably slow for a normal use case. After all, the main argument in favor of RDB is performance, so it'd be nice to see some numbers around that. :)

(By the way, I fully expect WebSQL to win, but it'd be nice to know by how much!)

arthurhsu commented 8 years ago

The model of proposed RDB is very different from WebSQL. WebSQL did not define behaviors of the queries. It just define how to pass a SQL statement and how to get back the results from execution. This has many fundamental flaws (okay I'm now flogging the corpse of WebSQL)

SQL has so many dialects, which can do things slightly differently across implementations (sounds familiar for Web standards)
SQL needs parsing, which causes security issues (e.g. SQL injection) and subtle quirks for corner cases
SQL types are not really mapped to JS types
SQL has very complicated transaction models, and WebSQL seems to consider only a subset of it without clear documentation
More ...

On the other hand, RDB tried to define the queries to be performed using JS API, therefore it aims to define a clear set of behavior from JS layer. It's up to the browser vendor to decide how to implement these behaviors, but it will not have the collation/subclauses/transaction model problems that plagues WebSQL.

Also, I need to repeat my arguments (maybe using bullet form)

RDB aims to simplify relational data manipulation for web apps
- Easy-to-use APIs to promote readable, maintainable, and performant code for apps needing relational data
- No need to shoehorn relational data model on top of object data model
- No need to implement domain-specific query layer or to use big libraries (e.g. Lovefield)
Low level building block fallacy
- Fundamentally the algorithms are different (that's the reason for repeating term "shoehorn")
- One storage structure cannot satisfy all
- One indexing structure cannot satisfy all
- "You can use" with the cost of complexity and performance penalty, which drove users to other platforms (bad for web) or server-side solutions (cost for users)
Limitations for implementing a generic query engine in JS
- Library is unavoidably big (Lovefield ~130KB, SQLite with web assembly is ~1.5MB)
- Limited/Non-existing memory/storage management
- Low level building blocks are posing unnecessary limitations (auto-commit, alter schema, excess eventing, ...)
Just non-sense
- Browser carries RDB implementation in C++ without exposing it
- The idea of building/fixing more low-level blocks to improve IDB and shoehorn RDB on top of it, when there's already something in the C++ layer
- Making IDB more bloated with the features that object DB does not need

nolanlawson commented 8 years ago

Thanks, that makes a lot more sense! I think my original worries still stand (not very "extensible web," vulnerable to cross-browser bugs, one more thing for webdevs to learn), but I'll be interested to see what an IndexedDB- or WebSQL-based implementation of RDB looks like. If we can prove that the IDB version is unreasonably slow (because of the "shoehorning" you describe), then I think that will make a great case for this spec.

(BTW Lovefield seems to be ~42KB minified+gzipped... so not really that huge. Give yourself a little credit. :wink:)

arthurhsu commented 8 years ago

I'm very careful not to fall in the endless loop of "significant performance". It's provable but takes a lot of efforts. I do have some benchmark code to compare IndexedDB with WebSQL, but could not disclose it due to NDA. Rough number is that IndexedDB takes way more JS heap and runs in less than half of the speed for the benchmark (50K rows with simple joins). I can try to write a separate benchmark without NDA limitations, HOWEVER, it's always pain in the rear to write IndexedDB code, even for experts like me.

See the problem here? It's always pain in the rear to write IndexedDB code, even for experts like me.

That's the "unnecessary complexity" part I keep trying to address. If you don't like Lovefield for any reason, the only choices you have are either use server-side solutions, or bite the bullet and share the pain.

The other things I could not do with Lovefield are reducing memory usage and shorten initialization time. Lovefield caches all data in memory and that's an unfortunate design trade-off to be made. I'm hoping the RDB spec and C++ side support can make things more efficient using much less memory.

yathit commented 8 years ago

It is interesting how memory trade off pay out using Lovefield. On the other hand, alternative implementation is to use temporary indexeddb index instead of caching in memory.

arthurhsu commented 8 years ago

Creating B-Tree index (which IndexedDB uses internally) on the fly for 50K rows will cause non-negligible pause, which will make the library not usable in UI applications. Also IndexedDB write performance will be impacted if the indices are permanent instead of temporary.

The trade-off is to load everything in the beginning and create B-Tree in JS heap at the moment. Two advantages: most applications have a loading facade because they need to download data from server, so it will be less destructive to user experiences; another is to make IndexedDB read/write much faster, because we can use getAll() (which creates much less events) and IndexedDB simply put objects into its store, which is what it's designed for.

yathit commented 8 years ago

Creating B-Tree index (which IndexedDB uses internally) on the fly for 50K rows will cause non-negligible pause, which will make the library not usable in UI applications.

All database transactions (Indexeddb or WebSQL) run in background thread. There is no pause in UI thread. But I will not use 50K rows in single transaction. Typically I will load 200 records per http request (default in Google GData API) and put into database in a single transaction.

Also IndexedDB write performance will be impacted if the indices are permanent instead of temporary.

No. I do not see IndexedDB has performance problem I tested on hundreds of thousands of rows. Query return less then 20 ms.

Two advantages:

Two disadvantages: 1) inconsistent data across pages 2) can crash any time due to out of memory.

So given the current browser API, the tradeoff is not good in my opinion.

arthurhsu commented 8 years ago

All database transactions (Indexeddb or WebSQL) run in background thread. There is no pause in UI thread. But I will not use 50K rows in single transaction. Typically I will load 200 records per http request (default in Google GData API) and put into database in a single transaction.

Running in background thread does not mean not blocking UI thread. The OS can give you only one thread (which is common on low-end hardware). If the data model can't be filled fast enough to handle UI scroll request, chances are the UI is still effectively blocked by the thread.

I do agree no 50K rows in a single transaction and paging shall be used. IndexedDB is very terrible on this for pagination. One must have very intimate knowledge of the key used, or use auto increment key so that key range can be used to achieve pagination. But let's go back to benchmarking. If we run TPC-C on IndexedDB, this 50K transaction will be normal. TPC-C is designed to torture database as bad users of the library. It would be fun to see how TPC-C runs against IndexedDB and WebSQL (though I don't think it's fair. IndexedDB is an object DB).

No. I do not see IndexedDB has performance problem I tested on hundreds of thousands of rows. Query return less then 20 ms.

Thank you for providing such a good example that why proving in "performance" is an endless loop. Also this can be used to answer the concern that @nolanlawson has. The main thing of performance is that we need a commonly accepted benchmark to end debate like this. @yathit has a beefy machine that can update hundreds of thousands of rows within 20ms. I don't. My SATA disk can transfer at most 4.8Gbit/s, which means 600MB/s. For 100K rows with 1K payload each, we are talking about 100MB disk transfer, which takes my SATA drive at least 167ms. I'm not even counting other overheads such as IndexedDB cursor loop events.

Two disadvantages: 1) inconsistent data across pages 2) can crash any time due to out of memory.

That's assuming transaction management is not in place. Lovefield does have its own transaction management to avoid crash caused inconsistency. Lovefield's transaction management even handles external changes, for example, you can use Firebase as a backstore and Lovefield will nicely handle changes from other clients. The real issue is that underlying IndexedDB does not provide any change notification across process so I did not see how inconsistent data across pages can be solved with or without caching.

yathit commented 8 years ago

we are talking about 100MB disk transfer, which takes my SATA drive at least 167ms.

You are talking about loading all records into memory. I will never load hundreds of thousands of records into memory. If you need such query, you are doing it wrong.

20ms is typical time taken for most laptop (object store has hundreds of thousands of records, but query is limited to few hundreds result). Since Indexeddb has keyrange, limit and offset, there is not need to query more than 100 records at a time.

Firebase as a backstore

Consistency beyond browser is hardly goals for browser API. Indexeddb transactions are consistent across tabs.

The real issue is that underlying IndexedDB does not provide any change notification across process

Database changes notification is another few use case with large performance degradation and increase complexity. Most of these use cases are for convenient - rather than necessary. There is WebStorage API which provides low volume notification. IndexedDB use case will be too noisy.

yathit commented 8 years ago

Shoehorn RDB on top of IDB is fundamentally flawed.

It is not flawed. A RDB library can be built using IDB without performance and consistency problem that Lovefield have.

arthurhsu commented 8 years ago

It is not flawed. A RDB library can be built using IDB without performance and consistency problem that Lovefield have.

I would be glad to see such a library and I'm pretty confident on the performance of Lovefield. Show me the implementation and numbers. I'm very curious to see complete feature parity and benchmark results.

arthurhsu commented 8 years ago

You are talking about loading all records into memory. I will never load hundreds of thousands of records into memory. If you need such query, you are doing it wrong.

20ms is typical time taken for most laptop (object store has hundreds of thousands of records, but query is limited to few hundreds result). Since Indexeddb has keyrange, limit and offset, there is not need to query more than 100 records at a time.

Well I think the original context for my comment is based on creating a temporary B-Tree index, and you replied that "I do not see IndexedDB has performance problem I tested on hundreds of thousands of rows. Query return less then 20 ms". Can one create a B-Tree index without full table scan? Even if it's done in C++ it still needs to read everything before creating an index.

Database changes notification is another few use case with large performance degradation and increase complexity. Most of these use cases are for convenient - rather than necessary. There is WebStorage API which provides low volume notification. IndexedDB use case will be too noisy.

Web standards shall be moving web forward, and the worst attitude for W3C to have is "these use cases are for convenient - rather than necessary" because A) convenience can be the major driving force for user adoption and productivity B) spec working groups are not omniscient nor omnipotent, and are supposed to be humble and hear user needs

It's very obvious a good set of APIs will make everyone's life much easier, and that's what I'm asking for data-heavy app developers.

yathit commented 8 years ago

If you use IndexedDB API for query, you don't need full table scan nor creating B-Tree. Query on IndexedDB API having hundreds of thousands of records do take less then 20ms on most laptop. This is the result we should be comparing for performance. Not the time taken for full table scan, which never occur.

I use multiple cursors for multiple AND self-join query in combine with compound index. I have web application using over 30 object stores with 200 indexes. There is full text search and complex join query with ordering. Some users have hundreds of thousands of records in the database. For join between object stores, I use relationship table and join them by composite key. I don't see performance problem in IndexedDB. All queries are possible directly on IndexedDB query without resolving into on memory buffer. My chrome extension run non-stop since some user set 1 minutes alarm to sync between Google calendar and SugarCRM meeting records. Chrome task manager show 90 MB usage by the extension. To my use case IndexedDB meets the expectation.

It will be great if somebody make a useful benchmark between the two libraries.

arthurhsu commented 8 years ago

Lovefield's numbers are here https://google.github.io/lovefield/dashboard/src/dashboard.html.

For sure selecting a page can be 20ms, not a problem at all. You can see Lovefield had the same performance in the case of selecting 10 rows. As I keep emphasizing, I care more about ease to use and productivity. Let's use ydn db's join code as an example here. Say I want to do this.

SELECT * FROM article WHERE license = 'SA' AND publisher = 'Nature' ORDER BY title

In ydn this somehow simplified to the following (well, code is quoted from http://dev.yathit.com/ydn-db/doc/query/key-joining.html, I did not write them)

var license_sa_iter = ydn.db.IndexIterator.where('article', 'license, title', '^', ['SA']);
var publisher_nature_iter = ydn.db.IndexIterator.where('article', 'publisher, title', '^', ['Nature']);
var match_keys = [];
var solver = new ydn.db.algo.ZigzagMerge(match_keys);
var req = db.scan(solver, [license_sa_iter, publisher_nature_iter]);
req.then(function() {
  db.values('article', match_keys).done(function(results) {
    console.log(results);
  });
}, function(e) {
  throw e;
});

In my proposed RDB this become

var a = db.table('article');
db.select().from(a).where(a.license.eq('SA').and(a.publisher.eq('Nature')).orderBy(a.title)
  .commit().then(function(results) {
  console.log(results);
}, function(e) {
  throw e;
});

It's quite obvious which one is more intuitive to code, easier to read, easier to maintain, and easier to change. This will be even more interesting if we change the SQL statement to

SELECT article.name, author.name
  FROM article ar, author au
  WHERE ar.license = 'SA' AND ar.publisher = 'Nature' AND ar.authorId = au.author
  ORDER BY ar.title`

In fact, the ydn way is what I tried to avoid when I create Lovefield. I know things can be done that way and can be efficient, provided that the users of the library are also database experts and know edges and wrinkles of every corner. Not everyone is capable or lucky enough to have that kind of people in their team to solve their problem.

Productivity is the problem that RDB is targeted for. RDB is proposed to improve productivity of average web developers and empower them to write data-rich web apps.

agershun commented 8 years ago

@arthurhsu There is a error somewhere in the page: https://google.github.io/lovefield/dashboard/src/dashboard.html

arthurhsu commented 8 years ago

@agershun I'm seeing it fine on Windows 7 with both Chrome and IE. Maybe you need to check your plugins?

agershun commented 8 years ago

@arthurhsu Sorry, it was low disk memory problem. Now it works fine with Win8 Crome&FF, but not with IE10. But thiis is ok for me.

yathit commented 8 years ago

I have create a simple benchmark for three AND equi-self-joint query with ordering.

For 50,000 records, YDN-DN take ydn: 12891.64ms on Firefox and 5728.010ms on Chrome for above query on my two year old Macbook Pro laptop.

I am not able to get Lovefield work. Any PR for optimized Lovefield query is highly appreciated.

Lovefield with lots of boiler plate code

var schemaBuilder = lf.schema.create('benchmark-lf', 1);

schemaBuilder.createTable('article').
addColumn('id', lf.Type.INTEGER).
addColumn('license', lf.Type.STRING).
addColumn('publisher', lf.Type.STRING).
addColumn('year', lf.Type.INTEGER).
addColumn('title', lf.Type.STRING).
addPrimaryKey(['id']).
addIndex('licensetitle', ['license', 'title'], false, lf.Order.ASC).
addIndex('publishertitle', ['publisher', 'title'], false, lf.Order.ASC).
addIndex('yeartitle', ['year', 'title'], false, lf.Order.ASC);
var todoDb, article;
schemaBuilder.connect().then(function(db) {
    todoDb = db;
    article = db.getSchema().table('article');
    var rows = [];
    for (var i = 0; i < data.length; i++) {
        rows.push(article.createRow(data[i]));
    }
    return db.insertOrReplace().into(article).values(rows).exec();
}).then(function() {
    console.time('lf');
    return todoDb.select().from(article)
        .where(article.license.eq('SA')
            .and(article.publisher.eq('Science'))
            .and(article.year.eq(2006)))
        .orderBy(article.title).exec();
}).then(function(results) {
    console.log(row);
    disp('Lovefield' + ' ' + new Date().toLocaleTimeString() + ' ' +
        row.length + ' articles from ' + row[0].title + ' to ' + row[row.length - 1].title);
    console.timeEnd('lf');
});

YDN-DB looks simple and readable

var ydb = new ydn.db.Storage('benchmark-ydn', {
    stores: [{
        name: 'article',
        keyPath: 'id',
        indexes: [{
            keyPath: ['license', 'title']
        }, {
            keyPath: ['publisher', 'title']
        }, {
            keyPath: ['year', 'title']
        }]
    }]
});

ydb.putAll('article', data);

   console.time('ydn');
var iters = [ydn.db.IndexIterator.where('article', 'license, title', '^', ['SA']),
    ydn.db.IndexIterator.where('article', 'publisher, title', '^', ['Science']),
    ydn.db.IndexIterator.where('article', 'year, title', '^', [2006])];
var match_keys = [];
var solver = new ydn.db.algo.ZigzagMerge(match_keys);
var req = ydb.scan(solver, iters);
req.then(function() {
    ydb.values('article', match_keys).done(function(row) {
        console.timeEnd('ydn');
        console.log(row);
        disp('YDN-DB' + ' ' + new Date().toLocaleTimeString() + ' ' +
            row.length + ' articles from ' + row[0].title + ' to ' + row[row.length - 1].title);

    });
}, function(e) {
    console.error(e);
});

nolanlawson commented 8 years ago

Creating B-Tree index (which IndexedDB uses internally) on the fly for 50K rows will cause non-negligible pause

@arthurhsu is correct; IndexedDB does heavily block the DOM, if used in the main thread. I have an article and a demo that explores this problem in-depth. The worst blocking is seen in Firefox and Chrome.

For this reason, I have recently been advising folks to only use IndexedDB inside of a web worker. The serialization costs between the worker and the main thread are negligible (especially if you use JSON.stringify/parse) and the gains in framerate are enormous.

This doesn't have a huge impact on the topic at hand (I assume RDB would be runnable inside a worker?), but it's worth pointing out.

arthurhsu commented 8 years ago

@yathit You are off topic. How about comparing IndexedDB with SQLite or Microsoft SQL compact? That's what I'm proposing.

Also the feature parities are not the same. Your code has nothing to do with data integrity (in this case, primary key are not enforced as a constraint). I don't mind people promoting their library, that's what I do also. I'd like to ask everyone on this thread to stay on topic and keep in mind that we are talking about a standard proposal that aims to promote the web platform and developer productivity.

yathit commented 8 years ago

Fixed incorrect Lovefield query. Speed is impressive.

@agershun add AlaSQL in the benchmark.

11:30:15 AM Load ALA: 227 ms 50000 articles from ab51f365 to 57f4b6fc
11:30:22 AM Load YDN: 7445 ms 50000 articles from ab51f365 to 57f4b6fc
11:30:44 AM Load LF: 21305 ms 50000 articles from ab51f365 to 57f4b6fc

11:30:44 AM LF: 73 ms 20 articles from 00d64a0f to 1a29c6e5
11:30:44 AM YDN: 292 ms 20 articles from 00d64a0f to 1a29c6e5
11:30:44 AM YDN2: 5 ms 20 articles from 00d64a0f to 1a29c6e5
11:30:44 AM ALA: 27 ms 20 articles from 00d64a0f to 1a29c6e5

agershun commented 8 years ago

@yathit We will. Thank you!

Note: Please, take in account that AlaSQL caches results of parsing and compilation of SQL statements for the same SQL statement. Probably, another test with diverse SQL statements can give another results.

freshp86 commented 8 years ago

FYI, I made some minor changes to the benchmark mentioned two comments above, see https://github.com/freshp86/lovefield-ydndb-benchmark/commit/707605bcd521492b984a8b5c43181b5fcd0f9bc0. (need to manually call runLf() from the dev tools).

Lovefield's SELECT performance is on average (run 5 times) at 16ms (tried on Google Chrome 48.0.2564.97 (Official Build) (64-bit)).

yathit commented 8 years ago

Thanks @freshp86 Updated. Now Lovefield and AlaSQL run same speed.

Since enough rants has been on IndexedDB, can I have Lovefield rants?

How on earth am I suppose to know what indexes are required in Lovefied? There is no documentation on index requirement. Apparently (looking at the commented code), @freshp86 used Lovefield explain command to check require index. Why not automatically generate require query index, as it did with AlaSQL? Why query run without index, but with slow speed? Will indexing be in the specification?

Regarding productivity, the primary feature of the proposal, Lovefield syntax is not productive at all. I never use AlaSQL, but I wrote above query without looking at the documentation. Why lovefied to introduce a new concept of lf.op.and. BTW, there is no documentation for lf.op.and.

The security issue that you mentioned in SQL don't exist in AlaSQL. Is parametrized query not a simple fix? Do you really need the whole new query language to fix security issue?

For completeness I have add AlaSQL below:

var db = new alasql.Database('mybase');
db.exec('CREATE TABLE article (id INT PRIMARY KEY, license STRING, publisher STRING, year INT, title STRING)');
for (var i = 0; i < data.length; i++) {
  var row = data[i];
  db.exec('INSERT INTO article VALUES (?, ?, ?, ?, ?)', [row.id, row.license, row.publisher, row.year, row.title]);
}

var result = db.exec('SELECT * FROM article WHERE license = ? AND publisher = ? AND year = ? ORDER BY title ASC LIMIT 20',
  ['SA', 'Science', 2006])

arthurhsu commented 8 years ago

To @nolanlawson and people who are interested in potential performance difference for the proposed RDB and IndexedDB, I've created a simple demo to show that.

http://arthurhsu.github.io/rdb/demo/demo.html

This demos a simple but super common usage of relational database: a master-slave view. What it does is to join three tables and show the details result. The key number here is when we click a grid row, how long it takes to generate the details data.

On my Macbook Pro, IndexedDB is about 70~200ms, WebSQL is 5ms~9ms, and Lovefield is 5ms~15ms. Remember this is just for a single row and it's order of magnitude difference between IndexedDB and WebSQL. If you're curious, you can also profile the process memory usage and JS heap usage. You shall see why I care so much about memory management, and keep ranting on not being able to do that in JS.

I hope this demo can provide a strong enough case from the performance side.

freshp86 commented 8 years ago

@yathit: lf.op.and, lf.op.or documentation can be found at https://github.com/google/lovefield/blob/master/docs/spec/04_query.md#413-search-conditions.

Regarding the question of "why introduce a declarative API instead of parsing strings?" Quite a few reasons.

Avoiding parsing, which is a security hole and source of trouble is just one of them.
Type-checking. Javascript is changing in many ways, and one of those ways is to provide an (optional) typing system that helps developers (for example Closure Compiler, Typescript). Today's JS apps are much bigger and harder to manage than 10 years ago. A declarative API allows type-checking, for example see https://github.com/google/lovefield/blob/master/dist/lovefield.d.ts.
A declarative API makes it crystal clear of what is supported and what is not, compared to passing free text to a generic exec() method and hoping for the best. (Also how about typos on free text?)

Lovefield syntax is not productive at all.

I respect your opinion, but can't agree. Having talked with lot of developers on various events (with the biggest one being Google IO 2015), the reaction to the declarative API (also known as the builder pattern) is quite the opposite. The builder pattern is nothing new on the web (for example the very popular d3.js library is also using it extensively).

Regarding adding indices automatically, it is a possibility for the future but IMO this is two steps ahead, when the Web is missing even the most basic RDBMS features. Providing a capable developer with the tools to optimize the DB schema according to his/her own unique needs (by manually adding indices) is higher priority than trying to second guess the developer and auto-magically add indices.

arthurhsu commented 8 years ago

BTW, I have learned an interesting productivity number today.

An undisclosed project I know (eh, that's lame, but I can't say anything) spent 4 years and 5 engineers to consistently implement/maintain an IndexedDB implementation (because every time they need to add a new column or an index, they need to change the domain-specific query layer built on top of it). They've fed up with it and switch to Lovefield. One engineer in four weeks get everything migrated, and maintenance is way easier: one engineer's spotty check is good enough. Adding/removing columns? Peace. Adding/removing indices? Peace. Modify the way of joining and sorting? Peace.

I'm pleased to know that they are happy to have it. I would like to see if I could disclose this project in the future and hopefully let the community realize how much productivity difference can be with this RDB API.

yathit commented 8 years ago

@freshp86 Thanks. Could you put Lovefield API doc in https://google.github.io/lovefield/ My bad that I did not hard enough to search on github blob.

freshp86 commented 8 years ago

Lovefield documentation is reachable by the GitHub project page at https://github.com/google/lovefield, there is no need to look for blobs manually, one can just navigate from the README.md.

yathit commented 8 years ago

@arthurhsu I am sure Lovefield will be loved by Closure Tools users. But most JS developers will not.

@freshp86 Builder pattern is powerful on statically typed language like Java. I don't see it in browser and JS API.

yathit commented 8 years ago

@freshp86 what i means was, your documentation is not ready for public consumption. For example there is no way to show me that lf.op.and can take more two arguments. How it compose with other logic operator is not clear. Like most developers, I am impatient of learning yet another ORM library for query optimization. Can we just use SQL?

freshp86 commented 8 years ago

I feel this thread has severely derailed the original topic. I'll respond one last time to off-topic comments.

there is no way to show me that lf.op.and can take more two arguments

Screenshot from the docs (note the word "variable") and

your documentation is not ready for public consumption.

Demos (six of them) https://github.com/google/lovefield/tree/master/demos

Detailed specs and design documents: https://github.com/google/lovefield/tree/master/docs

Codelab (voted among the best at GoogleIO) https://github.com/google/lovefield/tree/master/codelabs/stock_analyzer

Long video presentation: https://www.youtube.com/watch?v=wRiI4p5Uk4E

Short video presentation: https://www.youtube.com/watch?v=S1AUIq8GA1k

Detailed changelog for every version: https://github.com/google/lovefield/blob/master/dist/CHANGELOG.md

Developer forum: https://groups.google.com/forum/#!forum/lovefield-users

Again, I respect your opinion but can't really disagree more.

yathit commented 8 years ago

@freshp86 I googled and browsed several time on the web site to figure out how to use lf.op.and. I did not know it could take more then two arguments.

I also confess that I did not read the lovefield documentation when writing above code. I did found the readme page that you show me. But just scan through to figure out how to use lf.op.and. Definitely I was the one to blame.

mathiasrw commented 8 years ago

@yathit

I also confess that I did not read the lovefield documentation when writing above code

Please contribute to a fruitful dialogue by putting some time into what you are commenting.

Since enough rants has been on IndexedDB, can I have Lovefield rants?

No, not here. Please create a separate issue for inputs not closely related to "Limitations of IndexedDB". You risk being seen as a troll if you contribute with inputs unrelated to the headline of the issue.

nolanlawson commented 8 years ago

@arthurhsu Thank you for the demo; very enlightening! Quick question: isn't Lovefield using IndexedDB under the hood? Or is it in-memory in this case? I'm confused a bit by comparing IndexedDB vs WebSQL vs Lovefield.

arthurhsu / rdb

Limitations of IndexedDB? #2