Thank you obeliskos, I thought about similar things. I am also open to any breaking changes relative to LokiJS (some already exist). That's why LokiDB is still in beta.
> by default we should probably continue to expect that users will throw dirty data at it
I would argue against this. LokiDB is for programmers/software developers, not for "script kiddies". Most of the time I know what data to expect when dealing with databases. As you already said, TypeScript helps a lot there. ;)
I would prefer strict comparison as the default behavior.
Talking about different modes for binary indices one could add it while creating a collection, like so:
```javascript
const coll = db.addCollection("CoolStuff", { indices: ["id", { field: "name", strict: false }] });
```
Here the field `id` would use strict comparison on binary indices and the `name` field abstract/weak comparison.
Additionally, I would rename some operations so that $eq, $ne, $lt, $lte, $gt, $gte, $between would use the javascript strict comparison and $aeq, $ane, $alt, $alte, $agt, $agte, $abetween use weak comparison.
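As a sketch of the resulting query semantics (these op names are only the proposal here, nothing implemented yet; assume the coll from the example above):

```typescript
// proposed: un-prefixed ops are strict, "$a"-prefixed ops are abstract/weak
coll.find({ id: { $eq: 42 } });     // strict: matches only the number 42
coll.find({ id: { $aeq: "42" } });  // abstract: "42" would also match the number 42
coll.find({ age: { $abetween: [18, "65"] } }); // weak range comparison over mixed types
```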
It may take a lot of effort, but for that level of granularity I think we should look into extracting as much binary index logic as possible and adding modes to that. I was originally thinking of a 'db'-level option, but collection- or property-level (index-level) modes would be more flexible.
Dates are another example which (unless we encode them ourselves) would need non-strict indices, to account for the fact that they are objects (never strictly equal), while < and > coerce them to their numeric timestamps before comparing. Not to mention they cannot be serialized without being converted to strings.
Also, to reword my comments to be more accurate: javascript doesn't have strict relational comparisons (<, >) unless you test typeof yourself. But for most types (other than Dates), sorting can work because ===, <, and > will all 'agree'.
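For illustration, the Date quirk in plain JavaScript:

```typescript
// equal timestamps are never strictly equal, but relational operators
// still "agree" because both sides coerce through valueOf()
const a = new Date(1500000000000);
const b = new Date(1500000000000);
console.log(a === b);                     // false - distinct object references
console.log(a < b, a > b);                // false false - numeric coercion on both sides
console.log(a.getTime() === b.getTime()); // true - comparing encoded values works
```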
I am quite sure you understand the binary indices of LokiJS better than most others, including me. I was also unaware of the Dates problem, or that < and > are not strictly compared either.
I would like to have this in the first release because of possible breaking changes.
Please don't hesitate to improve binary indices at index level with modes. :+1:
I am thinking it will be its own ts class (binary_index.ts?).
I will need to add an internal array of values so that it doesn't need to 'ask' collection for those. Might also optimize translation between data[] positions and sorted index[] positions and/or $loki id values.
They should really fix javascript dates in a future js standard; because of that, I only deal with unix ms format within my data sets.
Hey obeliskos, any progress on this? 😃
Nope but your timing is good and I'll start looking at it this weekend.
I have done most of the 'last mile' integration of the binary tree index into loki. Since I am new to typescript and a lot of the loki ts refactoring, feel free to criticize or suggest changes to better support long-term maintenance. Once integrated, you can still change it at will if you want or are interested.
To summarize the changes: BinaryTreeIndex is a self-balancing (AVL) tree implementation which allows duplicate values (stored internally as siblings).
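Purely as an illustration of the shape (the field names below are my shorthand, not the actual BinaryTreeIndex internals):

```typescript
// illustrative node layout for an AVL tree allowing duplicates
interface TreeNode<T> {
  value: T;            // the indexed property value
  siblings: number[];  // $loki ids of documents sharing this value (duplicates)
  height: number;      // subtree height used for AVL rebalancing
  left: TreeNode<T> | null;
  right: TreeNode<T> | null;
}
```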
In order to integrate BinaryTreeIndex into loki, I had to refactor comparators and add several interfaces to helper.ts... we may eventually want to rename helper.ts to something like ranged_compare.ts.
I also wanted to allow users to inject their own comparators and rangedIndex types (should they want to). So loki will maintain a map of named comparators and rangedIndex factory methods. Currently they are defined globally within helper.ts, but a possibility might be to store them in the loki instance and dependency-inject them into collections... that is, if we want to go with a DI approach.
```javascript
// most users will not need to define these maps,
// this is only for injecting additional comparators and index types
let db = new Loki("test.db", {
  comparatorMap: {
    numberCompare: { compare: (a, b) => a - b } // js implementation of ILokiRangedComparer
  },
  rangedIndexFactoryMap: {
    redblack: (name, comparator) => {
      return new MyRedBlackIndexImplementation(name, comparator); // defined elsewhere in user code
    }
  }
});
```
```javascript
// loki will register "js" and "loki" comparators into (global) ComparatorMap
// loki will register "btree" indexTypeName and factory function into (global) RangedIndexFactoryMap
let coll = db.addCollection("coll", {
  rangedIndexes: {
    age: { indexTypeName: "btree", comparatorName: "numberCompare" },
    address: { indexTypeName: "bidx", comparatorName: "js" }, // eventually we may convert to rangedindex
    anotherDirtyProp: { indexTypeName: "redblack", comparatorName: "loki" },
    dirtyData: {} // implicit indexTypeName: "bidx"? implicit comparatorName: "loki"?
  }
});
```
So currently the RangedIndex is implemented as key/value with key = $loki (number) and the value as a generic. As such, when performing find() operations I have to look up each document with a get() to obtain each document reference or data array position to add to the result. For now the $loki index lookups seem to add low overhead. I have thought about changing resultset filtered rows to contain $loki ids instead of data array positions, but this would probably be done along with either a conversion of the old binary index to an IRangedIndex implementation, or removal of the old binaryIndices now that there is a replacement. BinaryTreeIndex has slightly more memory overhead than the old binary indices, but I think those may no longer be necessary.
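A rough sketch of that lookup path (the interface shapes and the rangeRequest method name below are placeholders, not the actual API):

```typescript
// the ranged index stores $loki ids, so each match costs one get() to resolve
function findViaRangedIndex(
  coll: { get(id: number): object },                      // collection-like (assumed shape)
  index: { rangeRequest(op: string, val: any): number[] } // ranged-index-like (assumed shape)
): object[] {
  const ids = index.rangeRequest("$lte", 40); // $loki ids of matching documents
  return ids.map((id) => coll.get(id));       // resolve each id to its document
}
```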
Initial benchmark results show that BinaryTreeIndex is about as fast (maybe a little faster) than the old binary indices when find() ops are taken on their own. When index maintenance operations (insert, update, remove) are accounted for, BinaryTreeIndex is much faster. In order to measure performance more accurately, the benchmarks need to be invoked with the --expose-gc command line parameter so that I can garbage collect memory between benchmarks. There are still a few anomalies I need to look into, but they seem to show BinaryTreeIndex scaling much better to higher document counts. Currently, I am not checking in the built packages which they import, so you might need to do a build before running them.
Since the overhead of performing a get() seems low, I am thinking I will not replace the loki idIndex with a hashmap similar to the unique index for now. If you have a different opinion or a need for this, let me know. One anomaly in the above benchmarks is that loki get() seems to be faster than unique by() when garbage collection is performed between benchmarks. If that proves to be the case, I may even suggest we convert the unique index to a simple key/value of (propertyValue/$loki) and do a get() to resolve. Since our current unique indexes are not serialized and require recreation at load time (based on persisted column names), we could eliminate that load-time delay by serializing the index itself and just getting rid of 'uniqueNames' from the Serialized interface.
I have not done anything to version 'upgrade' the database. It should be backwards compatible... if no ranged indexes were saved, they would just be empty until populated. Should we decide to get rid of the older binary indices and replace them with a conversion to BinaryTreeIndex, we might need to add that conversion logic to the upgrade codepath.
I have been implementing code coverage for most of my additions, but I have had somewhat limited visibility into its accuracy or what travis rules might require before complaining. Perhaps during a merge PR I will be able to see before/after stats a little more clearly. Let me know if you foresee any issues which might be encountered.
So, depending on your interest and available time to look into this, I am interested in whether you think the global ComparatorMap and RangedIndexFactoryMap should be organized or refactored differently.
If you see any areas where I have not used TypeScript in the best manner, feel free to point that out... or any other questions or concerns before I begin looking into merging via PR.
After doing some preliminary tests on a separate repo, I am somewhat dreading a rebase from master before the merge PR, so I might tread lightly with experimentation on VS Code conflict resolution. In my test I had to resolve conflicts that led to a headless parent, which I had to publish to a new branch before the merge PR.
I did a rebase and it went smoothly, so I went ahead and created a pull request to see what travis would say. It appears to have less code coverage than optimal... I will have to analyse it to see if it points out where I might add more tests.
I won't rush the pull request in case you have interest in providing feedback... otherwise I may look into adding a few unit tests and merge later next week.
Thank you for your good summary! Now I have to understand what you are doing. ;)
I don't have much time this week so please, be patient with me.
No problem, no rush. And if you don't want to know all the details, you might just make sure the interface is clean enough (helper.ts has most of the interfaces).
Hello obeliskos,
I have a few initial suggestions (just before I go on holiday for 4 days ;)).
The stuffed shirt in me: I improved our TSLint configuration to better standardize the code format.
Please rebase your branch; with `npm run lint:fix`, most, if not all, of the linter errors should be fixed for you. ;)
Second, I would suggest dropping the old binary index support. Your BinaryTree is much cleaner and not directly integrated into the Collection.
For upgrade support from LokiJS to LokiDB, the only path I see is reindexing with BinaryTrees - this would happen only once.
Thanks, good time for holidays... hope you enjoy yours :)
Thanks, I'll start that later today along with examining test coverage.
Early on, I was having issues getting indexers to work with (string|number), so I hardcoded the id (key) to be a number ($loki) and just used a generic for the value type. It seems you are able to do this with `Dict`
Hey @obeliskos, thank you for your ongoing work. But we have to revert (reset) the last two commits... Our changelog gets automatically generated from the commit messages if they follow the Commit Message Guidelines.
I backed up both commits at _backup/btree_indexfeature and _backup/remove_legacy_binaryindex.
If you are okay with it, I will force-push back two commits. After that, you can create both PRs again and I will merge them. I am sorry.
Afterward, I would like to adjust the API and naming a little bit. With your agreement.
Ok... not sure what I would need the backup for - I would think the first PR could be replayed as-is... do what you need to do and I will try to resubmit them (I think the second one might need a rebase).
Hey @obeliskos, so I finished the commit messages issue after messing it up myself. :fallen_leaf:
I saw this commit message:
> still need to determine best way to (re-implement) having indexed property on property with dot notation (nested prop)
I think this is obsolete. With the `nestedProperties` option you can use dot notation in all queries and indexes (also in full-text search indexes).
great, thanks :)
The only other -index- infrastructure change I was thinking we should make is to UniqueIndex... We currently do not save unique indexes within the serialized database, just an array of property names from which we regenerate the unique indexes upon deserialization/load.
I was considering converting it into a very basic key/value lookup with 'val' being the key and $loki being the value. A similar (second) key/value lookup might be made from 'val' to the actual object instance. Then we might strip out the second kvp within toJSON (saving the actual index within db serialization), and on subsequent loads the -first- by() access would lazily look up the instance by $loki (which is nearly as fast).
I first intend to experiment to understand why .get() ops seem faster than .by() ops in my benchmarks, as that should not really be the case.
I am not sure we are at a point to consider converting loki.data from an array to a hashmap or whether that would be worth it.
And as for converting resultset.filteredrows into an array of $loki ids rather than data[] positions, I am not sure I am ready for that either at the moment. We can take time to consider those changes.
Thanks for the info about nestedProperties; I'm not sure if and how to ensure test coverage of that.
Feel free to suggest API and naming adjustments.
Feel free to change the internal data structure if you feel this would be better. We don't have a release date yet so don't feel rushed.
I like the option for the user to decide whether the index should be stored or not (like the full-text search option). Not all indexed values can be serialized.
I'll try to get some stuff done for the new binary tree tonight (UTC+2). I will start by adding a unit test with nested properties for the binary tree.
I have created a new branch for unique index and lokimap optimizations.
I see the object references were removed from the unique index (was that you?) - probably what I would have done anyway. I added a second map, so we now have two maps: (loki->val) and (val->loki).
I also added a collection-level 'lokimap' which maps (loki->obj), with obj being a data array object reference.
I modified collection.by() to use the index directly rather than the slower find() codepath.
I modified collection.get() to use the lokimap whenever we do not pass true for the second param to return the data array position. Hopefully soon we can rewrite resultset.filteredrows to use $loki and the lokimap rather than data array positions. And if we refactor data from an array to a map, then our deletions and updates (which currently splice the data array) will speed up, as we are just adding and removing properties rather than manipulating an array.
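A minimal sketch of the lokimap idea (names here are placeholders):

```typescript
// $loki id -> object reference, maintained alongside data[] on insert/remove
const lokimap: Map<number, object> = new Map();

function onInsert(doc: { $loki: number }): void {
  lokimap.set(doc.$loki, doc); // keep the map in sync with data[]
}

// get() without requesting a data array position becomes a single map lookup
function get(id: number): object | undefined {
  return lokimap.get(id); // O(1) instead of a binary search over idIndex
}
```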
collection.by is now as fast as it was in lokijs (10x faster than the main branch), and collection.get is now just as fast when not requesting the data array position (same perf as before when requesting the dap, since that path uses binary search).
The only thing I have left is to look into whether we should persist the unique index in the database or regenerate it every time... I am typically of a mind to persist it in the database, but I am not sure if you agree. Or should we allow the user to specify whether to serialize indexes?
> I see the object references were removed from the unique index (was that you?)
Yes, I did. ;)
> I added a second map, so we now have two maps: (loki->val) and (val->loki)
I don't see any gain from this. Why do we need loki -> val?
> The only thing I have left is to look into whether we should persist the unique index in the database or regenerate it every time...
Leave this to me. I will add this functionality (also for rangedIndex) later.
Do you plan to refactor `filteredRows`, so that we use `$loki` as the only data position identifier?
Why have two maps? Well, for one example, to get rid of the iteration we needed to do in update. From an abstract perspective it enforces a 1-1 mapping, whereas a single hashmap would seem to enforce only a many-1 mapping. Remove accepts $loki and can easily compute the 'cached value' (whether cloning was enabled or not), which acts as a direct hash index into the other map, and vice versa for updates or inserts.
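As a sketch of that layout (class and method names are illustrative, not the final UniqueIndex API):

```typescript
// two hashmaps enforcing a 1-1 mapping between unique values and $loki ids
class BidirectionalUniqueIndex {
  private valToLoki = new Map<any, number>(); // by(): value -> $loki
  private lokiToVal = new Map<number, any>(); // maintenance: $loki -> value

  set(val: any, loki: number): void {
    this.valToLoki.set(val, loki);
    this.lokiToVal.set(loki, val);
  }

  // remove by $loki without iterating: the second map yields the value directly
  remove(loki: number): void {
    const val = this.lokiToVal.get(loki);
    this.valToLoki.delete(val);
    this.lokiToVal.delete(loki);
  }
}
```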
I was intending to use the lokimap (not only for a performance boost) but ideally, yes, to eventually replace usages (such as in filteredrows) with $loki. Once all codepaths no longer rely on data array positions, we could convert collection.data to a map itself and get rid of the lokimap. Certain areas such as iteration performance (Object.keys) for unindexed ops remain to be seen, but converting to $loki should simplify things.
Having said that, do you have any issues with my merging this into master?
Actually, it would probably be better not to pass in the doc and have all method params be either loki (number) or val (any). The doc probably will never matter if we do not cache object references.
After having looked into not passing the document as a parameter to any of the unique index maintenance functions, I do like that a lot better. But first, to take a step back and summarize (bear with me):
So, the unique index in LokiJS really maintained document references because that was faster than using ($loki) idIndex.
Then you ported to TypeScript, utilized the data array position as a rather fast (external to the index) array index 'map' (able to achieve theoretically similar speeds), and removed the object references. From a typescript perspective it was still aware of loki documents and accepted them as parameters.
I think it would be best to turn this into a purely functional key/val map construct which does not need to be aware of loki constructs. My current implementation has gone one step further and been enhanced into a bidirectional map, which ensures optimal performance at the cost of double the memory overhead for internal data structures.
Also, being a simplified, more abstract data structure... I think it would be best to remove the generic type variables and keyof annotations and treat it as pure number/any hashmap(s)... after all, our 'values' were being used as hash indexers, which in javascript forces a toString() for any Dates, numbers, etc. (I guess that means values of 12 and '12' would be duplicates... now as well as previously).
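For example, with a plain-object map:

```typescript
// object keys coerce through toString(), so 12 and "12" land on the same slot
const m: { [key: string]: number } = {};
m[12] = 1;
m["12"] = 2;
console.log(Object.keys(m).length); // 1 - the second write overwrote the first
```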
So back to the bidirectional map: the unique index was always intended to be the 'peak' performance index, so I considered it a given that the only real way to optimize it further is to reduce maintenance costs, in conjunction with re-aligning its purpose as outlined above. I can understand the argument in favor of lowering memory overhead at the cost of extra cpu (iteration), and if you are adamantly opposed to the second map, we can reconsider (later?). We don't even need to serialize it for the time being; I think the other realignments are more significant. If we don't serialize, RAM is generally more abundant than storage for all but the most high-end purposes, for which we can suggest increasing node's memory allocation :)
Also, as previously mentioned, collection.find() was adding a significant amount of overhead to calls to collection.by(). I do like that find ops can utilize the unique index without calling by(), but until we can clean up that code path, directly accessing the index from by() is much faster.
So, although I thought you were following my intent and objectives, I hope the above will clarify that some more. Feel free to argue or request further clarification wherever you have concerns.
Once the above is resolved, I would probably next do (simple) pr to rename "btree" and "btree-index.ts" to "avl" and "avl-index.ts" to avoid confusion in the future... I might rename BinaryTreeIndex class itself to AvlTreeIndex. Those changes should be simple but intersect a few files.
Then, more important than converting the resultset to use $loki, I think, is that (currently) all of our indexed ops go through the slower "loki" comparators (aeqHelper, gtHelper, ltHelper), when we should probably allow the user to specify which comparator they want to use for these - either at a default 'collection' level, at the property level, and/or as an option to the find() method. People with clean data having single-typed values for a property could utilize js comparators, while still having the option of "loki" comparisons for problematic or mixed types. We could allow the user to express this similarly to our index options in the loki constructor, where the user can pass an object with "property": "comparatorName" mappings.
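Something like this hypothetical shape, continuing the earlier db example (the "comparators" option name is an assumption, nothing implemented yet):

```typescript
// per-property comparator selection, mirroring the index options style
const users = db.addCollection("users", {
  comparators: {
    age: "js",          // clean, single-typed property: fast native comparisons
    legacyField: "loki" // mixed/dirty types: abstract loki comparisons
  }
});
```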
Then there might be converting the resultset (and by proxy the dynamic view) to use $loki ids. This would be more of a "formality" since I would still have the data array as an array I could iterate. Having the filteredrows array store $loki ids is more to make it easier to inspect during debugging or in unit tests. In fact, the unit tests which do raw expectations of data array 'indices' would need to be updated to something more meaningful than the cryptic calculations they are now.
Only after all of that would I experiment and benchmark the performance differences between having data as an array or a hashmap. I am not certain it will be worth it, but it is worth experimenting with (in a slower-paced experimental PR). If you depend on (loki) data array positions within your inverted index packages/classes, let me know. Whichever way we decide to go with collection.data for a non-beta release, all the other above changes still have merits of their own.
The above are pretty much all the main issues I think would re-align LokiDB to remove past 'clutter' and confusion, to free up room for all-new confusion you or I might want to add :)
Let me know if you have any issues or would like me to take a break for a bit after this so you can re-orient.
> Once the above is resolved, I would probably next do (simple) pr to rename "btree" and "btree-index.ts" to "avl" and "avl-index.ts" to avoid confusion in the future... I might rename BinaryTreeIndex class itself to AvlTreeIndex.
I like it to be specific, since it is not a B-Tree. ;)
> If you depend on (loki) data array positions within your inverted index packages/classes, let me know.
Not at all. Currently I use the array index but $loki would also work.
This is a huge task you have planned! Good luck! I am happy about every PR.
"What, if anything, the B stands for has never been established." :)
Thanks
For my own personal knowledge, and since I have made considerable changes to the benchmarks from LokiJS to LokiDB, I wanted to test them head-to-head with the same benchmark. So I did exactly that... requiring both libraries and switching between them to compare.
For the most part, unique/get and unindexed finds were very similar - so much so that I won't bother posting them.
Where I was really interested was the difference between LokiDB/AVL and LokiJS/BinaryIndices (adaptive)... here are the results, so I can mentally file them before moving on:
Finds (Low Document Count - 20K): (BI) 716,537 ops/s => (AVL) 1,096,431 ops/s

Finds (High Document Count - 200K): (BI) 266,152 ops/s => (AVL) 1,086,366 ops/s

Nightmare/Maintenance (Low Document Count - 40K):
- insert+find: (BI) 55,286 ops/s => (AVL) 216,837 ops/s
- remove+find: (BI) 23,702 ops/s => (AVL) 270,268 ops/s
- update+find: (BI) 127,481 ops/s => (AVL) 201,116 ops/s

Nightmare/Maintenance (High Document Count - 100K):
- insert+find: (BI) 15,780 ops/s => (AVL) 235,737 ops/s
- remove+find: (BI) 6,218 ops/s => (AVL) 310,231 ops/s
- update+find: (BI) 7,099 ops/s => (AVL) 205,482 ops/s
So LokiDB's AVL outperforms LokiJS's binary index. :+1:
That's great news, and it is also perfect for advertising LokiDB once it is released. ;)
Hey @Viatorus, I have had to re-think my approach to this, so here is the current expectation for the re-architecting changes:
I -was- planning on simply allowing injection of a named comparator (now that we have a map of comparator functions) into the unindexed LokiOps, so that an (unindexed) $lte would be determined by an arbitrary named comparator.
The overhead of that extra function call, compared to the default LokiOps inline comparisons (such as ===), seems a little too high a performance cost for that level of abstraction within a single LokiOps. It is probably about 30% overhead, which I think is too much for our optimized default code paths.
So I am thinking of taking our 'maps' one step further and allowing multiple named instances of LokiOps, each (potentially) having its own implementations of those ops.
So (at least on the unindexed comparators map branch for now), I am planning on adding some type of LokiOperatorMap and a LokiOperators class type, with the default implementation being the (fastest) pure js comparisons (for $eq, $lt, $gt, $gte, $lte, $between). I will map that under a name such as "default" or "js" within the LokiOperatorMap.
I will also create a "loki" or "abstract" class (which extends LokiOperators) and override the simple comparison ops to use aeqHelper, gtHelper, ltHelper and map that into LokiOperatorMap.
I will also create a "comparator" class which extends LokiOperators and allow the user to register comparator function to be called for those basic comparison ops. If the user wants to (for example) run a find with string insensitive comparisons, they might pick the 'comparator' LokiOperators along with a string insensitive comparator function to perform that.
When a user creates a collection, I will add an option for defaultLokiOperators (string) where they can choose which operator set to use for all find queries. There could potentially be options on our find methods to override that on a query-by-query basis as well.
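Hypothetical usage tying those together (the option names are placeholders until this lands):

```typescript
// register a case-insensitive string comparator, then pick operator sets per collection
const db = new Loki("app.db", {
  comparatorMap: {
    ciString: {
      compare: (a: string, b: string) => {
        const la = a.toLowerCase(), lb = b.toLowerCase();
        return la < lb ? -1 : la > lb ? 1 : 0;
      }
    }
  }
});
const clean = db.addCollection("clean", { defaultLokiOperators: "js" });   // fastest path
const dirty = db.addCollection("dirty", { defaultLokiOperators: "loki" }); // mixed-type safe
```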
So to summarize: as part of an effort to allow the user fine-grained control over LokiOps, comparators, and ranged indexes, they can utilize 'ComparatorMap', 'RangedIndexFactoryMap', and 'LokiOperatorMap' to customize functionality. They could in theory create their own loki ops class which extends the base 'LokiOperators' and override or inject new ops there as well. Or they could override the existing LokiOperators in the default sets we bundle.
Let me know if you foresee any issues with the above. I think the overhead will be almost zero for find ops, as we can get/retain the operators from the map once before running ops on each document. Names are also variable, so let me know if you have any preference for names. I was planning on moving the LokiOps into 'helper', which will probably need to be renamed as soon as I can think up a good name for that mashup of (ranged index, comparator, ops) interfaces, classes, and instances.
Thank you for thinking this through.
I hope users don't get too confused. Maybe we can adopt some naming conventions from other databases.
Overall great idea!
With the latest commit 05fa442 I think I am mostly complete with the feature and refactorings.
To summarize, in loki constructor the user can inject their own :
Within collection, user can define :
In the future I can see our resultset.find method accepting an options object rather than just a bool firstOnly. In that options object, we might allow the user to specify an override for the ops package.
We will probably need 'wiki-like' help for some of the higher-level documentation of items such as this. I am thinking maybe we should do this as markdown within the project (and link to it from the lokidb wiki page). Any objection to doing that, and if not, where in our folder structure might we put it so as to not conflict with the existing /docs folder, which I believe will be used for the autogenerated documentation reference?
I have planned to add and clean up unit test coverage around this, but if you think you will want to make your own changes/reorganizations soon, let me know and I will PR this and change it later. Having two people make massive changes at this stage can be cumbersome :S If not, I hope to clean this up and PR it sometime next week.
> We will probably need 'wiki-like' help
Please use the docs folder for this. We are using mkdocs for documentation. Create a new markdown file and write your documentation. Each release updates the documentation. :)
For example, I have this documentation in the pipeline: https://github.com/LokiJS-Forge/LokiDB/blob/docs/docs/full-text-search.md
> Having two people make massive changes at this stage can be cumbersome
Currently, I don't have that much time, so please go ahead. ;)
Hey @Viatorus, I think with PR #152 this issue is done... can you review it and see if you find any issues with it?
There are a few markdown files in /docs that might help explain the areas of change.
Hey obeliskos, Thank you. I will review it tonight (UTC+2).
Thanks for the review @Viatorus...
A few suggestions about how I think we should handle PRs in the future:
I don't think we should necessarily -require- code reviews for collaborators. That feature was recently turned on, but I don't think it should be. My last few PRs have been fairly significant, so it makes sense for you to ask to review, or for me to ask if you would want to review, PRs like these (if your time permits) to ensure we are in agreement on major changes, and additionally to find and question potential bugs, best practices, typos, etc. Perhaps if we get 5 or 10 active contributors, I can see requiring code reviews being a good policy, but for just the two of us I hope we know what the other has concerns about and might want to review.
If either of us would like to review pull requests, we can clarify which (or all) of them we would like to review, and the other can assign and await the review for those PRs. If either of us is providing a code review, we should just ask questions, request changes, or approve. As collaborators, we should probably each be merging our own pull requests.
I think we've both made good efforts to collaborate better with each other's workflow. If there are issues with the above or any remaining areas you would like to see improved, feel free to discuss here or in another issue.
I'll close out this issue soon, as the scope I had for it has been achieved. The -major- changes which I had considered 'required for v1.0' have been implemented.
Let me know if you think it would be a good idea to create a new issue (or use an existing one) to itemize any remaining changes we would like to see make it into a version 1 release. That might allow ourselves and others to better gauge our progress toward it.
I hope you didn't take it personally when I enabled the review requirement. I just thought it was a good feature for keeping track of code changes for both of us. Sorry that I merged your PRs after review; as you said, that should be the author's part, not the reviewer's.
I will undo it.
No worries... thanks 😃
In LokiJS, our ops 'evolved' as we added binary indices and adapted to problems with javascript equality and relational ops when dealing with dirty data and mixed types. Recently we have even added ops for explicitly opting into js relational ops ($jgt, $jgte, $jlt, $jlte, $jbetween)... not for functional output differences, but just because they are faster in some situations.
I also suggest we might add different 'modes' for dealing with relational operations... by default we should probably continue to expect that users will throw dirty data at it, and we will need to rely on 'loki sorting' to establish a reliable 'range' over which our binary indices can work. But many people may be producing their own data internally and have confidence in the cleanliness of their data... especially if they also adopt typescript for the code that creates the documents they insert... in those cases we might want to 'reroute' our $ltHelper and $gtHelper to utilize native javascript relational and equality operators... doing so should increase performance even more.
I will propose suggestions later but for now I will just attempt to quantify/itemize factors involved in this for discussion.