qpwo closed this issue 2 years ago
Could be something with the async functions, maybe causing some kind of mutex thrashing? I don't think that two update()
calls ever run over each other.
Thanks for your detailed report!
On the database side there are a couple of things to consider:

AceBase allows an update a maximum amount of time to complete before canceling the write lock on the resource. If the update takes longer, it will deny further lock requests for that transaction, effectively canceling them. The warnings are shown every 30s and the lock is effectively canceled after 2 minutes. I will be making this timeout optional and configurable soon, but you can increase it yourself by editing the LOCK_TIMEOUT constant in acebase/src/node-lock.js. Another option is making your batches smaller so they are able to be stored before timing out.

Looking at your code, you might also want to try preparing the 10,000 records synchronously and then doing the batch update. I don't think that will make a huge difference, but right now it's an additional tick per record. Maybe also try reducing the batch size to 1,000 records at a time.
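A minimal sketch of that batching strategy: split the records into chunks and apply each chunk as a single update. The `toBatches`/`importInBatches`/`flush` names are made up for illustration; in a real script, `flush` would wrap AceBase's `db.ref('mycollection').update(batch)`.

```javascript
// Split [key, value] pairs into batch objects of at most batchSize entries.
function toBatches(records, batchSize) {
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    const batch = {};
    for (const [key, value] of records.slice(i, i + batchSize)) {
      batch[key] = value;
    }
    batches.push(batch);
  }
  return batches;
}

// Apply the batches one at a time, awaiting each write before starting
// the next so a slow batch can't pile up behind the lock timeout.
async function importInBatches(records, batchSize, flush) {
  for (const batch of toBatches(records, batchSize)) {
    await flush(batch); // e.g. await db.ref('mycollection').update(batch)
  }
}
```

Smaller batches mean each individual `update()` holds the write lock for less time, which is exactly what keeps it under the lock timeout.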
I've recently worked on developing a streaming import for JSON data, but the current implementation is (too) slow. Improvements are coming, but will take some time. I might actually look at implementing a CSV import before then, because its simple "flat" format makes it very easy to create batches. Note that transforming the import data (as in your code) will not be possible then, so batch updating will remain the preferred strategy in your case.
Let me know if you are able to speed up the process with the above info, I'll also perform some tests with generated data.
I've done some testing with generated data, I definitely agree its performance must be improved here. I have a hunch where the bottleneck might be, I'll dive deeper into it.
Where do you think it might be?
There's multiple places, but I found the biggest performance improvements can be made in the way existing child nodes are queried, and new ones are added. For large object collections, an index is created for the child nodes to enable quick lookups. When doing small queries and updates, it's fast enough to query the index for each requested child - but that becomes really slow if 10,000 children have to be looked up one at a time. Similarly, adding new entries to an index one at a time is not very fast when 10,000s are added.
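The idea of resolving many child lookups in one pass instead of one query per child can be illustrated like this. This is just a sketch of the concept, not AceBase's actual index code; `batchLookup` and the `{ key, address }` entry shape are made up for illustration.

```javascript
// Resolve many keys against a sorted index in a single merge pass,
// instead of doing one separate search per key. Both the index entries
// and the requested keys are walked in order exactly once.
function batchLookup(indexEntries, keys) {
  const wanted = [...keys].sort();
  const result = new Map();
  let i = 0;
  for (const key of wanted) {
    // Advance the index cursor until it reaches (or passes) this key.
    while (i < indexEntries.length && indexEntries[i].key < key) i++;
    if (i < indexEntries.length && indexEntries[i].key === key) {
      result.set(key, indexEntries[i].address);
    }
  }
  return result;
}
```

For 10,000 requested children this does one scan of the index instead of 10,000 independent lookups, which is where that kind of 20x-range speedup can come from.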
I've made quite a few improvements that allow multiple index lookups and inserts to be performed at the same time, and I'm seeing performance improvements in the 20x range already in my current tests. These changes do impact critical parts of the storage engine, so I'll have to make 100% sure none of this can corrupt a database at any time, and I'll be doing extensive testing this week.
As you can see, truckloads of commits! Improved many parts of the code and tackled quite a number of issues along the way, tests are now looking good, I'm performing the last long-running tests this weekend. If all stays this way I'll publish to npm Monday 🥳
I just published acebase version 1.15.0, let me know if it works!
Appreciated, I'll give it a run
@qpwo Any news about this?
Hey sorry to report I'm getting 30 seconds on 10k batches and a crash after 6 batches.
I had a bug in my script: I wasn't awaiting my transactions!
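That bug is easy to make, and its effect is dramatic: without `await`, the loop launches every batch write at once instead of one after another. A toy illustration (the `write` call here just simulates a batch update; in the real script it would be `db.ref('mycollection').update(batch)`):

```javascript
// Measure how many simulated batch writes are in flight at once,
// with and without awaiting each one inside the loop.
async function measureConcurrency(batchCount, awaitEach) {
  let inFlight = 0;
  let peak = 0;
  const write = async () => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await new Promise(resolve => setTimeout(resolve, 5)); // fake I/O
    inFlight--;
  };
  const pending = [];
  for (let i = 0; i < batchCount; i++) {
    const p = write();
    if (awaitEach) await p;   // fixed: one batch at a time
    else pending.push(p);     // buggy: fire-and-forget
  }
  await Promise.all(pending);
  return peak;
}
```

With `awaitEach` the peak is 1; without it, all batches run concurrently and fight over the same write lock.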
So I'm getting about 30 seconds to a minute per 10k records. For the full 3 million records I would expect it to take about (30 seconds × 3,000,000 / 10,000) = 2.5 hours. That's not crazy slow.
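Spelled out, that back-of-envelope estimate is:

```javascript
// 3 million records in batches of 10k, at ~30 seconds per batch:
const batches = 3_000_000 / 10_000; // 300 batches
const totalSeconds = batches * 30;  // 9,000 seconds
const hours = totalSeconds / 3600;  // 2.5 hours
```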
Full test script and logs here
https://gist.github.com/qpwo/1445f4a7053ba5e712ea2628eb1c6e38
Is there a simple example somewhere of using multiple CPU cores?
100k records in 10k batches took a total of 389 seconds.
I should also mention each record is a list of up to ~10k strings of under 40 characters each (maybe averaging a few dozen strings per record).
Is there a simple example somewhere of using multiple CPU cores?
Sorry for the late reply. See standard Node.js clusters for info about running multiple processes. Kindly note that you can't use multiple threads to speed up your import, because each thread would acquire an exclusive write lock on the target collection - they'd effectively just take turns importing batches.
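The "taking turns" behavior can be demonstrated with a minimal exclusive lock. This is a sketch to show the principle, not AceBase's actual lock implementation: even if two writers start "in parallel", the second one only runs after the first fully completes.

```javascript
// A minimal exclusive lock: each run() waits for the previous holder
// to finish before executing, so concurrent callers are serialized.
class ExclusiveLock {
  constructor() {
    this._tail = Promise.resolve();
  }
  async run(fn) {
    const prev = this._tail;
    let release;
    this._tail = new Promise(resolve => (release = resolve));
    await prev;              // wait for the previous lock holder
    try {
      return await fn();     // hold the lock while fn runs
    } finally {
      release();             // hand the lock to the next waiter
    }
  }
}
```

This is why multiple importer processes on the same collection add coordination overhead without adding throughput: the exclusive write lock turns them back into a sequential queue.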
I'm also facing this issue, acebase is simply too slow for it. I've decided to use LokiJS instead, which is way quicker (1 million entries took only 1206ms to insert, 878ms to query, 896ms to update and 2555ms to save to IndexedDB).
acebase rather takes around 1 minute in some of these areas :/
((Disk can never compete with memory -- different use cases right))
Sorry, I misread - I thought acebase was stored in memory too. Different usecase, yep :)
Hey, I'm wondering if I'm doing something wrong. I have a 4GB CSV file where each line points one string to a list of strings. I have about 3 million lines, which I'm loading with update(). So I'm batching my updates into groups of 10k rows, but each batch is still taking a minute or two, so the whole file is going to take five to ten hours at this rate. I can't do the whole thing at once because I'll run out of RAM, even with --max-old-space-size=12000.
write lock on path "whatever" by tid whatever (_lockAndWrite "whatever") is taking a long time to complete
Here's the main bit of code:
Any suggestions for loading the data in faster?