appy-one / acebase

A fast, low memory, transactional, index & query enabled NoSQL database engine and server for node.js and browser with realtime data change notifications

How to speed up loading a 4GB CSV? #65

Closed: qpwo closed this 2 years ago

qpwo commented 2 years ago

Hey, I'm wondering if I'm doing something wrong. I have a 4GB CSV file where each line maps one string to a list of strings. There are about 3 million lines.

write lock on path "whatever" by tid whatever (_lockAndWrite "whatever") is taking a long time to complete [1]

Here's the main bit of code:

  // processLines, frac, numStargazerRows, log and db are helpers defined elsewhere in the script
  let giantObj: Record<string, string[]> = {}

  await processLines(path + '/data.tsv', async (line, num) => {
      if (num % 10_000 === 0) {
          log('on line', frac(num, numStargazerRows))
          log('putting batch in database:')
          await db.ref('gazers').update(giantObj)
          log('batch done')
          log('freeing old obj hopefully')
          giantObj = {}
      }
      const cols = line.split('\t')
      const head = cols[0].replace('/', '!!') // avoid '/' in the key, since it's AceBase's path separator
      const tail = cols.slice(1)
      giantObj[head] = tail
  })

Any suggestions for loading the data in faster?

qpwo commented 2 years ago

Could it be something with the async functions, maybe causing some kind of mutex thrashing? I don't think two update() calls ever overlap.

appy-one commented 2 years ago

Thanks for your detailed report!

On the database side there are a couple of things to consider.

Looking at your code, you might want to try preparing the 10,000 records synchronously and then doing the batch update. I don't think that will make a huge difference, but as it stands there's an additional event-loop tick per record. You might also try reducing the batch size to 1,000 records at a time.
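
Something along these lines, just to illustrate (an untested sketch that reuses the path, processLines and db from your snippet, and assumes processLines awaits a promise returned by the callback):

  const BATCH_SIZE = 1_000 // smaller batches, as suggested above
  let batch: Record<string, string[]> = {}
  let count = 0

  await processLines(path + '/data.tsv', (line) => {
      // Prepare each record synchronously -- no extra async tick per line
      const cols = line.split('\t')
      const head = cols[0].replace('/', '!!')
      batch[head] = cols.slice(1)

      if (++count % BATCH_SIZE === 0) {
          // Only touch the database (and yield to the event loop) once per batch
          const pending = db.ref('gazers').update(batch)
          batch = {}
          return pending
      }
  })

  // Don't forget the records left over after the last full batch
  if (Object.keys(batch).length > 0) {
      await db.ref('gazers').update(batch)
  }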

I've recently been working on a streaming import for JSON data, but the current implementation is (too) slow. Improvements are coming, but they will take some time. I might actually look at implementing a CSV import before then, because its simple "flat" format makes it very easy to create batches. Note that transforming the import data (as your code does) won't be possible then, so batch updating will remain the preferred strategy in your case.

Let me know if you're able to speed up the process with the above info. I'll also perform some tests with generated data.

appy-one commented 2 years ago

I've done some testing with generated data, and I definitely agree its performance must be improved here. I have a hunch where the bottleneck might be; I'll dive deeper into it.

qpwo commented 2 years ago

Where do you think it might be?

appy-one commented 2 years ago

There are multiple places, but I found that the biggest performance improvements can be made in the way existing child nodes are queried and new ones are added. For large object collections, an index is created for the child nodes to enable quick lookups. When doing small queries and updates, it's fast enough to query the index for each requested child, but that becomes really slow if 10,000 children have to be looked up one at a time. Similarly, adding new entries to an index one at a time is not very fast when tens of thousands are added.
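
To illustrate the idea with a made-up interface (purely illustrative, not AceBase's actual internals):

  interface ChildInfo { address: number } // hypothetical lookup result
  interface ChildIndex {
      lookup(key: string): Promise<ChildInfo | null>              // one index traversal per key
      lookupMany(keys: string[]): Promise<Map<string, ChildInfo>> // one traversal for many keys
  }

  // Before: updating 10,000 children means 10,000 separate index traversals
  async function lookupOneByOne(index: ChildIndex, keys: string[]) {
      for (const key of keys) {
          await index.lookup(key)
      }
  }

  // After: one batched lookup amortizes the traversal overhead over all keys
  async function lookupBatched(index: ChildIndex, keys: string[]) {
      await index.lookupMany(keys)
  }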

I've made quite a few improvements that allow multiple index lookups and inserts to happen at the same time, and I'm already seeing performance improvements in the 20x range in my current tests. These changes touch critical parts of the storage engine, so I'll have to make 100% sure none of this can corrupt a database at any time; I'll be doing extensive testing this week.

appy-one commented 2 years ago

As you can see, truckloads of commits! I improved many parts of the code and tackled quite a number of issues along the way. Tests are now looking good, and I'm running the last long-running tests this weekend. If everything stays this way I'll publish to npm on Monday šŸ„³

appy-one commented 2 years ago

I just published acebase version 1.15.0, let me know if it works!

qpwo commented 2 years ago

Appreciated, I'll give it a run.

appy-one commented 2 years ago

@qpwo Any news about this?

qpwo commented 2 years ago

Hey, sorry to report I'm getting 30 seconds per 10k batch and a crash after 6 batches.

I had a bug in my script: I wasn't awaiting my transactions!

So I'm getting about 30 seconds to a minute per 10k records. For the full 3 million records I'd expect it to take about (30 seconds * 3,000,000 / 10,000) = 9,000 seconds, roughly 2.5 hours. That's not crazy slow.

Full test script and logs here

https://gist.github.com/qpwo/1445f4a7053ba5e712ea2628eb1c6e38

qpwo commented 2 years ago

Is there a simple example somewhere of using multiple CPU cores?

100k records in 10k batches took a total of 389 seconds.

I should also mention that each record is a list of up to ~10k strings of under 40 characters each (maybe averaging a few dozen strings per record).

appy-one commented 2 years ago

Is there a simple example somewhere of using multiple CPU cores?

Sorry for the late reply. See Standard Node.js clusters for info about running multiple threads. Kindly note that you can't use multiple threads to speed up your import, because each thread would acquire an exclusive write lock on the target collection; they'd effectively just take turns importing batches.
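
For reference, the bare mechanics look something like this (generic Node.js, not AceBase-specific; see the link above for wiring AceBase itself into a cluster):

  import cluster from 'node:cluster'
  import os from 'node:os'

  if (cluster.isPrimary) {
      // Fork one worker per CPU core (isPrimary needs Node 16+; older versions use isMaster)
      for (let i = 0; i < os.cpus().length; i++) {
          cluster.fork()
      }
  } else {
      // Each worker would open the database and take on part of the work here, but
      // writes to the same collection still need an exclusive lock, so import
      // batches would effectively run one at a time anyway.
      console.log(`worker ${process.pid} started`)
  }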

appy-one commented 2 years ago

I'm closing this for now. Feel free to comment later, I'll post an update once I've worked on this again.


YanDevDe commented 1 year ago

I'm also facing this issue; acebase is simply too slow for it. I've decided to use LokiJS instead, which is way quicker (1 million entries took only 1206 ms to insert, 878 ms to query, 896 ms to update, and 2555 ms to save to IndexedDB).

acebase, in this case, takes more like 1 minute in some of these areas :/

qpwo commented 1 year ago

((Disk can never compete with memory -- different use cases, right?))

YanDevDe commented 1 year ago

Sorry, I misread - I thought acebase was stored in memory too. Different use case, yep :)