appy-one / acebase

A fast, low memory, transactional, index & query enabled NoSQL database engine and server for node.js and browser with realtime data change notifications

How to speed up loading a 4GB CSV? #65

Closed: qpwo closed this 2 years ago

qpwo commented 2 years ago

Hey, I'm wondering if I'm doing something wrong. I have a 4GB CSV file where each line maps one string to a list of strings. There are about 3 million lines.

write lock on path "whatever" by tid whatever (_lockAndWrite "whatever") is taking a long time to complete [1]

Here's the main bit of code:

  // processLines, frac, numStargazerRows, log and db are helpers defined elsewhere in the script
  let giantObj: Record<string, string[]> = {}

  await processLines(path + '/data.tsv', async (line, num) => {
      if (num % 10_000 === 0) {
          log('on line', frac(num, numStargazerRows))
          log('putting batch in database:')
          await db.ref('gazers').update(giantObj)
          log('batch done')
          log('freeing old obj hopefully')
          giantObj = {}
      }
      const cols = line.split('\t')
      const head = cols[0].replace('/', '!!') // avoid '/' in the key, since it's AceBase's path separator
      const tail = cols.slice(1)
      giantObj[head] = tail
  })

Any suggestions for loading the data in faster?

qpwo commented 2 years ago

Could it be something with the async functions, maybe causing some kind of mutex thrashing? I don't think two update() calls ever overlap.

appy-one commented 2 years ago

Thanks for your detailed report!

On the database side there are a couple of things to consider.

Looking at your code, you might want to try preparing the 10,000 records synchronously and then doing the batch update. I don't think that will make a huge difference, but as it stands there's an additional event-loop tick per record. You might also try reducing the batch size to 1,000 records at a time.
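
Something along these lines, just to illustrate (an untested sketch that reuses the path, processLines and db from your snippet, and assumes processLines awaits a promise returned by the callback):

  const BATCH_SIZE = 1_000 // smaller batches, as suggested above
  let batch: Record<string, string[]> = {}
  let count = 0

  await processLines(path + '/data.tsv', (line) => {
      // Prepare each record synchronously -- no extra async tick per line
      const cols = line.split('\t')
      const head = cols[0].replace('/', '!!')
      batch[head] = cols.slice(1)

      if (++count % BATCH_SIZE === 0) {
          // Only touch the database (and yield to the event loop) once per batch
          const pending = db.ref('gazers').update(batch)
          batch = {}
          return pending
      }
  })

  // Don't forget the records left over after the last full batch
  if (Object.keys(batch).length > 0) {
      await db.ref('gazers').update(batch)
  }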

I've recently been working on a streaming import for JSON data, but the current implementation is (too) slow. Improvements are coming, but they will take some time. I might actually look at implementing a CSV import before then, because its simple "flat" format makes it very easy to create batches. Note that transforming the import data (as your code does) won't be possible then, so batch updating will remain the preferred strategy in your case.

Let me know if you're able to speed up the process with the above info. I'll also perform some tests with generated data.

appy-one commented 2 years ago

I've done some testing with generated data, and I definitely agree its performance must be improved here. I have a hunch where the bottleneck might be; I'll dive deeper into it.

qpwo commented 2 years ago

Where do you think it might be?

appy-one commented 2 years ago

There are multiple places, but I found that the biggest performance improvements can be made in the way existing child nodes are queried and new ones are added. For large object collections, an index is created for the child nodes to enable quick lookups. When doing small queries and updates, it's fast enough to query the index for each requested child, but that becomes really slow if 10,000 children have to be looked up one at a time. Similarly, adding new entries to an index one at a time is not very fast when tens of thousands are added.
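
To illustrate the idea with a made-up interface (purely illustrative, not AceBase's actual internals):

  interface ChildInfo { address: number } // hypothetical lookup result
  interface ChildIndex {
      lookup(key: string): Promise<ChildInfo | null>              // one index traversal per key
      lookupMany(keys: string[]): Promise<Map<string, ChildInfo>> // one traversal for many keys
  }

  // Before: updating 10,000 children means 10,000 separate index traversals
  async function lookupOneByOne(index: ChildIndex, keys: string[]) {
      for (const key of keys) {
          await index.lookup(key)
      }
  }

  // After: one batched lookup amortizes the traversal overhead over all keys
  async function lookupBatched(index: ChildIndex, keys: string[]) {
      await index.lookupMany(keys)
  }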

I've made quite a few improvements that allow multiple index lookups and inserts to happen at the same time, and I'm already seeing performance improvements in the 20x range in my current tests. These changes touch critical parts of the storage engine, so I'll have to make 100% sure none of this can corrupt a database at any time; I'll be doing extensive testing this week.

appy-one commented 2 years ago

As you can see, truckloads of commits! I improved many parts of the code and tackled quite a number of issues along the way. Tests are now looking good, and I'm running the last long-running tests this weekend. If everything stays this way I'll publish to npm on Monday šŸ„³

appy-one commented 2 years ago

I just published acebase version 1.15.0, let me know if it works!

qpwo commented 2 years ago

Appreciated, I'll give it a run.

appy-one commented 2 years ago

@qpwo Any news about this?

qpwo commented 2 years ago

Hey, sorry to report I'm getting 30 seconds per 10k batch and a crash after 6 batches.

I had a bug in my script: I wasn't awaiting my transactions!

So I'm getting about 30 seconds to a minute per 10k records. For the full 3 million records I'd expect it to take about (30 seconds * 3,000,000 / 10,000) = 9,000 seconds, roughly 2.5 hours. That's not crazy slow.

Full test script and logs here

https://gist.github.com/qpwo/1445f4a7053ba5e712ea2628eb1c6e38

qpwo commented 2 years ago

Is there a simple example somewhere of using multiple CPU cores?

100k records in 10k batches took a total of 389 seconds.

I should also mention that each record is a list of up to ~10k strings of under 40 characters each (maybe averaging a few dozen strings per record).

appy-one commented 2 years ago

Is there a simple example somewhere of using multiple CPU cores?

Sorry for the late reply. See Standard Node.js clusters for info about running multiple threads. Kindly note that you can't use multiple threads to speed up your import, because each thread would acquire an exclusive write lock on the target collection; they'd effectively just take turns importing batches.
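
For reference, the bare mechanics look something like this (generic Node.js, not AceBase-specific; see the link above for wiring AceBase itself into a cluster):

  import cluster from 'node:cluster'
  import os from 'node:os'

  if (cluster.isPrimary) {
      // Fork one worker per CPU core (isPrimary needs Node 16+; older versions use isMaster)
      for (let i = 0; i < os.cpus().length; i++) {
          cluster.fork()
      }
  } else {
      // Each worker would open the database and take on part of the work here, but
      // writes to the same collection still need an exclusive lock, so import
      // batches would effectively run one at a time anyway.
      console.log(`worker ${process.pid} started`)
  }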

appy-one commented 2 years ago

I'm closing this for now. Feel free to comment later, I'll post an update once I've worked on this again.


YanDevDe commented 1 year ago

I'm also facing this issue; acebase is simply too slow for it. I've decided to use LokiJS instead, which is way quicker (1 million entries took only 1206 ms to insert, 878 ms to query, 896 ms to update, and 2555 ms to save to IndexedDB).

acebase, in this case, takes more like 1 minute in some of these areas :/

qpwo commented 1 year ago

((Disk can never compete with memory -- different use cases, right?))

YanDevDe commented 1 year ago

Sorry, I misread - I thought acebase was stored in memory too. Different use case, yep :)