daywiss opened this issue 10 years ago
Can you tell me more about the use case? Are you doing a bulk load of an entire db, or is this a partial update?
I would be happy to merge a pull request for this.
I have a ton of data I am trying to initialize the database with, probably half a million or more small JSON inserts. The data is currently being streamed in from files on the local machine, transformed into JSON, then inserted into the database. I am using a multilevel database which is running as a separate application.
Originally I was just doing puts and waiting for the callback before proceeding with more puts, assuming the callback would do some kind of flow control, but this was not the case. Eventually the multilevel server would crash.
So now I pipe to a write stream on each sublevel. I thought this would fix the problem, but the db server is still crashing about halfway through the data.
This is less of an issue with sublevel and more me trying to figure out how to pipeline my insert process without overwhelming the database or introducing some kind of artificial slowdown.
But what is the data? Is it data you are pulling out of an SQL database? Or CSVs? Is it from one file or many files? I ask because it's much easier to analyze a problem if I can visualize it.
It sounds like this is really just a bulk loading problem - I know these can be difficult, and I don't think we have a general-purpose solution yet - but I know that @maxogden knows the most about this, and also @brycebaril.
The data is coming from Excel files (xls). Each row in a file contains some time-stamped data. There are about 40k Excel files with maybe 10 or so rows each. I'm reading them in one file at a time and creating a stream of rows, each of which eventually gets inserted under a unique key based on the timestamp.
Fortunately I only need to do this once, and I can restart the server and continue where I left off for the most part, so I can get this done; it's just something I have been trying to work out.
What exactly is the bulk loading problem? Are the streams that leveldb creates not full streams? Perhaps this is also due to my use of level-sec, which is creating secondary indices that are also getting pushed into the database at the same time?
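Not part of the thread, but one hypothetical way to build a unique key per row when several rows can share a timestamp: zero-pad the timestamp so lexicographic order matches numeric order, and append a per-timestamp sequence number.

```javascript
// Hypothetical key scheme for time-stamped rows: zero-pad the epoch
// milliseconds so keys sort in time order, then add a sequence suffix
// so rows that share a timestamp still get distinct keys.
const counts = new Map();

function makeKey(tsMillis) {
  const n = (counts.get(tsMillis) || 0) + 1;
  counts.set(tsMillis, n);
  return String(tsMillis).padStart(13, '0') + ':' + String(n).padStart(4, '0');
}

console.log(makeKey(1405434604000)); // first row at this timestamp
console.log(makeKey(1405434604000)); // second row, distinct key
```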
Oh, now knowing that changes everything!
What Excel parser are you using? I was looking into this recently and ended up just shelling out to https://www.npmjs.org/package/j, but, to be honest, I was pretty horrified to read the code.
I wouldn't be surprised if there was a memory leak in the xls stuff. Does it still crash if you load a plain file? You have a lot of moving parts here; can you make a simple script that reproduces the problem? What sort of crash is this? Can you post the error message?
But to be honest, since you only need to do this once, the simplest thing would just be to have a way for it to recover from the last successful write before the crash. If the timestamps are in order, then you could just look in the db for the last timestamp and then open the next file.
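Not from the thread, but a sketch of that recovery idea, assuming keys are ordered timestamps and each source file's timestamp range is known (the `lastRowTs` field and file names here are made up for illustration):

```javascript
// Hypothetical resume logic: `lastWrittenTs` would come from a reverse
// key scan of the db (i.e. the highest timestamp key already written);
// each file's last-row timestamp is assumed to be cheaply knowable.
function filesToResume(files, lastWrittenTs) {
  if (lastWrittenTs == null) return files; // nothing written yet: fresh load
  // Keep only files that may contain rows newer than the last written one.
  return files.filter(f => f.lastRowTs > lastWrittenTs);
}

const files = [
  { name: 'a.xls', lastRowTs: 100 },
  { name: 'b.xls', lastRowTs: 200 },
  { name: 'c.xls', lastRowTs: 300 }
];

console.log(filesToResume(files, 150).map(f => f.name)); // ['b.xls', 'c.xls']
```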
The only Excel parser I found that works on the files I have is pyspreadsheet. It has its own issues, but I've managed to work around them.
multilevel is running as a separate application from everything else, and it's this that is crashing. I have gotten a couple of different error messages when it crashes:
_stream_writable.js:0
(function (exports, require, module, __filename, __dirname
) { // Copyright Joy
^
RangeError: Maximum call stack size exceeded
and this
...../node_modules/multilevel/node_modules/mux-demux/node_modules/json-buffer/index.js:34
return JSON.stringify(/^:/.test(o) ? ':' + o : o)
^
RangeError: Maximum call stack size exceeded
which might be related to this: https://github.com/dominictarr/level-live-stream/issues/7
It makes me think that the write stream on leveldb is not pausing the incoming stream and eventually overflowing. But I don't really know the details of how it all works; I am just speculating.
I might try to make a test case to duplicate this problem.
Oh, those are both RangeErrors, and have nothing to do with backpressure. Can you get an input that reproduces the error?
Do you have the latest multilevel? It looks like one of those errors comes from json-buffer. Can you post the output of:
npm ls
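For context (not from the thread): a "Maximum call stack size exceeded" RangeError typically comes from deep recursion, not from write pressure. `JSON.stringify`, for example, recurses once per nesting level, so a deeply nested value blows the stack on its own:

```javascript
// Build a deeply nested object; JSON.stringify recurses per level, so a
// deep enough chain overflows the call stack with a RangeError -- no
// streams or backpressure involved.
let root = {};
let cur = root;
for (let i = 0; i < 1e6; i++) cur = cur.next = {};

let caught = null;
try {
  JSON.stringify(root);
} catch (e) {
  caught = e.constructor.name;
}
console.log(caught); // "RangeError" with default V8 stack sizes
```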
> I wouldn't be surprised if there was a memory leak in the xls stuff
@daywiss Could you test against j directly, or against http://oss.sheetjs.com/js-xls/ or http://oss.sheetjs.com/js-xlsx/? If there is a file that can't be read, I'd definitely like to investigate further (especially if it is a memory leak).
@dominictarr Sadly, I have not been able to make a test case that reproduces the problem, though I am still experiencing this range error crash in my application.
multilevel version 6.0.0, json-buffer 2.0.9
@SheetJSDev When I try to drop my xls file onto your app, nothing happens; when I paste it as base64, nothing happens. Here is the base64:
CQAEAAIAEAADAA8AAQACAAAAAAAAAAAAAABAAwAPAAEAAwAAAAAAAAAAAJ/CQAMADwABAAQAAAAA
i2zn+ykNkEADAA8AAQAFAAAAAM/3U+OlE0FAAwAPAAEABgAAAAAAAAAAAAAAAAMADwACAAIAAAAA
AAAAAAAAAEADAA8AAgADAAAAAAAAAAAAssJAAwAPAAIABAAAAABKDAIrhwiQQAMADwACAAUAAAAA
CFVq9kAHQUADAA8AAgAGAAAAAAAAAAAAAAAAAwAPAAMAAgAAAAAAAAAAAAAAQAMADwADAAMAAAAA
AAAAAADFwkADAA8AAwAEAAAAAEoMAisHCJBAAwAPAAMABQAAAAD99nXgnA1BQAMADwADAAYAAAAA
AAAAAAAAAAADAA8ABAACAAAAAAAAAAAAAABAAwAPAAQAAwAAAAAAAAAAANjCQAMADwAEAAQAAAAA
XrpJDAIOkEADAA8ABAAFAAAAAMeA7PXuD0FAAwAPAAQABgAAAAAAAAAAAAAAAAMADwAFAAIAAAAA
AAAAAAAAAEADAA8ABQADAAAAAAAAAACA6sJAAwAPAAUABAAAAACF61G4HuqPQAMADwAFAAUAAAAA
aTUk7rEQQUADAA8ABQAGAAAAAAAAAAAAAAAAAwAPAAYAAgAAAAAAAAAAAAAAQAMADwAGAAMAAAAA
AAAAAID9wkADAA8ABgAEAAAAADMzMzMzD5BAAwAPAAYABQAAAABA3qtWJghBQAMADwAGAAYAAAAA
AAAAAAAAAAADAA8ABwACAAAAAAAAAAAAAABAAwAPAAcAAwAAAAAAAAAAgBDDQAMADwAHAAQAAAAA
xSCwcmj0j0ADAA8ABwAFAAAAAEmAmlq2FkFAAwAPAAcABgAAAAAAAAAAAAAAAAMADwAIAAIAAAAA
AAAAAAAAAEADAA8ACAADAAAAAAAAAACAI8NAAwAPAAgABAAAAADFILByaBKQQAMADwAIAAUAAAAA
L6hvmdMNQUADAA8ACAAGAAAAAAAAAAAAAAAAAwAPAAkAAgAAAAAAAAAAAAAAQAMADwAJAAMAAAAA
AAAAAIA2w0ADAA8ACQAEAAAAAAIrhxbZ4I9AAwAPAAkABQAAAAAL73IR3xlBQAMADwAJAAYAAAAA
AAAAAAAAAAADAA8ACgACAAAAAAAAAAAAAABAAwAPAAoAAwAAAAAAAAAAgEnDQAMADwAKAAQAAAAA
j8L1KNwAkEADAA8ACgAFAAAAALjM6bKYKEFAAwAPAAoABgAAAAAAAAAAAAAAAAQADAAAAAAAAAAA
BERhdGUEAAwAAAABAAAAAARUaW1lBAANAAAAAgAAAAAFRGF0YTEEAA0AAAADAAAAAAVEYXRhMgQA
DQAAAAQAAAAABURhdGEzBAANAAAABQAAAAAFRGF0YTQEAA0AAAAGAAAAAAVEYXRhNQQADQAAAAcA
AAAABURhdGE2BAANAAAACAAAAAAFRGF0YTcEAA0AAAAJAAAAAAVEYXRhOAQAEgABAAAAAAAACjA3
LzE1LzIwMTQEABAAAQABAAAAAAgxNDozMDowNAQAEgACAAAAAAAACjA3LzE1LzIwMTQEABAAAgAB
AAAAAAgxNDozMTowNAQAEgADAAAAAAAACjA3LzE1LzIwMTQEABAAAwABAAAAAAgxNDozMjowNAQA
EgAEAAAAAAAACjA3LzE1LzIwMTQEABAABAABAAAAAAgxNDozMzowNAQAEgAFAAAAAAAACjA3LzE1
LzIwMTQEABAABQABAAAAAAgxNDozNDowNAQAEgAGAAAAAAAACjA3LzE1LzIwMTQEABAABgABAAAA
AAgxNDozNTowNAQAEgAHAAAAAAAACjA3LzE1LzIwMTQEABAABwABAAAAAAgxNDozNjowNAQAEgAI
AAAAAAAACjA3LzE1LzIwMTQEABAACAABAAAAAAgxNDozNzowNAQAEgAJAAAAAAAACjA3LzE1LzIw
MTQEABAACQABAAAAAAgxNDozODowNAQAEgAKAAAAAAAACjA3LzE1LzIwMTQEABAACgABAAAAAAgx
NDozOTowNAoAAAA=
@daywiss Thanks for sharing! This is a BIFF 2.0 file (the format of Excel 2.0 from 1987). Out of curiosity, how did you generate this file? I ask because the latest versions of Excel don't even allow you to save in the format. Just opening and trying to save the file triggers an error:
@SheetJSDev It's generated by software that is used to monitor communications from Programmable Logic Controllers (PLCs). Unfortunately I don't have a solution which can replace this software, and therefore have to deal with this Excel format. Just FYI, it will open in the LibreOffice Calc program (an Excel clone).
I have several sublevels which I need to write to with a stream. I was hoping not to have to create a write stream for each sublevel, but a write stream on the root database does not get the data into the sublevel when using the prefix option like this:
The reason I need to do this is that I need to respect backpressure from leveldb, as I might be overflowing the buffer when doing a lot of inserts and causing the db to crash.