fangq / jsonlab

JSONLab: compact, portable, robust JSON/binary-JSON encoder/decoder for MATLAB/Octave
http://iso2mesh.sf.net/jsonlab
BSD 3-Clause "New" or "Revised" License

Speedup parsing large JSON data files #9

Closed tik0 closed 7 years ago

tik0 commented 9 years ago

Dear fangq,

with just a few tweaks, we significantly sped up the parsing of our big JSON data sets (>> 300 MB). Do you have some test data sets with which we can evaluate our improvements? If it works well with yours, we would like to contribute.

Greetings, tik0

fangq commented 8 years ago

@tik0 sorry for the delay, I was in the process of moving to a new office; my work computers were down for some time.

here is a sample JSON data file, shared by one of the users. The file is about 2MB in size, containing about 150k small unstructured objects.

http://kwafoo.coe.neu.edu/temp/419138.json.gz

please try your speedup approach and see how much it outperforms my latest code. If it works better, I would be happy to help you incorporate your changes into jsonlab, and I appreciate your contribution in advance.

okomarov commented 8 years ago

Removing globals and as many regexp() as possible will help a lot. There is one regexp in particular that takes ages.

[profiler screenshot: the slow regexp call]

20 whole seconds are wasted in creating valid_fields. That is something that can be done much more efficiently.

[profiler screenshot: time spent creating valid_fields]
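
For anyone reproducing such measurements, the standard Matlab profiler gives this per-line breakdown (here using the benchmark file linked above):

profile on
dd = loadjson('419138.json');   % the gunzip'ed benchmark file from above
profile viewer                  % drill down to the regexprep line in the field-name handling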

okomarov commented 8 years ago

@fangq Why so much complexity for creating valid Matlab field names? For instance, when does the need for unicode2native() arise? Why not simply use the same format for Octave and Matlab?

I have simplified that part and basically shaved the whole 20 seconds off that subfunction, but maybe I'm missing something.

fangq commented 8 years ago

@okomarov thanks for looking into this

Why not simply use the same format for Octave and Matlab?

unicode2native does not exist in Octave. I called unicode2native in loadjson and native2unicode in savejson because I wanted to maintain a round-trip translation between a unicode string and a hex string. I believe translating it byte-by-byte instead of character-by-character would lose the integrity of the string.

However, this feature has not been well tested. The only example I have is this script

https://github.com/fangq/jsonlab/blob/master/examples/demo_jsonlab_basic.m#L159

okomarov commented 8 years ago

As a side note, to get proper conversion, you need to specify the encoding:

unicode2native('绝密')
ans =
   26   26

This happens because my default character set does not cover Chinese ideograms:

feature('DefaultCharacterSet')
ans =
windows-1252

If you use 'UTF-8', you get the expected result:

native2unicode(unicode2native('绝密','UTF-8'),'UTF-8')

However, the Chinese ideograms are in this case 3 bytes long under UTF-8:

sprintf('%X',unicode2native('绝密','UTF-8'))
ans =
E7BB9DE5AF86

fangq commented 8 years ago

@okomarov yes, I was aware of the dependency on DefaultCharacterSet; see Item #2 in the Known Issues and TODOs section of the README:

http://iso2mesh.sourceforge.net/cgi-bin/index.cgi?jsonlab/Doc/README#Known_Issues_and_TODOs

The JSON spec requires that all strings be Unicode strings. I am not sure if I should force UTF-8, as I assume other Unicode encodings may also be valid.

Frankly, I am surprised by the overhead of regexprep, because essentially nothing was changed in that line. If we can convert a unicode string to a hex key without losing information, perhaps we can somehow convert it back to unicode in loadjson. I just haven't found the right combination yet.
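
For what it's worth, one combination that does round-trip (a minimal sketch assuming UTF-8 on both ends; Matlab-only, since unicode2native is absent in Octave):

bytes = unicode2native('绝密', 'UTF-8');                       % unicode -> bytes
hexkey = sprintf('%02X', bytes);                               % bytes -> 'E7BB9DE5AF86'
str = native2unicode(uint8(sscanf(hexkey, '%2x'))', 'UTF-8');  % hex -> '绝密' again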

okomarov commented 8 years ago

When you said translating byte-by-byte, did you mean this?

double('绝密')
ans =
       32477       23494
>> char(double('绝密'))
ans =
绝密

fangq commented 8 years ago

the main goal of validfield() is to convert any JSON "name" field (which can contain multi-byte unicode characters) into a valid matlab variable name (a string containing only [0-9a-zA-Z]). Since you have identified the hot spot in the unicode-handling line of this function, I was debating whether there is an alternative that achieves a fast conversion without losing information.
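
For illustration, a scheme in that spirit could hex-encode only the offending characters (a hypothetical sketch, not jsonlab's actual validfield(); it also ignores the collision case where a name itself contains a literal '0x'):

function vname = validfield_sketch(name)
% Keep [0-9a-zA-Z]; hex-encode every other character from its UTF-16
% code unit, so plain ASCII names pass through unchanged.
keep = ismember(name, ['0':'9', 'A':'Z', 'a':'z']);
parts = cell(1, numel(name));
for k = 1:numel(name)
    if keep(k)
        parts{k} = name(k);
    else
        parts{k} = sprintf('0x%X', double(name(k)));
    end
end
vname = [parts{:}];
if isempty(vname) || ~isletter(vname(1))
    vname = ['x' vname];   % Matlab field names must start with a letter
end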

okomarov commented 8 years ago

double() should not lose info. Matlab uses UTF-16 to encode char, so any conversion through double should be fine. I'll test the double conversion (and dec2hex) and see the speedup. If acceptable, I'll submit a PR.
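
For example, the double/dec2hex pair gives one fixed-width hex key per UTF-16 code unit:

>> dec2hex(double('绝密'))
ans =
7EDD
5BC6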

fangq commented 8 years ago

@okomarov I was hoping that conversion is only needed for multi-byte unicode characters; for ASCII, I'd like the names to stay the same.

jerlich commented 8 years ago

@okomarov how did the test go? I really like the features of jsonlab, but I am concerned about the speed. Would it be worth rewriting this in C?

okomarov commented 8 years ago

@jerlich If you have R2016a, matlab.internal.webservices.toJSON() and matlab.internal.webservices.fromJSON() are already mex-ed.

jerlich commented 8 years ago

@okomarov I have tried christianpanton/matlab-json and the speed-up is massive (10-50x). I got it compiled on amd64, but haven't managed to get it compiled for maci64. Are you using the built-in matlab service now? I could push my team to upgrade to 2016a. I guess that is a good feature.

tik0 commented 8 years ago

@fangq I was a bit lost ;). I just forked your repo and added a simple wrapper script for parallel parsing, simply by pre-processing the input. With this commit https://github.com/tik0/jsonlab/commit/67aaaf9f425ca05c27a54fc34d45d083669565ae I've added a new script called loadjsonpar.m, which first separates all the objects and then parses them with your parser. When just executing your script on my files, the runtime grows exponentially, while with mine it clearly grows linearly with the number of objects. The evaluation can be done on your PC using example/evaluation.m.

All my JSON files contain a bunch of objects concatenated like this: {object1}{object2}{...}.... They are in fact RFC 4627 compliant, so the wrapper script is as well. It would be nice to have such an extension in your jsonlab, because it would ease some people's waiting ;). If you need help, just ask!
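
The splitting idea, in rough form (a hypothetical sketch rather than the actual loadjsonpar.m; it assumes braces never occur inside JSON strings):

jsonFile = 'mydata.json';                      % placeholder: a file of concatenated objects
str = fileread(jsonFile);                      % the concatenated {...}{...} stream
depth = cumsum((str == '{') - (str == '}'));   % brace nesting level at each char
stops = find(str == '}' & depth == 0);         % closing brace of each top-level object
starts = [1, stops(1:end-1) + 1];
data = cell(1, numel(stops));
parfor i = 1:numel(stops)                      % a plain for-loop works without a pool
    data{i} = loadjson(str(starts(i):stops(i)));
end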

Here is some evaluation data from my old Core i7 4700:

ST: Single thread using loadjsonpar
MT: Multi thread with 2 parallel workers using loadjsonpar
CO: Common execution using loadjson

Parsing 10^2 objects:
ST: 0.9076 seconds
MT: 0.9941 seconds
CO: 0.8861 seconds

Parsing 10^3 objects:
ST: 8.5400 seconds
MT: 6.4673 seconds
CO: 9.6118 seconds

Parsing 10^4 objects:
ST:  91.8963 seconds
MT:  58.0130 seconds
CO: 270.1061 seconds

fangq commented 8 years ago

@tik0, looks like an interesting idea

I am curious what you meant by "CO: Common execution using loadjson". Why is CO significantly slower than ST for 10^4 objects?

tik0 commented 8 years ago

@fangq if you have a look at the simple script, the "CO" part is just the execution of the standard command "loadjson(jsonFile, parsingOptions)". Actually, I am a bit confused as well, but I think it is because many operations process the variable "inStr", which holds the whole JSON file; this can be inefficient for large files. On the other hand, I don't think it is due to the JIT, given the recursive parsing (but who knows what the Matlab magic does there).
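
A toy illustration of that suspicion: slicing a Matlab string copies it, so every per-object operation on the full inStr costs time proportional to the file size (hypothetical micro-benchmark):

str = repmat('x', 1, 1e7);   % stand-in for a ~10 MB inStr
tic
for k = 1:100
    s = str(k:end);          % each slice copies nearly the whole string
end
toc                          % scales with numel(str), not with useful work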

fangq commented 8 years ago

do you mind attaching your 10^4 test data file so I can profile loadjson?

tik0 commented 8 years ago

Just have a look at the commit https://github.com/tik0/jsonlab/commit/67aaaf9f425ca05c27a54fc34d45d083669565ae. It is the file examples/10000.json.

fangq commented 7 years ago

thanks everyone for the useful comments. I am working on a new release of jsonlab and thought it would be a great idea to accelerate the code and close this issue.

in the latest commit (https://github.com/fangq/jsonlab/commit/8a26d68776a9e65867ee4f5b93e030daeec64066), I made two changes, both of which were discussed previously -

the results of these changes yielded an over 2-fold speedup for the previously included test dataset (http://kwafoo.coe.neu.edu/temp/419138.json.gz). Here are the timing outputs when running the benchmark on a new desktop (i7-6770k + DDR4 memory):

%%% old loadjson %%%%
>> tic; dd=loadjson('419138.json');toc
Elapsed time is 27.633101 seconds.

%%% updated loadjson %%%%
>> tic; dd=loadjson('419138.json');toc
Elapsed time is 12.351393 seconds.

%%% matlab built-in JSON parser %%%%
>> tic;ss=matlab.internal.webservices.fromJSON(fileread('419138.json'));toc
Elapsed time is 15.474570 seconds.

the optimized loadjson turns out to be 20% faster than the hidden builtin fromJSON function for this benchmark, which I am quite happy about.

if you are interested, please check out the latest version, try it on your data, and see if there is any improvement.

okomarov commented 7 years ago

As a side note, I think that since R2016a, matlab has an official JSON encoder and decoder which are mexed.


fangq commented 7 years ago

@okomarov, even if that's true, I can still see plenty of reasons to continue investing time in improving this toolbox. It works not only in matlab but also in octave, and can be helpful for open-source users. It is already distributed by some distros:

https://admin.fedoraproject.org/pkgdb/package/rpms/octave-jsonlab/

also, the UBJSON support is unique to jsonlab.

I did try jsonencode/jsondecode on my laptop running matlab 2016b. jsonencode is lightning fast; however, it currently does not support complex or sparse arrays, which can be a headache for some users. I also ran the benchmark json file with jsondecode; it is about 20% slower than the latest loadjson, similar to matlab.internal.webservices.fromJSON.
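
To make that limitation concrete (a sketch; exact error text varies by release, and the annotation names come from jsonlab's own data format):

jsonencode(1 + 2i)       % errors: complex values are not supported
jsonencode(speye(3))     % likewise for sparse input
savejson('x', 1 + 2i)    % jsonlab encodes it via _ArrayIsComplex_/_ArrayData_
savejson('s', speye(3))  % and sparse matrices via _ArrayIsSparse_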