cstroe / svndumpapi

A Java library for manipulating a Subversion dump file.
GNU Affero General Public License v3.0

SvnDumpFileParser.readByteArray() is very slow #3

Closed. cstroe closed this issue 9 years ago

cstroe commented 9 years ago

Reading large files that are part of an SvnNode takes a disproportionately long time. We currently read the file content of an SvnNode with the readByteArray() method. This is slow, and I don't yet know why.

We need to fix this so that reading streams doesn't take such a long time.
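
A hedged sketch of what that read loop amounts to is below. The method shape and parameter names are illustrative, not the actual readByteArray() implementation; the point is that the content comes out of the parser's SimpleCharStream one character per call.

```java
// Illustrative sketch only -- not the actual readByteArray() implementation.
// The generated SimpleCharStream hands back one character per readChar() call,
// so a node whose Text-content-length is several MiB turns into millions of
// calls, each going through the stream's internal buffering.
private byte[] readByteArray(SimpleCharStream stream, int contentLength) throws java.io.IOException {
    byte[] content = new byte[contentLength];
    for (int i = 0; i < contentLength; i++) {
        content[i] = (byte) stream.readChar(); // one call per byte of file content
    }
    return content;
}
```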

cstroe commented 9 years ago

Something strange is going on. Looking at this:

Reading 28.7 MiB ... done in 142.3 s                                            
Reading 28.5 MiB ... done in 411.0 ms 

That's in the same revision.

Another egregious example:

Reading 59.2 MiB ... done in 2472 s

This seems to be the major cause of slowdowns in reading streams from STDIN.

cstroe commented 9 years ago

It looks like we are spending a ton of time in ExpandBuff:

[profiling screenshot]

Perhaps I'm using readChar() in a way that it's not meant to be used. It doesn't look like it's made for reading many characters at a time, because ExpandBuff grows the buffer by only 2048 characters each time and copies the whole buffer on every expansion. When reading a file that is megabytes long, the buffer is expanded thousands of times, and each expansion is costly.
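
To make the cost concrete, here is a minimal model of that growth pattern (simplified, not the actual generated code): the buffer grows by a fixed 2048 characters and is fully copied on every expansion, so an n-character token costs on the order of n^2/2048 character copies.

```java
// Minimal model of the fixed-increment growth in the generated
// SimpleCharStream.ExpandBuff (simplified, not the actual generated code).
// An n-char token triggers ~n/2048 expansions, each copying O(n) chars,
// so the total copying work is roughly O(n^2).
char[] buffer = new char[4096];
int bufsize = 4096;

void expandBuff() {
    char[] newbuffer = new char[bufsize + 2048];          // fixed +2048 increment
    System.arraycopy(buffer, 0, newbuffer, 0, bufsize);   // full copy every time
    buffer = newbuffer;
    bufsize += 2048;
}
```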

cstroe commented 9 years ago

It seems that someone else has seen this problem with JavaCC's SimpleCharStream:

http://markmail.org/message/zko7diftsjdxvoqd

Subject:    Re: [JavaCC] Performance issue when consuming large token in JavaCC
From:   Sreenivas Viswanadha (sre...@viswanadha.net)
Date:   Feb 17, 2006 8:10:55 am
List:   net.java.dev.javacc.users

One option is to increase the memory setting for the VM using -Xmx256m or 
something.

If this is a delimited token - like comments in Java - and if you don't 
need the actual image, then you can use lexical states and skip the 
token text completely and simply return the token kind when you see the 
end marker.

Yet another option would be to rewrite the generated SimpleCharStream 
class to maybe use RandomAccessFile instead of the circular buffer that 
it uses.

> Hi, I hope this is the correct place to post this message.
> 
> I am writing a parser to parse large files using Javacc. Some of the
> tokens can be as big as 3M. I found that once the token size becomes
> close to 1M, the parser becomes extremely slow to consume that token.
> Could anybody tell me how I should tune the parser for large tokens? Thanks!
> 
> David

cstroe commented 9 years ago

It seems the Blazegraph project has hit this problem as well: https://jira.blazegraph.com/browse/BLZG-478

They also mention that Lucene implemented a FastCharStream.

cstroe commented 9 years ago

Changed svndump.jj to use Lucene's FastCharStream in 656cfbf4f425390b3e87bc15a975b5b59d6a95db. The throughput difference is tremendous: orders of magnitude faster than before.
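
For future reference, the reason the switch helps: FastCharStream grows its buffer geometrically rather than by a fixed 2048 characters, so the total copying for an n-character token is amortized O(n) instead of roughly O(n^2). A simplified sketch of that growth strategy (not the actual Lucene source):

```java
// Simplified sketch of the key idea in Lucene's FastCharStream (not the actual
// Lucene source): grow the buffer by doubling instead of a fixed +2048, so
// expansions happen only O(log n) times and total copying is amortized O(n).
char[] buffer = new char[2048];
int bufferLength = 0; // number of valid chars currently in the buffer

void growBuffer() {
    char[] newBuffer = new char[buffer.length * 2];          // doubling, not +2048
    System.arraycopy(buffer, 0, newBuffer, 0, bufferLength); // rare, amortized copies
    buffer = newBuffer;
}
```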