cew821 / greenbutton

A Green Button data parser written in Ruby
MIT License

Optimization for large files #4

Open erichulburd opened 10 years ago

erichulburd commented 10 years ago

I tried parsing a large file (6 MB) and it still hadn't finished after 5 minutes. A few raw Nokogiri searches on the same file were also slow, though on the order of tens of seconds rather than tens of minutes.

Not sure where to start with this. I've seen conflicting claims about whether XPath search (http://nicksda.apotomo.de/2013/01/nokogiris-xpath-search-is-faster/) or tree walking (http://one.valeski.org/2009/10/nokogiri-performance-xpath-vs-tree.html) is faster. I'll have to run some tests.

Before I even do that, I'm going to try to move as much of the initialization code as possible into methods.

Also, as far as the entry_xml variable goes, it looks like it actually stores an XPath, not the result, so every reference searches the entire XML document to find the node again. Would it speed things up to remove the node from the DOM once the greenButtonEntry is initialized and store it as a string in the greenButtonEntry.entry_xml variable?
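Roughly something like this (an untested sketch; green_button_entry here just stands in for however the entry object actually gets built):

```ruby
require 'nokogiri'

doc = Nokogiri::XML(File.read('file.xml'))

# Grab an <entry>, keep its serialized XML on the object, then detach
# the node so later DOM searches have less tree to walk through.
entry_node = doc.at_xpath('//atom:entry', 'atom' => 'http://www.w3.org/2005/Atom')
green_button_entry.entry_xml = entry_node.to_xml
entry_node.unlink # Nokogiri's "remove this node from the document"
```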

cew821 commented 10 years ago

Thanks for reporting. Does it hang on the "load" command? i.e.

gb = GreenButton.load_xml_from_file("file.xml")? Or does it only hang once you start interacting with the parser objects?

I am going to investigate an alternative parsing method proposed by one of my colleagues @jateeter. This method instantiates Ruby objects for all of the entries in the file first, then goes back in a second pass to assign the object relationships, using a dictionary that maps each entry's related hrefs to the object created from that entry.

The theory is that searching the whole DOM for related objects by XPath on each IntervalReading is potentially expensive (but I haven't actually benchmarked it). To investigate, you could try replacing the lookups for related links with a hardcoded value and see if that changes the behavior.
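Roughly what I have in mind (just a sketch; GreenButtonEntry, related_hrefs, and add_related are placeholders here, not the current API):

```ruby
require 'nokogiri'

ATOM = { 'atom' => 'http://www.w3.org/2005/Atom' }
doc = Nokogiri::XML(File.read('file.xml'))

# Pass 1: build one object per <entry> and index each by its "self" href.
index = {}
entries = doc.xpath('//atom:entry', ATOM).map do |node|
  entry = GreenButtonEntry.new(node) # placeholder class
  self_href = node.at_xpath("atom:link[@rel='self']/@href", ATOM)
  index[self_href.value] = entry if self_href
  entry
end

# Pass 2: resolve rel="up" / rel="related" hrefs against the index
# instead of re-searching the whole DOM for every relationship.
entries.each do |entry|
  entry.related_hrefs.each do |href| # placeholder accessor
    related = index[href]
    entry.add_related(related) if related # placeholder method
  end
end
```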

jateeter commented 10 years ago

a bit of refinement on the alternative Charles mentions - think of it in garbage-collection terms. As Charles describes it, you have a mark-sweep collector, where the mark phase is the instantiation of the Resource from each <entry>, and the sweep phase is the fix-up of the associations based upon the rel="up" and rel="related" patterns (using HashMaps, for example).

You can also think of it as an incremental garbage collector to minimize the memory footprint of the process, but that point of view is only useful if you 1) have persistent storage (a DB) that lets you take advantage of keeping the mark collection small, and/or 2) want to keep the pattern-matching hashmap minimal in size.

jt


erichulburd commented 10 years ago

@cew821 yup, it hangs on the loading. Super slow. What @jateeter mentioned is exactly what I had in mind. I hadn't put much thought into it at first because I just wanted to get the class structure down, but the way I've done it creates a lot of upfront processing and memory demand - for info the user may not even want.

Does it make sense to either of you to remove the XML <entry/> nodes from the DOM and store them in the objects as strings once they are instantiated? That might make subsequent DOM searches faster, as well as speed up lookups within a particular entry.

Let me know as you make changes. Don't want to duplicate the effort.

erichulburd commented 10 years ago

I've had to parse a few bigger files for a project I'm working on, so I created a branch - deferred_parsing. Basically, it removes the entry XML from the DOM and doesn't parse it until the user asks for an attribute. This sped things up considerably, though it still seems quite slow.
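The gist is something like this (a boiled-down sketch of the idea, not the exact branch code):

```ruby
require 'nokogiri'

class GreenButtonEntry
  def initialize(entry_node)
    # Keep only the serialized XML and detach the node from the big DOM;
    # nothing gets re-parsed until an attribute is actually requested.
    @entry_xml = entry_node.to_xml
    entry_node.unlink
  end

  # Example attribute reader (illustrative; the real attributes differ).
  def title
    parsed.at_xpath('//atom:title', 'atom' => 'http://www.w3.org/2005/Atom').text
  end

  private

  # Lazily re-parse the stored string and memoize the small document.
  def parsed
    @parsed ||= Nokogiri::XML(@entry_xml)
  end
end
```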

A co-worker shared some XML parsing snippets for Postgres with me, and they run really fast. I may work on incorporating some SQL statements in the next few weeks.
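For a flavor of what that could look like (hedged: gb_files and its schema are made up for illustration), Postgres's built-in xpath() function can pull entries out of an xml column:

```ruby
require 'pg'

conn = PG.connect(dbname: 'greenbutton_dev') # hypothetical database

# Assumes the raw feed was loaded into an xml column beforehand, e.g.:
#   CREATE TABLE gb_files (id serial PRIMARY KEY, data xml);
# xpath() returns an xml[]; unnest() yields one row per <entry>.
result = conn.exec(<<-SQL)
  SELECT unnest(xpath(
    '//atom:entry',
    data,
    ARRAY[ARRAY['atom', 'http://www.w3.org/2005/Atom']]
  )) AS entry
  FROM gb_files
SQL

puts result.ntuples # number of entries extracted
```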

Also, @cew821 - now that I've been able to parse the historical file from SDGE, I noticed that they've got some flaws in their markup. In the file I sent you, every UsagePoint and MeterReading has the same "self" href - so you actually can't link the meter_readings to their UsagePoints. As a result, all 396 meter_readings are appended to the first UsagePoint, and the other 395 UsagePoints have no meter_readings.

That's annoying, and it's not the case in their daily reporting files. I've let them know and am hoping they fix it, but we'll see.

AndrewJo commented 9 years ago

@Arbolista, I don't know if you solved the performance issue with deferred_parsing yet, but I've been working on a Ruby gem that can parse a 6 MB XML Atom feed of IntervalBlocks in just over 3 seconds (vs. 5 minutes). It uses an event-driven SAX (Simple API for XML) parsing method which, instead of building the entire DOM in memory (like Nokogiri does by default), parses pieces of XML sequentially.
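The core pattern is roughly this (a stripped-down sketch, not the gem's actual classes):

```ruby
require 'nokogiri'

# SAX handler that collects IntervalReading values without building a DOM.
class IntervalReadingHandler < Nokogiri::XML::SAX::Document
  attr_reader :values

  def initialize
    @values = []
    @in_reading = false
    @in_value = false
    @buffer = ''
  end

  def start_element(name, attrs = [])
    local = name.split(':').last # tolerate namespace prefixes
    @in_reading = true if local == 'IntervalReading'
    @in_value = @in_reading && local == 'value'
  end

  def characters(string)
    @buffer << string if @in_value
  end

  def end_element(name)
    case name.split(':').last
    when 'value'
      @values << @buffer.to_i if @in_reading
      @buffer = ''
      @in_value = false
    when 'IntervalReading'
      @in_reading = false
    end
  end
end

handler = IntervalReadingHandler.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('file.xml'))
puts "parsed #{handler.values.size} interval readings"
```

Memory stays mostly flat during the parse itself because only the handler's accumulated values grow, not a DOM tree.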

Benchmarking one year of 5-minute interval data (~25 MB) takes just under 15 seconds on a MacBook Pro, without the Ruby process running into an out-of-memory exception.

The gem is heavily optimized for the vanilla OpenESPI and PG&E Green Button formats, but it should be trivial to work around SDGE's format quirks (if they haven't fixed them already). It also includes an API client for accessing data directly from RESTful API endpoints.

jateeter commented 9 years ago

+1 on the need for SAX. The Java code base uses it heavily; it's really necessary for ensuring multiple passes through the DOM are not needed. In the Java code base, I also had problems with memory pressure from data that had been parsed but not yet moved to persistent storage. Flushes at clean points were necessary.

Let me know, if you like, as these (Ruby/Python/etc.) libs move forward, and we'll make sure they are listed where people can easily find them (www.greenbuttondata.org).

AndrewJo commented 9 years ago

@jateeter, I haven't looked at the Java codebase, but after monitoring the Ruby process that performs the parsing in my gem, the memory usage appears to grow in proportion to the file size, which is expected: the SAX parsing itself streams, but the resulting Ruby objects are all retained. On the 25 MB test file from greenbuttondata.org, it uses up to ~130 MB on my machine running MRI Ruby 2.0 by the time the benchmark finishes. It's not great, but I figure one year of 5-minute interval data is extremely granular in most cases anyway.

I released the gem from my company under a BSD license to help the community, so listing it at www.greenbuttondata.org should help folks who are looking for a SAX parser in the Ruby ecosystem.

cew821 commented 9 years ago

cc @doomspork -- want to make sure he saw this. I added @doomspork as a gem collaborator a while back, but it doesn't look like he's pushed any changes. In any event, since I don't work much on GreenButton these days, it would be great for @Arbolista, @AndrewJo, and @doomspork to at least know what the others are up to.

doomspork commented 9 years ago

Howdy @cew821! I don't think I was aware that you had finally added me, so I worked on my own fork. Good to know though, thanks.

SAX support was/is on my list of things to tackle, so I'll definitely check out @AndrewJo's gem.

Thanks for the heads up.