bpsm / edn-java

a reader for extensible data notation
Eclipse Public License 1.0
100 stars 24 forks source link

Simplify the Printer interface by removing the close() operation. #16

Closed mikera closed 10 years ago

mikera commented 12 years ago

This is not needed in general, and where it is needed then the caller should take care of closing the underlying Writer (see the CustomTagPrinter test for an example of this usage)

bpsm commented 12 years ago

Today, while updating a client to use 0.2.0-SNAPSHOT, I found having close() to be quite handy. When the Parser and Printer have close methods, I don't need to hang on to a separate reference to the Reader they hold in order to clean things up. That means that things like this work:

Parser p = Parsers.newParser(cfg, createAReader());
... later
p.close();

This kind of thing works:

Parser p = getParserForUrl(url);
...
p.close();

Without a close method, I'm forced to do this long-hand. I'm stuck with a code pattern I can't encapsulate in a method. In short: boilerplate:

Reader r = makeAReaderFromUrl(url);
Parser p = newParser(cfg, r);
...
r.close();

This is also why I've tried to be careful in Parser, Printer and Scanner to close the subordinate resource in case of an exception during construction or use: to allow this kind of composition safely without incurring resource leaks.

Do you see any holes in that analysis?

mikera commented 12 years ago

I certainly see your point

Though ultimately I'd expect the caller to write their own method that encapsulates the entire transaction and include other stuff like logging, validation, URL construction etc. e.g.

someObject.readEdnFromService("http://foo.com/some/service");   // does everything inside this method.

So any "boilerplate" would get hidden in this sort of method, which is what I would generally expect whenever you are doing non-trivial IO.

Just for comparison, this is how Kryo does it:

Kryo kryo = new Kryo();
// ...
Output output = new Output(new FileOutputStream("file.bin"));
SomeClass someObject = ...
kryo.writeObject(output, someObject);
output.close();
 // ...
Input input = new Input(new FileInputStream("file.bin")); 
SomeClass someObject = kryo.readObject(input, SomeClass.class);
input.close();

Note the following features that I quite like:

Ultimately I think the issue we have here in the edn-java API design is that we are complecting the parser with the job of wrapping IO, which I think are fundamentally different things and should be separated. The presence of the close() operation on our Parser is just a symptom of this.

How about something like this:

Parser p=Parsers.newParser(cfg);
Reader r = makeAReaderFromUrl(url);
EdnReader r=new EdnReader(p,r);
....
r.close();

This way the Parser logic gets separated from the Reader / IO wrapping implementation

Even better, with a shared kryo-style config object and some utility methods we can hide the actual parser object and do:

Config endCfg = .....

EdnReader r= new EdnReader(ednCfg,url);
....
r.close();

EdnStringBuilder esb=new EdnStringBuilder(ednCfg);
....
String ednString = esb.toString();

EdnWriter w = new EdnWriter(ednCfg, someWriter);
....
w.close();

What do you think?

I know we might need to do some refactoring, but I think it is really important to get the API design right early on so worth spending some time / brainpower on this.....

mikera commented 12 years ago

Hmmm with a bit of further thought I'm now even more convinced that it would be a good idea to separate out Parser from EdnReader because it means that we can make Parsers stateless. This would have a bunch of extra advantages:

bpsm commented 12 years ago

At some level, parsing is inherently stateful because the underlying source of characters is inherently stateful: we want to be sure that we read all the characters exactly once and in the the correct order.

I intend Parser/Scanner light-weight single-use objects. If you want to parse from more than one source of input -- wether on the same thread, or from multiple threads, use multiple Parsers. Using the same parser from multiple threads makes as little sense as trying to parse from the same source of characters from two different threads. Talk about race conditions.

Nevertheless, I've been hacking on a branch which aims to push all the unavoidable state down into a PushbackReader, while moving the transient state into arguments passed between the methods of Parser and Scanner. I've succeeded in making the Parser stateless, except for it reference to the scanner. I'm currently working on making Scanner stateless. Once both are done, one could theoretically replace Parser and Scanner with a set of static methods such that Parser.nextValue(PushbackReader) would do the right thing. (This is basically how Clojure's own reader works.)

I've not decided wether this is a good route or not yet, but I thought I'd try it out and see how it feels. In the case of the parser, it's a mixed bag: some things are simpler/shorter but other things are more difficult to grok. We'll see how things wind up with the Scanner.

I'll push it up to a topic branch on github when it's reached a useful state.

mikera commented 12 years ago

Cool - think this is the right general direction (but then again, I'm a fan of the Haskell-style "parser as a pure function" approach)

You might find that stateless Parser / stateful Scanner is actually a reasonable compromise. I'm sure I've seen that approach work well somewhere else though can't remember exactly where.

Let me know if you need any help hacking on it!

bpsm commented 12 years ago

I've pushed the branch topic/stateless-parser. Please have a look and let me know what you think.

bpsm commented 12 years ago

A note on performance.

Switching to PushbackReader seems to have cost considerable performance, at least in the scenario I'm measuring. I've got a 5.5 million character long input consisting vectors of small maps with keyword keys and integer, boolean, string, BigDecimal and TaggedValue values. I parse it repeatedly for 20 seconds and then compute the parsing rate in terms of millions of characters per second. Here's the results on my 2010 MBA on the revisions of topic/stateless-parser:

|---------+-----------+---------------------------------------------------|
| commit  |  M*char/s |                                                   |
|---------+-----------+---------------------------------------------------|
| 0b41577 | 18.793317 | replace PushbackReader with Parseable             |
| 7151592 | 11.749747 | small refactorings in scanner                     |
| 699c4e2 | 11.326853 | eliminate need for pushback > 1                   |
| 249daf0 | 11.478621 | inline parseFloat, parseInteger                   |
| 2e5c8e7 | 11.276815 | arrange for pbr to begin with curr                |
| dc0e0a1 | 12.349021 | remove mutable state from Parser+Scanner          |
| 1a66a13 | 11.244196 | Confine mutable state to pushbackReader           |
| 5aa49c1 | 10.211895 | first step toward PushbackReader                  |
| 39d4958 | 15.155932 | ParserImpl free of mutable state (except scanner) |
| 7c17c45 | 16.296374 | make Scanner faster by caching current char       |
| cdeb66d | 11.659712 | ...                                               |
|---------+-----------+---------------------------------------------------|

"caching current char" was clearly a performance win; switching to PBR, not so much. Eliminating current char as an explicit argument to parse procedures (2e5c8e7) cost additional performance, which even "small refactorings in scanner", which reintroduced the current char parameter in some cases, could not regain.

Replacing PushbackReader with Parseable (0b41577), seems to gain us considerable performance at least when parsing from a String in memory.

mikera commented 12 years ago

Hi Ben,

I like the stateless parser aspect of the new branch - looks good and clean.

I'm marginally less keen on the stateless scanner / use of PushbackReader and Parseable. I think I liked your old stateful scanner better, this gives a few advantages:

Conceptually I would think of it like this:

mikera commented 12 years ago

See this commit for what I had in mind:

https://github.com/mikera/edn-java/commit/8dbd4105a6d1b87ee71fcd5b8107d6505c3414a9

bpsm commented 12 years ago

It doesn't make sense to me to go half way here.

Either the Parser is a stateful single-use object, as it is now on master, or it is immutable, as currently on the branch.

I think that there is little to be gained by having the top half of the Parser be stateless, while keeping it's bottom half stateful (the Scanner). You'll note that the Scanner is not public. I consider it an implementation detail of the Parser and don't have a great desire to make it part of the documented public interface of edn-java. (But see below.)

Having successfully separated the mutable state necessary to provide LL(1) down into the PushbackReader (or Parseable), I see even less to be gained by complecting this back into the Scanner.

As to the performance aspects, I'm still inspecting that. I'll want to compare the performance of your branch, that of using Parseable and that of master when reading from files as well as from in-memory before I commit to a decision.

Currently, though, I'm leaning toward merging topic/stateless-parser back to master.

Over on issue #17 I mentioned being open to providing hooks to support you in building what you need on top of edn-java. If you thought building your own parser on top of Scanner would help you, I'd consider that a good enough reason to make the Scanner (and the Token enum) public.

mikera commented 12 years ago

I certainly agree with you on all the important points:

The main hesitation I have about Parseable is that it forces a character-by-character reading/scanning approach into the public API. I'm not fully knowledgeable about all the possible ways of implementing a parser+scanner combinations but it seems to me that there might be alternatives / more efficient options for scanning. If we want to leave the door open for different scanner implementations in the future, then my "stateful scanner" approach is just one way of doing that.

Regarding building custom parser implementations: I don't think I really need to know anything about the Scanner implementation as long as I can inherit from a basic parser implementation (possibly an abstract base class) and as long as there is something like a Parser.nextToken(Parseable) method that I can call which in turn calls the embedded Scanner.

Would you like an AbstractParser patch that does this (I would probably suggest making the concrete class ParserImpl extend this also....)?

One the benchmark stuff, I'd suggest making a simple performance test program in the src/test/java source directory with a main class that we can run for a mini performance suite. Again, want a patch for this?

bpsm commented 12 years ago

A patch for a performance test would be appreciated.

bpsm commented 12 years ago

I've merged an expanded performance-test branch to master. It's still rough. See #19.

bpsm commented 12 years ago

43a7e1a5c29d3da73a7078726027eab68eb5eac5 merges the stateless parser to master. This removes close() from the Parser. Discussion relating to this should take place on a new issue.

bpsm commented 12 years ago

close() on the Printer is currently still present. The printer is still statefull. I have yet to investigate pushing the Printer's state (the Writer and softspace) down into an object that can be passed as a parameter to the Printer. Would that be a win?