clj-commons / hickory

HTML as data
Other
637 stars 52 forks source link

ClojureScript support #7

Closed jeluard closed 11 years ago

jeluard commented 11 years ago

I am considering adapting hickory to have ClojureScript support.

Is this something you would be interested with?

davidsantiago commented 11 years ago

Absolutely. I don't know anything about ClojureScript and haven't ever tried it, so this would have to be something where you kind of take the lead on it, but I'd be happy to have that capability in Hickory. Probably the biggest problem would be the dependency on JSoup for HTML parsing, we'd need some equivalent for JS (perhaps there is one, I don't know what that would be). What were you thinking?

jeluard commented 11 years ago

Great!

I have not yet investigated any of those (especially with regards to badly formed html) but here are some options I was considering:

I still have to test those to see what are the tradeoffs/advantages.

jeluard commented 11 years ago

I made some good progress with the ClojureScript support. I finally chose to rely on the DOM implementation of the platform used (so it won't work on platform without DOM support). It's probably the most correct and efficient HTML parser you can find. Also it allows to parse the current HTML document (as it's using the same API) which sounds promising!

Some comments on the changes:

There are 2 points yet to clarify:

jeluard commented 11 years ago

Sorry looks like I pushed on the wrong branch again (I'm using the hub script). Not sure how to fix that now.

davidsantiago commented 11 years ago

OK, looks like good progress. Lots to unpack here. First of all, maybe the comment-after-doctype thing is not an issue; my HTML/DOM/JS expert tells me that comments are only technically supposed to be inside the HTML tag, which I did not realize before. He also tells me that "whitespace parsing is magic." In fact, I'm pretty careful to always state in the documentation that the output of Hickory is in terms of "equivalent" representations, never exact, for these sorts of reasons.

Copying html-escape out of quoin makes me unhappy. quoin exists because I maintain several of these engines now, and I was duplicating too much code. I'm really big on keeping things manageable. I'm not saying you have to port quoin (though I would be surprised if it was too far from that being easy, as it is now), more on that in a sec.

I think using a browser's DOM parser is fine, though I guess I'm a little confused as to what the use of the library is, at that point. The ways of JS developers are mysterious to me, so I will grant that a purpose remains. Dommy does it this way, and clearly they have their reasons. In this case, I don't think that you will get a doctype out of the parse; that seems fine. Nothing in Hickory requires you to be working with full documents if that is not your need/environment. That the format can do so if it is asked is enough.

However, I think mashing this code together is turning into a nightmare. Your patch essentially rewrites hickory.core with the CLJS version, when there is apparently very little code-sharing possible there. It resembles some very cross-platform C program, with stretches of code written multiple times inside various preprocessor guards. Can't we take a file like that and just have core.clj and core.cljs and not try to mash them together? When there's a large amount of conditional compilation, I think it's far preferable to hide the details in their own files and just cooperate on the interfaces if at all possible, and I think it is here.

Most of the changes to select.clj appear to be because clojurescript doesn't support refer-clojure renames, is that right? That is a real bummer, but I could live with that change. Looking at the code to the line that uses into-array, it looks like the array type is not actually used in the clojurescript version of into-array, so you could just leave out the cljs version.

In the changes to the core tests, again, I'm feeling too much conditional compilation gimmickry. If things parse differently, that is not key to what is being tested in these tests, so let's just rework the tests so that they are something less "controversial" and thus parse the same, which please tell me that's possible. In the error-handling test, did you check to see if changing clojure.lang.ExceptionInfo to just ExceptionInfo would work? Usually things in clojure.lang don't need to be fully qualified, though I guess I did that for some reason so maybe it won't work that way.

Similar comments apply for the select tests... can't we make a starts-with utility function, write the two versions, and then not stick the conditional compilation everywhere?

Finally, just wanted to mention that I've been doing some coding on hickory and have done some code reorganization that affects this patch. I think it might ultimately be a good thing, though, because the code is more fine-grained; I moved the rendering functions out of core and into their own namespace, hickory.render. Similarly, I moved some functions into a new utils namespace for easier sharing. If instead of porting quoin, you just have only partial cljs coverage and hickory.render is not available in cljs, that is alright with me; I've long said I was not comfortable with the implementation of the rendering function. Though, I did add a new namespace, hickory.convert, which converts between hiccup and hickory versions, and it needs hickory.render to do one of the conversions, so that would also not be available. But it didn't exist before, so presumably you wouldn't miss it!

I'm pushing my in-progress changes up to the develop branch shortly.

jeluard commented 11 years ago

Wow that is a big comment!

I will make changes according to those and probably take this opportunity to re-create the pull request on the correct branch. I will wait for your refactoring as it probably will help with regard to some of your comments.

Some specific answers:

maybe the comment-after-doctype thing is not an issue Then probably it would make sense to change the html sample used in the example. Note that firefox correctly parses it but using an API not supported by chrome.

Copying html-escape out of quoin Ok. I will port quoin to ClojureScript too then, if that's fine with you.

I guess I'm a little confused as to what the use of the library is, at that point You mean from ClojureScript? My specific usecase involves importing html fragments from some blogging platform (say wordpress) through their API, parse them (a post is usually provided as a single html block) and transform it to my own data format. Then eventually back from my format to their html. Much simpler to do this by manipulating maps than DOM elements.

However, I think mashing this code together is turning into a nightmare. Makes sense. Depending on your refactoring I will create distinct files for Clojure/ClojureScript.

clojurescript doesn't support refer-clojure renames, is that right? Exactly.

so let's just rework the tests so that they are something less "controversial" and thus parse the same Makes sense. I will rewrite those.

check to see if changing clojure.lang.ExceptionInfo to just ExceptionInfo No it does not but maybe some import can help. Also if you are fine with some slight changes in the ns form I can probably reduce the amount of conditionals. ClojureScript has a bunch of limitations here.

can't we make a starts-with utility function Sure will do that.

Thanks for you comment!

davidsantiago commented 11 years ago

Great, I think that's gonna turn out well. I pushed my refactor up on the develop branch, so you should be able to take a look any time. I think it's basically done, but I think it'd make sense to wait for your stuff for 0.5.0.

Porting quoin is fine. Like I said, I think it shouldn't be too hard. There's only about 7 functions, and only two of them use anything other than pure clojure.core functions, and one of them you already did.

To be clearer, I think separate clj/cljs file for hickory.core is appropriate. For other namespaces, with only minor changes, I'd say just go for a cljx. And if you see some way to change the ns forms to reduce conditionals, please do send that along. I assume that would involve getting rid of :use forms, which is fine. In general, I favor code that is as widely applicable as possible, with any conditional compilation pushed off of the main gist of a function into a utility function that behaves the same on both hosts, if that makes sense. I don't like to see conditional compilation just because there's some difference between clojure and clojurescript where that is not at all relevant to the point of a function. I'm pretty frustrated that clojurescript stopped so far short of pure source compatibility with clojure, really.

jeluard commented 11 years ago

About quoin: I found a simpler alternative which is using quoin for the Clojure part and a native google closure method for ClojureScript. This way I don't have to port quoin (I run into weird issues doing that).

jeluard commented 11 years ago

Hmm looks like I messed something when deleting my cljs branch.. Not sure how to clean that. Is it fine if I reopen a new pull-request with my new changes?

davidsantiago commented 11 years ago

That is a better alternative than porting quoin, go ahead with that. Not sure what you're asking about the branch, but yes, a new PR is fine.