clj-commons / hickory

HTML as data
Other
637 stars 52 forks source link

Jsdom #20

Closed skrat closed 1 year ago

skrat commented 10 years ago

Partially fixing #17 , I could only observe parse-fragment to not work as expected. Additionally replacing wholeText acessor for nodeValue as it's use didn't yield the correct text in both node.js and browser.

davidsantiago commented 10 years ago

@jeluard I made some comments on this after looking at the code. Any thoughts?

jeluard commented 10 years ago

The wholeText change looks a little suspicious. wholeText seems more adapted in this case than nodeValue and actually reflects the original clojure code. I would be interested in seeing the inconsistencies due to its usage. Maybe nodeValue could only be used for the nodejs implementation? Also it looks like there are still some wholeText usages after the change. Is that intended?

jeluard commented 10 years ago

Overall the updated code looks a little harder to follow than before. Maybe a different approach could be to polyfill the DOM API currently used on top of jsdom? This way the implementation would be more agnostic (as opposed to jsdom specific), assuming we are currently relying on the standard w3c DOM API.

davidsantiago commented 10 years ago

Hm, yeah, I understand very little of what you just said. Can you shoot me an email at [my github username] at gmail?

skrat commented 10 years ago

Relevant question about text nodes http://stackoverflow.com/questions/12286975/when-working-with-text-nodes-should-i-use-the-data-nodevalue-textcontent

I haven't seen wholeText used before, it looks like DOM Level 3 property. As said, it's usage didn't yield expected results, and I don't see this property to be working in both node.js and browser environment. It might work in clojure, but in clojurescript it doesn't.

As of line 62, I simply wanted the node-type to be the same on both platforms. Alternatively you can provide note-type definition in bot branches of the big (if.

@jeluard I don't get the last comment either. More agnostic, but how exactly? By polyfilling, do you mean injecting jsdom constructors to the globals? That doesn't sound good to me.

skrat commented 10 years ago

The DOMParser-fromString change IMO is fully backwards compatible, I've only seen it ((DOMParser.)) used in public parse function where I replaced it with single DOMParser-fromString call, that either uses browsers (.parseFromString (js/DOMParser.) s) or node/jsdom's function call. I agree that the big if with defs is not so nice, but works pretty well without any further ceremonies. Let me stress again that my point is not to solely make it work on node.js, I need it to work equally well in browser too, and I don't see any breaking changes, although I haven't used hickory before, so it's hard for me to judge the wholeText change. I just don't see this property working.

jeluard commented 10 years ago

wholeText definitively works with ClojureScript both under Chrome / Firefox and phantomJS / slimerJS. I use hickory to parse HTML every day in browsers and never got an issue with wholeText. Can you detail what kind of issue you are talking about? Do you get unexpected parse results or do you see environment no implementing wholeText? In the former case it would be great to have new tests highlighting the wrong behavior.

skrat commented 10 years ago

@jeluard I see it now, wholeText seems to be present in the output of DOMParser, but it isn't available on the DOM found directly in the browser. Eg. when you open any page in Chrome/FF, jump into console, and run document.querySelector('a').childNodes[0].wholeText (assuming there's text node in that random anchor), you'll get undefined.

Anyway, the point is that jsdom doesn't have it, and for Text nodes, nodeValue from DOM Level 1 should give you what you exepct: the contents of a text node.

Could you explain why you chose to use wholeText instead of nodeValue for Text nodes?

jeluard commented 10 years ago

Regarding polyfilling: current ClojureScript implementation follows the official w3c API as far as I can tell. Usually in JS world (maybe more specifically browsers, I am not familiar with nodejs) when features are missing from a browser (pretty common) you polyfill them. The idea is to implement the API as it should be using JavaScript. In our case this would mean rely on jsdom to implement what hickory currently use so that no change is needed to the ClojureScript implementation. This can be done directly in an extra JS file that would be provided to nodejs or directly in core.cljs.

I believe this approach could be superior because:

@skrat what do you think? Is it something you usually see in nodejs world?

jeluard commented 10 years ago

Maybe nodeValue could be used as a fallback when wholeText does not exist?

skrat commented 10 years ago

Unfortunately jsdom is the only viable option so far, the only one exposing (exporting) constructors for NodeList, Node and NamedNodeMap. Another thing is that is uses AttributeList for node.attributes. We'd still have to extend that one, and only when using jsdom.

So while the idea of polyfilling, and keeping the original code intact sounds (of course) very good, in practice, it's not so easy. It also is not a common pattern in node.js.

It is true that the parsing bit can be done by the client of hickory, but making people extend types and all that is not so nice. Additionally I would also have to patch jsdom to support wholeText. IMO it's easier to stick with particular implementation.

For the rest, I would be really interested in the explanation of why you use wholeText? To be honest I don't even get its existence. It says it represents the contents of this text node, and all adjacent ones. But in practice you don't have parsers that produce adjacent text nodes, or do you? Even if that'd be the case, then you would end up with that text occuring multiple times in the parsed structure, because for each text node, you would end up with text from all neighboring nodes.

IMO what you want is the text content of the node currently being parsed and nothing else, ie. what nodeValue has.

jeluard commented 10 years ago

wholeText was part of the original clojure implementation and has been used in the ClojureScript one to mimic the original behavior. Probably we should stick with its usage for platform that provide it and have a fallback for those which don't. An option is to simply check for its existence and fallback to nodeValue when needed. If you have problematic tests with the current implementation it would be a great addition.

As for the other comment it's more about style, I personally don't care that much as long as hickory still works as it used to with browsers. No much style difference in this regard between Clojure / ClojureScript.

In any case it's a good thing to have hickory support on more platforms.

skrat commented 10 years ago

Ok, so is (or (aget this "wholeText") (aget this "nodeValue")) acceptable in both as-hickory and as-hiccup? About the jsdom / polyfill issue, is the current solution in this pull request acceptable or shall I change it somehow?

jeluard commented 10 years ago

Yes (or (aget this "wholeText") (aget this "nodeValue")) sounds better.

As for the change style, I am not quite sure how nested def / defn are supposed to work. I remember some comments on Clojure / ClojureScript mailing list regarding that king of usage. It might work now but may break at some point. Would there be a way to avoid those nested defs?

skrat commented 10 years ago

I think so, but that would mean doing the requireing and ifs all at the runtime, every time you call to parse, instead of just once during the load time, which is better IMO.

jeluard commented 10 years ago

Maybe it would be worth starting a thread on ClojureScript mailing list to check what is the idiomatic way to handle this scenario?

skrat commented 10 years ago

I did so over here. Additionally I have to use this line to have it working with master

(js/Object.defineProperty (.-prototype Text) "wholeText"
  #js {:get (fn [] (js* "this.nodeValue"))})
njordhov commented 8 years ago

wholeTextmay lead to duplicate text whenen used in clojurescript. Issue #34 provides a replicable case. Fixed in pull request #35.

jeluard commented 8 years ago

Have you considered using nodeValue as suggested in this thread? Also it would be great to align both clojure and clojurescript code. Do you mind testing the same change for clojure?

njordhov commented 8 years ago

Using DOM's nodeValue is a reasonable replacement for the incorrect wholeText property. Another option (suggested in PR #35) is goog.dom/getRawTextContent, taking advantage of the google closure dom library which is available for clojurescript both for browsers and nodejs. Google's implementation actually calls nodeValue but using the abstraction could have forward compatibility benefits. Neither are available for jsoup used in the clojure implementation.

port19x commented 1 year ago

Closing, as #17 has been fixed by #33 and I see no value added by this change, not to speak of the massive merge conflicts that would need to be resolved to make this compatible with current hickory