TheProlog / prolog-use_cases

Use-case layer for Meldd/Prolog application.
0 stars 0 forks source link

MarkdownToHtmlConverter#call method MUST NOT wrap the body content. #23

Closed jdickey closed 8 years ago

jdickey commented 8 years ago

Currently, when you call

MarkdownToHtml.new.call("This is *emphasised* content.").content

you get handed back a string of HTML (reformatted for readability; actual is a continuous string without line breaks) like

<div id="body-content" data-contribution-counter="0">
<p>This is<em>emphasised</em>content.</p>
</div>

This goes back to being excessively presentation-oriented, as we thought we'd picked up on before. This is coming back to bite us anytime we deal with possibly rich content, whether it be proposing an edit contribution, publishing an article, whatever. It needs to be ripped out as default behaviour. Rather, MarkdownToHtml#call should assign just the converted markup to be readable from #content.

We'll want a separate class/decorator module to do presentational shtuff like wrapping content in a div or other container, with the data-contribution-counter attribute along for the ride.

This would also fix the bug that the actual setting of the data-contribution-counter value was obviously something that fell through the cracks, as this line demonstrates. :disappointed:

jdickey commented 8 years ago

Hurrying Only Slows You Down, Again.

And, once again, effort-saving features of libraries tend to have more limited use cases than is immediately obvious.

Consider the line of code

          nodes.map { |node| Ox.dump node, indent: -1 }.map(&:strip).join

when nodes is an array containing an Ox::Element instance created from parsing the HTML

<li>First <em>new</em>\n item</li>

What the Ruby code does is

  1. Dump the node (and any child nodes) to a string; the -1 indent tells Ox to skip inserting newlines between child nodes (and that seems to be the only place that behaviour is documented :disappointed:);
  2. Strip leading/trailing whitespace from the output of each dumped (top-level) node; in the example, there's only one; and
  3. Concatenate the (stripped) strings into a single string.

The output produced from that input HTML is

<li>First <em>new</em>item.</li>

Notice the lack of any space between the end of the em element and the item. string (text node) following it; since em is not a block element, correct browser behaviour is to render the text representations of the two nodes immediately consecutively.

This is not what we want.

It's All So Simple. Except...

It appears that we must walk each node explicitly and inspect its child nodes, such that

  1. if the only child node is a text node (string), then strip leading/trailing whitespace;
  2. if the first of multiple child nodes is a text node (and therefore a child element is the next child), then remove leading whitespace and, if trailing whitespace exists, replace it by a single space;
  3. if the last of multiple child nodes is a text node, remove trailing whitespace and, if leading whitespace exists, replace it by a single space;
  4. for a text node that is neither the first nor last child of its parent element, replace any leading or trailing whitespace by a single space. Do not add whitespace if none previously existed;
  5. for each of the previous three cases, replace empty or all-whitespace content of a text node with a single space;
  6. recurse for child elements.

This is basic stuff, and it just boggles the mind that none of the Markdown- or HTML/XML-parsing libraries evaluated to this point seem to include it as standard tooling.

We Preempt the Next Use Case Gem Version for This?!?

Yep. Not immediately extracting the services we've developed thus far (ContentSelection and MarkdownToHtml) to separate Gems was a conscious decision taken "to speed things up". See the opening title of this post.

jdickey commented 8 years ago

Obviously, the above requires an extensive rewrite of the Prolog::Services::MarkdownToHtml::Dumper class. At the same time, the existing test code for that class doesn't come close to the granularity specified in the above comment, with only a single test for an element with child elements. :unamused:

jdickey commented 8 years ago

"It's A Bit More Complicated Than That"

Right. What we've been doing up to now is iterating over the *top-level nodes) we've been passed in, using code very much like

nodes.map { |node| Ox.dump(node, indent: -1).strip }.join

This Works Just Fine unless any top-level node node has multiple child nodes; i.e., a mixture of ordinary text strings and Ox::Element instances. Broadly, we want to walk the given node tree and, for each tree-walked value of node

  1. clean up the child-node string when there's only one child of node and it's a string;
  2. recurse if the one child of node is an element;
  3. and when there are multiple child nodes (i.e., node.nodes.count > 1), we walk those child nodes, applying
    1. one cleanup rule if the first child child is a string;
    2. another cleanup rule if the last child node is a string;
    3. yet another cleanup rule for any intermediate child nodes that are strings; and
    4. recurse for any elements.

Complicated enough for you? Does a factory + several command-pattern classes that walks the source Ox::Node tree seem like the quickest/easiest/preferably both way to get there?

jdickey commented 8 years ago

"I Can Name That Tune/Implement That Charlie-Fox In…"

The NodeCleanup#callmethod would take an Ox::Node as its parameter and return an array of objects, each having an #initialize method taking an Ox::Node as a parameter, as well as a #call method that takes no parameters. Added to the array returned from #call would be

  1. a NodeCleanup::SimpleElement instance if the passed-in node is a simple element (having a single child node which is a string);
  2. a NodeCleanup::Element instance if the passed-in node has multiple child nodes or if the single child node is itself an Element. That class in turn would make use of
    1. a NodeCleanup::StringNode instance for any string nodes. Formerly thought of as having separate classes for initial, terminal, and intermediate string nodes, the differences were discovered to be absolutely trivial and the classes merged; and;
    2. a NodeCleanup::Element instance for any child nodes that are Elements.

The array returned from NodeCleanup#call would therefore correspond to the direct child nodes of the passed-in node. Iteratively calling #call on each instance would then produce a sequence of Ox::Node (or Ox::Element) instances which are guaranteed to have cleaned-up text-string content. Rendering the containing element (the parameter to #call itself), using these new Nodes, is the responsibility of the caller.

The new classes, listed, are

  1. NodeCleanup (formerly NodeCleanup::Factory);
  2. NodeCleanup::SimpleElement;
  3. NodeCleanup::Element; and
  4. NodeCleanup::StringNode.

We're getting there.

jdickey commented 8 years ago

So, after an embarrassing discovery that rushed Commit ca31742 into existence, we have one of the seven requirements for MarkdownToHtml::Dumper::NodeCleanup::Element in the can. Recapping, those are:

  1. Any sequence of whitespace characters within a string will be replaced by a single space;
  2. Any string which is the first child node of an element will have any leading whitespace removed;
  3. Any string which is the first child node of an element will have any trailing whitespace replaced by a single space;
  4. Any string which is the last child node of an element will have any trailing whitespace removed;
  5. Any string which is the last child node of an element will have any leading whitespace replaced by a single space;
  6. Any string which is an intermediating child node of an element, i.e., which occurs between one child element of the parent element and another child element, will have any leading or trailing whitespace replaced by a single space;
  7. Any whitespace between block level elements (such as between two p elements or two div elements, but not between two span elements or between an em and span elements) will be removed.

Burnout can be overcome, yes? Probably with no more effort than that required for time to stand still while getting Real Work™ done. 😷

Although, in theory, Items 3, 5, and 6 should have been taken care of by Item 1, though that still needs to be formally verified. (At this stage, we don't trust; we verify.)

jdickey commented 8 years ago

Next up is a NodeCleanup::Element::TrimEnforcer class, the purpose of which can be described as

Walk tree of child elements, and within each element, iterate its (direct) child nodes, enforcing position-dependent rules on spacing within strings which occur within multiple child nodes of a single parent node (in practice, Element). Spacing policies vary depending on whether a string is an initial, intermediate, or terminal child node of a parent element.

For example, in the HTML markup

<p>This is <em>emphasised</em> content.</p>

the string This is is an :initial child of the p element; the string content. is a :terminal child of that element, and the string emphasised is the sole child of the enclosed em element (and will thus not be touched by this class).

Fortunately, most of the code for implementing this already exists and is covered by tests that appear sufficiently comprehensive.

jdickey commented 8 years ago

Whee! Well, that was educational, at least, in figuring out what needs to be done and in what order.

  1. It turns out that NodeCleanup::Element::TrimEnforcer really does need to clean up nodes that are single elements with string content; this will let Element deal with markup like \n <p> This is a test.\n</p>\n; it seems now that the trailing whitespace isn't getting trimmed, and it should.
  2. We remembered how to do the cleanup we want on NodeCleanup::Element; that's at the level of reformatting, but it does make the code clearer, if not any better. Coming (again) soon.
  3. Finally, our NodeCleanup class should simply be a wrapper that feeds a single Ox::Element or a collection of same to NodeCleanup::Element. (This is why TrimEnforcer needs its upgrade; if we pass in a collection of Ox::Element instances where any of them are simple single elements, nothing currently does the right thing. (That is what SimpleElement was built for, by the way; that and thinking through how to get this into real code).

The Buddha is reputed to have said something similar to "it does not matter how fast you go, so long as you do not stop". He neglected to add any caveats about the direction of your movement. :stuck_out_tongue_winking_eye:

jdickey commented 8 years ago

That first item in the immediate preceding comment has a few too many assumptions buried in it, that do an excellent imitation of claymores when you start digging too rashly, because any single element can serve as the root of a subtree of arbitrary content and depth.

This should have been obvious; again, we led ourselves down the garden path by coming up with some relatively simple test data to get started, and then forgetting that this generalisation is necessary; only implementing the few simple cases.

There's a reasonably simple general solution for this; we're just too burnt out to pull it out of the relevant orifice at the moment. And that clock ticking is doing wonders for our migraine.

jdickey commented 8 years ago

Commit bead957 does two things: it throws out a few days of work (mostly wrt NodeCleanup::Element, which is now completely gone), and it does what it bloody says on the tin! If anybody's wondering where the mature fertiliser went, our brains seem entirely interchangeable with it recently… :anguished:

jdickey commented 8 years ago

So now we're back up to the top level in Prolog::Services::MarkdownToHtml, and we're back to considering The $64 Question: what should that class' API look like?

What We Have and Why It's Not What We Want

The current code looks like this:

module Prolog
  module Services
    # Service to convert Markdown content to equivalent HTML.
    class MarkdownToHtml
      attr_reader :content, :wrap_with

      def initialize(wrap_with: :div)
        @wrap_with = wrap_with
        self
      end

      def call(content:)
        @content = Dumper.new(node: container_node(content)).to_s
        self
      end

      private

      def container_node(content)
        ContainerNode.new(content: content, wrap_with: wrap_with).to_node
      end
    end # class Prolog::Services::MarkdownToHtml
  end
end

Given a Markdown string with a single root element, it wraps the HTML representation of that content in an outer "container" (defaulting to a :div element), and dumps a (cleaned-up) fragment of equivalent HTML to the @content instance variable, from which it can be accessed via the #content attribute-reader method.

Issues

There are several problems with this.

It's overly presentational.

We want to convert Markdown content to HTML. Whether we want to wrap that HTML in another, containing, element should not be hard-wired into the "conversion" class.

It works only for a single element.

When #call is called with content of

### This is a Heading.

This is the first paragraph.

* First list item.
* Second list item
* _Last_ list item.

Final paragraph.

the converted content is

<div id="body-content" data-contribution-counter="0"><p>Final paragraph.</p></div>

Oops. What happened to the heading, the first paragraph, and the list? The HTML/XML parsing toolkit being used (Ox), when asked to parse a number of sibling nodes, returns only the last one parsed. This is a Problem that's our responsibility to deal with.

It carries unnecessary state.

As a result of the "overly presentational" problem described above, there is a rather pointless #initialize method and the worker method, #call, is an instance method. Ditching the initialiser and reimplementing #call as a class method returning the converted markup will allow client code to call

markup = Prolog::Services::MarkdownToHtml.call content: our_markdown_content

rather than the current

markup = Prolog::Services::MarkdownToHtml.new.call(content: our_markdown_content).content

Most devs would likely prefer the first example to the second, so long as no other instance methods on MarkdownToHtml are available. They presently aren't, and we don't envision that changing in future. And, last but hardly least,

Additional work needs to be done to fix data-contribution-counter.

In another bug hidden deep within the code, the data-contribution-counter attribute on the container element, which is intended to tell the Meldd code what the next contribution ID should be, is presently hardcoded to zero. This is an obvious placeholder bug, that goes away if we remove the presentational aspects from the class.

Replacement API

MarkdownToHtml.call will be the only method intended for public use. As a class method, no instantiation is required. It takes a single keyword parameter, content, which must be a string containing one or more elements of valid Markdown content. (Raw text will be interpreted as a sequence of paragraphs, separated as specified by the presence of newline characters in the text.)

Examples:

> markup = MarkdownToHtml.call content: "This is simple raw text.\nThis is more text."
=> "<p>This is simple raw text.</p><p>This is more text.</p>"
> content = <<~ENDIT
"           This is a paragraph.
"           
"           * *First* list item.
"           * **_Second_** list item.
"           * Last list item.
"           
"           This is closing text.
"           ENDIT
=> "This is a paragraph.\n\n* *First* list item.\n* **_Second_** list item.\n* Last list item.\n\nThis is closing text.\n"
> markup = Prolog::Services::MarkdownToHtml.call content: content
=> "<p>This is a paragraph.</p><ul><li><em>First</em> list item.</li><li><strong><em>Second</em></strong> list item.</li><li>Last list item.</li></ul><p>This is closing text.</p>"

In short, purely a conversion utility.

But What About the Presentation Stuff?

That should and will be coded as part of an application's UI implementation; it has no business here.

jdickey commented 8 years ago

Defeat(?) Snatched From the Jaws of Victory(?)

Our Markdown-to-HTML conversion rules on the HTML::Pipeline Gem to do the heavy lifting, and on Ox to parse and manipulate markup, including parser output. This presents a problem, in that Ox requires valid XML as input, which HTML::Pipeline provides, except for void elements. When HTML::Pipeline generates a br tag, for example, it's rendered as HTML3.2-standard <br>, not as XHTML-standard/HTML5-optional <br/>.

This can in turn be traced to the github-markdown Gem on which HTML::Pipeline depends for its Markdown parsing; as the linked RubyGems page states,

THIS GEM IS NOT MAINTAINED AND NOT SUPPORTED.

No link to source code exists on that page, and a search of GitHub provides no joy. As we wrote in HTML::Pipeline Issue 246 some three weeks ago, it's abandonware, and any bugs in it are there until HTML::Pipeline chooses another parser — which they've made quite plain they're in no hurry to do.

So we'll have to do it ourselves, and the logical place to search for and replace generated void elements would seem to be in MarkdownToHtml::Renderer::HtmlPipelineConverter::Internals.cleanup.

Pfffft. So close.


So how did we determine we needed a fix? We're rewriting the test suite for the top-level MarkdownToHtml service class as per the "Replacement API" in the previous comment, and we added a test

describe 'Prolog::Services::MarkdownToHtml' do
  let(:described_class) { Prolog::Services::MarkdownToHtml }
  # ...
  describe 'has a .call method that' do
    # ...
    describe 'returns' do
      let(:actual) { described_class.call content: @content }

      after { expect(actual).must_equal @expected }
      # ...
      it 'a para with content separated by a <br/> tag at an input newline' do
        @content = "This is a test.\nThis is another test."
        @expected = '<p>This is a test.<br/>This is another test.</p>'
      end
    end # describe 'returns'
  end # describe 'has a .call method that'
  # ...
end

This test fails:

$ ruby -Itest test/prolog/services/markdown_to_html_test.rb 
WARN: Unresolved specs during Gem::Specification.reset:
      byebug (~> 8.0)
WARN: Clearing out unresolved specs.
Please report a bug if this causes problems.

# Running tests with run options --seed 18757:

..F
Failure:
Prolog::Services::MarkdownToHtml::has a .call method that::returns#test_0004_a para with content separated by a <br/> tag at an input newline [test/prolog/services/markdown_to_html_test.rb:23]
Minitest::Assertion: --- expected
+++ actual
@@ -1 +1 @@
-"<p>This is a test.<br/>This is another test.</p>"
+"<p>This is a test.<br>This is another test.</p>"

...

Finished tests in 0.086136s, 69.6571 tests/s, 81.2667 assertions/s.

6 tests, 7 assertions, 1 failures, 0 errors, 0 skips
$

Note the difference between the reported "expected" and "actual" values.

jdickey commented 8 years ago

After Commit e67745c, we've technically completed work minimally required for this issue (and PR #24), which would allow us to build Gem release 0.6.0 and close the milestone. Life is suddenly much better than it was a couple of days ago.

One quibble: the whole reason Gem release 0.6.0 got re-tasked for this, as opposed to the ProposeEditContribution use case, was because this service was accreted onto the use case Gem, blocking further progress on use cases. The first thing you learn when a hand grenade blows up in your face (and you survive) is that you really, really don't want to do that ever again.

We're going to sleep on it, but our strong intention is to take a couple of hours tomorrow, extract the MarkdownToHtml service into a separate Gem, and have the prolog-use_cases Gem depend on that. This would serve to (largely) decouple future maintenance of the service from development/maintenance of the use-case Gem. Also, since the service exposes no proprietary IP relevant to our application, releasing the service Gem as open source would be feasible. Doing this would allow (future) maintenance to use several important, Useful tools and services for free that must be paid for when used by proprietary code.

jdickey commented 8 years ago

Closed as of Commit 32357dd.