Closed jdickey closed 8 years ago
And, once again, effort-saving features of libraries tend to have more limited use cases than is immediately obvious.
Consider the line of code
nodes.map { |node| Ox.dump node, indent: -1 }.map(&:strip).join
when nodes
is an array containing an Ox::Element
instance created from parsing the HTML
<li>First <em>new</em>\n item</li>
What the Ruby code does is
-1
indent tells Ox to skip inserting newlines between child nodes (and that seems to be the only place that behaviour is documented :disappointed:);The output produced from that input HTML is
<li>First <em>new</em>item.</li>
Notice the lack of any space between the end of the em
element and the item.
string (text node) following it; since em
is not a block element, correct browser behaviour is to render the text representations of the two nodes immediately consecutively.
This is not what we want.
It appears that we must walk each node explicitly and inspect its child nodes, such that
This is basic stuff, and it just boggles the mind that none of the Markdown- or HTML/XML-parsing libraries evaluated to this point seem to include it as standard tooling.
Yep. Not immediately extracting the services we've developed thus far (ContentSelection
and MarkdownToHtml
) to separate Gems was a conscious decision taken "to speed things up". See the opening title of this post.
Obviously, the above requires an extensive rewrite of the Prolog::Services::MarkdownToHtml::Dumper
class. At the same time, the existing test code for that class doesn't come close to the granularity specified in the above comment, with only a single test for an element with child elements. :unamused:
Right. What we've been doing up to now is iterating over the *top-level nodes) we've been passed in, using code very much like
nodes.map { |node| Ox.dump(node, indent: -1).strip }.join
This Works Just Fine unless any top-level node node
has multiple child nodes; i.e., a mixture of ordinary text strings and Ox::Element
instances. Broadly, we want to walk the given node
tree and, for each tree-walked value of node
node
and it's a string;node
is an element;node.nodes.count > 1
), we walk those child nodes, applying
Complicated enough for you? Does a factory + several command-pattern classes that walks the source Ox::Node
tree seem like the quickest/easiest/preferably both way to get there?
The NodeCleanup#call
method would take an Ox::Node
as its parameter and return an array of objects, each having an #initialize
method taking an Ox::Node
as a parameter, as well as a #call
method that takes no parameters. Added to the array returned from #call
would be
NodeCleanup::SimpleElement
instance if the passed-in node
is a simple element (having a single child node which is a string);NodeCleanup::Element
instance if the passed-in node has multiple child nodes or if the single child node is itself an Element. That class in turn would make use of
NodeCleanup::StringNode
instance for any string nodes. Formerly thought of as having separate classes for initial, terminal, and intermediate string nodes, the differences were discovered to be absolutely trivial and the classes merged; and;NodeCleanup::Element
instance for any child nodes that are Elements.The array returned from NodeCleanup#call
would therefore correspond to the direct child nodes of the passed-in node. Iteratively calling #call
on each instance would then produce a sequence of Ox::Node
(or Ox::Element
) instances which are guaranteed to have cleaned-up text-string content. Rendering the containing element (the parameter to #call
itself), using these new Node
s, is the responsibility of the caller.
The new classes, listed, are
NodeCleanup
(formerly NodeCleanup::Factory
);NodeCleanup::SimpleElement
;NodeCleanup::Element
; andNodeCleanup::StringNode
.We're getting there.
So, after an embarrassing discovery that rushed Commit ca31742 into existence, we have one of the seven requirements for MarkdownToHtml::Dumper::NodeCleanup::Element
in the can. Recapping, those are:
p
elements or two div
elements, but not between two span
elements or between an em
and span
elements) will be removed.Burnout can be overcome, yes? Probably with no more effort than that required for time to stand still while getting Real Work™ done. 😷
Although, in theory, Items 3, 5, and 6 should have been taken care of by Item 1, though that still needs to be formally verified. (At this stage, we don't trust; we verify.)
Next up is a NodeCleanup::Element::TrimEnforcer
class, the purpose of which can be described as
Walk tree of child elements, and within each element, iterate its (direct) child nodes, enforcing position-dependent rules on spacing within strings which occur within multiple child nodes of a single parent node (in practice, Element). Spacing policies vary depending on whether a string is an initial, intermediate, or terminal child node of a parent element.
For example, in the HTML markup
<p>This is <em>emphasised</em> content.</p>
the string
This is
is an:initial
child of thep
element; the stringcontent.
is a:terminal
child of that element, and the stringemphasised
is the sole child of the enclosedem
element (and will thus not be touched by this class).
Fortunately, most of the code for implementing this already exists and is covered by tests that appear sufficiently comprehensive.
Whee! Well, that was educational, at least, in figuring out what needs to be done and in what order.
NodeCleanup::Element::TrimEnforcer
really does need to clean up nodes that are single elements with string content; this will let Element
deal with markup like \n <p> This is a test.\n</p>\n
; it seems now that the trailing whitespace isn't getting trimmed, and it should.NodeCleanup::Element
; that's at the level of reformatting, but it does make the code clearer, if not any better. Coming (again) soon.NodeCleanup
class should simply be a wrapper that feeds a single Ox::Element
or a collection of same to NodeCleanup::Element
. (This is why TrimEnforcer
needs its upgrade; if we pass in a collection of Ox::Element
instances where any of them are simple single elements, nothing currently does the right thing. (That is what SimpleElement
was built for, by the way; that and thinking through how to get this into real code).The Buddha is reputed to have said something similar to "it does not matter how fast you go, so long as you do not stop". He neglected to add any caveats about the direction of your movement. :stuck_out_tongue_winking_eye:
That first item in the immediate preceding comment has a few too many assumptions buried in it, that do an excellent imitation of claymores when you start digging too rashly, because any single element can serve as the root of a subtree of arbitrary content and depth.
This should have been obvious; again, we led ourselves down the garden path by coming up with some relatively simple test data to get started, and then forgetting that this generalisation is necessary; only implementing the few simple cases.
There's a reasonably simple general solution for this; we're just too burnt out to pull it out of the relevant orifice at the moment. And that clock ticking is doing wonders for our migraine.
Commit bead957 does two things: it throws out a few days of work (mostly wrt NodeCleanup::Element
, which is now completely gone), and it does what it bloody says on the tin! If anybody's wondering where the mature fertiliser went, our brains seem entirely interchangeable with it recently… :anguished:
So now we're back up to the top level in Prolog::Services::MarkdownToHtml
, and we're back to considering The $64 Question: what should that class' API look like?
The current code looks like this:
module Prolog
module Services
# Service to convert Markdown content to equivalent HTML.
class MarkdownToHtml
attr_reader :content, :wrap_with
def initialize(wrap_with: :div)
@wrap_with = wrap_with
self
end
def call(content:)
@content = Dumper.new(node: container_node(content)).to_s
self
end
private
def container_node(content)
ContainerNode.new(content: content, wrap_with: wrap_with).to_node
end
end # class Prolog::Services::MarkdownToHtml
end
end
Given a Markdown string with a single root element, it wraps the HTML representation of that content in an outer "container" (defaulting to a :div
element), and dumps a (cleaned-up) fragment of equivalent HTML to the @content
instance variable, from which it can be accessed via the #content
attribute-reader method.
There are several problems with this.
We want to convert Markdown content to HTML. Whether we want to wrap that HTML in another, containing, element should not be hard-wired into the "conversion" class.
When #call
is called with content of
### This is a Heading.
This is the first paragraph.
* First list item.
* Second list item
* _Last_ list item.
Final paragraph.
the converted content is
<div id="body-content" data-contribution-counter="0"><p>Final paragraph.</p></div>
Oops. What happened to the heading, the first paragraph, and the list? The HTML/XML parsing toolkit being used (Ox), when asked to parse a number of sibling nodes, returns only the last one parsed. This is a Problem that's our responsibility to deal with.
As a result of the "overly presentational" problem described above, there is a rather pointless #initialize
method and the worker method, #call
, is an instance method. Ditching the initialiser and reimplementing #call
as a class method returning the converted markup will allow client code to call
markup = Prolog::Services::MarkdownToHtml.call content: our_markdown_content
rather than the current
markup = Prolog::Services::MarkdownToHtml.new.call(content: our_markdown_content).content
Most devs would likely prefer the first example to the second, so long as no other instance methods on MarkdownToHtml
are available. They presently aren't, and we don't envision that changing in future. And, last but hardly least,
data-contribution-counter
.In another bug hidden deep within the code, the data-contribution-counter
attribute on the container element, which is intended to tell the Meldd code what the next contribution ID should be, is presently hardcoded to zero. This is an obvious placeholder bug, that goes away if we remove the presentational aspects from the class.
MarkdownToHtml.call
will be the only method intended for public use. As a class method, no instantiation is required. It takes a single keyword parameter, content
, which must be a string containing one or more elements of valid Markdown content. (Raw text will be interpreted as a sequence of paragraphs, separated as specified by the presence of newline characters in the text.)
Examples:
> markup = MarkdownToHtml.call content: "This is simple raw text.\nThis is more text."
=> "<p>This is simple raw text.</p><p>This is more text.</p>"
> content = <<~ENDIT
" This is a paragraph.
"
" * *First* list item.
" * **_Second_** list item.
" * Last list item.
"
" This is closing text.
" ENDIT
=> "This is a paragraph.\n\n* *First* list item.\n* **_Second_** list item.\n* Last list item.\n\nThis is closing text.\n"
> markup = Prolog::Services::MarkdownToHtml.call content: content
=> "<p>This is a paragraph.</p><ul><li><em>First</em> list item.</li><li><strong><em>Second</em></strong> list item.</li><li>Last list item.</li></ul><p>This is closing text.</p>"
In short, purely a conversion utility.
That should and will be coded as part of an application's UI implementation; it has no business here.
Our Markdown-to-HTML conversion rules on the HTML::Pipeline Gem to do the heavy lifting, and on Ox to parse and manipulate markup, including parser output. This presents a problem, in that Ox requires valid XML as input, which HTML::Pipeline provides, except for void elements. When HTML::Pipeline generates a br
tag, for example, it's rendered as HTML3.2-standard <br>
, not as XHTML-standard/HTML5-optional <br/>
.
This can in turn be traced to the github-markdown
Gem on which HTML::Pipeline depends for its Markdown parsing; as the linked RubyGems page states,
THIS GEM IS NOT MAINTAINED AND NOT SUPPORTED.
No link to source code exists on that page, and a search of GitHub provides no joy. As we wrote in HTML::Pipeline Issue 246 some three weeks ago, it's abandonware, and any bugs in it are there until HTML::Pipeline chooses another parser — which they've made quite plain they're in no hurry to do.
So we'll have to do it ourselves, and the logical place to search for and replace generated void elements would seem to be in MarkdownToHtml::Renderer::HtmlPipelineConverter::Internals.cleanup
.
Pfffft. So close.
So how did we determine we needed a fix? We're rewriting the test suite for the top-level MarkdownToHtml
service class as per the "Replacement API" in the previous comment, and we added a test
describe 'Prolog::Services::MarkdownToHtml' do
let(:described_class) { Prolog::Services::MarkdownToHtml }
# ...
describe 'has a .call method that' do
# ...
describe 'returns' do
let(:actual) { described_class.call content: @content }
after { expect(actual).must_equal @expected }
# ...
it 'a para with content separated by a <br/> tag at an input newline' do
@content = "This is a test.\nThis is another test."
@expected = '<p>This is a test.<br/>This is another test.</p>'
end
end # describe 'returns'
end # describe 'has a .call method that'
# ...
end
This test fails:
$ ruby -Itest test/prolog/services/markdown_to_html_test.rb
WARN: Unresolved specs during Gem::Specification.reset:
byebug (~> 8.0)
WARN: Clearing out unresolved specs.
Please report a bug if this causes problems.
# Running tests with run options --seed 18757:
..F
Failure:
Prolog::Services::MarkdownToHtml::has a .call method that::returns#test_0004_a para with content separated by a <br/> tag at an input newline [test/prolog/services/markdown_to_html_test.rb:23]
Minitest::Assertion: --- expected
+++ actual
@@ -1 +1 @@
-"<p>This is a test.<br/>This is another test.</p>"
+"<p>This is a test.<br>This is another test.</p>"
...
Finished tests in 0.086136s, 69.6571 tests/s, 81.2667 assertions/s.
6 tests, 7 assertions, 1 failures, 0 errors, 0 skips
$
Note the difference between the reported "expected" and "actual" values.
After Commit e67745c, we've technically completed work minimally required for this issue (and PR #24), which would allow us to build Gem release 0.6.0 and close the milestone. Life is suddenly much better than it was a couple of days ago.
One quibble: the whole reason Gem release 0.6.0 got re-tasked for this, as opposed to the ProposeEditContribution
use case, was because this service was accreted onto the use case Gem, blocking further progress on use cases. The first thing you learn when a hand grenade blows up in your face (and you survive) is that you really, really don't want to do that ever again.
We're going to sleep on it, but our strong intention is to take a couple of hours tomorrow, extract the MarkdownToHtml
service into a separate Gem, and have the prolog-use_cases
Gem depend on that. This would serve to (largely) decouple future maintenance of the service from development/maintenance of the use-case Gem. Also, since the service exposes no proprietary IP relevant to our application, releasing the service Gem as open source would be feasible. Doing this would allow (future) maintenance to use several important, Useful tools and services for free that must be paid for when used by proprietary code.
Closed as of Commit 32357dd.
Currently, when you call
you get handed back a string of HTML (reformatted for readability; actual is a continuous string without line breaks) like
This goes back to being excessively presentation-oriented, as we thought we'd picked up on before. This is coming back to bite us anytime we deal with possibly rich content, whether it be proposing an edit contribution, publishing an article, whatever. It needs to be ripped out as default behaviour. Rather,
MarkdownToHtml#call
should assign just the converted markup to be readable from#content
.We'll want a separate class/decorator module to do presentational shtuff like wrapping content in a
div
or other container, with thedata-contribution-counter
attribute along for the ride.This would also fix the bug that the actual setting of the
data-contribution-counter
value was obviously something that fell through the cracks, as this line demonstrates. :disappointed: