TheProlog / prolog-use_cases

Use-case layer for Meldd/Prolog application.
0 stars 0 forks source link

Develop "Validate Selection" Use Case. #30

Closed jdickey closed 8 years ago

jdickey commented 8 years ago

As specified in this Wiki page. In the app, success of this use case is a prerequisite for entering the "Propose Edit Contribution" use case.

Once this is working, should we then (or later) strip now-redundant validation out of that use case?

Implementation of this is the prerequisite for build and release of Gem Release 0.8.0.

jdickey commented 8 years ago

I can see the discussion in the rails-core team now:

Member A: Hey, we have a problem. We implement a #delegate method in ActiveModel that will break any code that expects the standard-library version. We really shouldn't break the standard library…should we? Member D: Don't worry about it. Nobody's going to include both the standard library forwardable and our version and not use our semantics. Member A: But…I did. Member D: Next item on the agenda is…

Veins of Kryptonite marring the Ruby.

jdickey commented 8 years ago

This is gonna get greasy!

Let's say that we have just about the simplest content possible as original content:

<p>This is sample content.</p>

Someone proposes a contribution:

<p>This is <a id="contribution-1-begin"></a>sample<a id="contribution-1-end"></a> content.</p>

Easy-peasy. We wouldn't need a real parser for that; a regular expression could pull out the contribution ID and wrapped content:

target = '<a id="contribution-(\d+?)-begin"></a>(.+?)<a id="contribution-\1-end"></a>'
re = Regexp.new target
matches = re.match markup
matches.captures.count
# => 2
matches.captures.count[0]
# => "1"
matches.captures.count[1]
# => "sample"

Life gets a bit more complicated when we have two overlapping contributions (without them both being Proposed, obviously)

<p>This is <a id="contribution-27-begin"></a><a id="contribution-9-begin"></a>sample<a id="contribution-27-end"></a> content.<a id="contribution-9-end"></a></p>

This can't be parsed fully by a single regular expression. Time to break out the parser

require 'ox'
require 'awesome_print'

markup = '<p>This is <a id="contribution-27-begin"></a><a id="contribution-9-begin"></a>sample<a id="contribution-27-end"></a> content.<a id="contribution-9-end"></a></p>'
para = Ox.parse markup
ap para.nodes 

The output from that last looks something like

[
    [0] "This is ",
    [1] #<Ox::Element:0x007f7f8a7c14b8 @value="a", @nodes=[], @attributes={:id=>"contribution-27-begin"}>,
    [2] #<Ox::Element:0x007f7f8a7c1378 @value="a", @nodes=[], @attributes={:id=>"contribution-9-begin"}>,
    [3] "sample",
    [4] #<Ox::Element:0x007f7f8a7c1170 @value="a", @nodes=[], @attributes={:id=>"contribution-27-end"}>,
    [5] " content.",
    [6] #<Ox::Element:0x007f7f8a7c0fe0 @value="a", @nodes=[], @attributes={:id=>"contribution-9-end"}>
]

We can then

  1. iterate through the nodes, looking for an <a> element with an id matching the pattern /contribution-(\d+)-begin/
  2. for each one we find, find the matching /contribution-(\d+)-end/ anchor tag pair;
  3. add the range defined by the start of the begin tag pair to the end of the end tag pair to a list;
  4. iterate until done;
  5. see if our to-be-validated endpoint range overlaps any of the detected ranges;
  6. for each matching range, ask the Contribution Repository if that contribution ID matches a Proposed Contribution or one that has already been responded to, failing the validation if any Proposed Contributions are identified.

Right; that's a bit more tedious, but pretty straightforward.


What happens when the begin and end tag pairs aren't direct children of the same parent node? For example,

<p>This is <a id="contribution-27-begin"></a><a id="contribution-9-begin"></a>obviously <em>sample<a id="contribution-27-end"></a> and disposable</em> content.<a id="contribution-9-end"></a></p>  

Reformatting for clarity, we have

<p>This is
  <a id="contribution-27-begin"></a>
  <a id="contribution-9-begin"></a>
  obviously
  <em>sample
    <a id="contribution-27-end"></a>
    and disposable
  </em>
  content.
  <a id="contribution-9-end"></a>
</p>

Contribution 9 is still all on the same level but, to find where Contribution 27 ends, we're going to go hunting. We'd be very surprised if Ox lacks API methods to locate the element we're looking for, and then, ideally, "give me all the markup between Nodes A and B". We'll begin that hunt (later) in the morning.

jdickey commented 8 years ago

There wasn't an "obvious" way to do what we wanted to with Ox, so we fell back to Nokogiri and started poking. We found this nokogiri-xml-range Gem which would be Amazingly Useful if it worked with current Nokogiri. (Hey, it's an 0.1.0 release; what do you expect?)

 pry
[1] pry(main)> require 'nokogiri/xml/range'
true
[2] pry(main)> markup = '<p>This is <a id="contribution-27-begin"></a><a id="contribution-9-begin"></a>obviously <em>sample<a id="contribution-27-end"></a> and disposable</em> content.<a id="contribution-9-end"></a></p>'
"<p>This is <a id=\"contribution-27-begin\"></a><a id=\"contribution-9-begin\"></a>obviously <em>sample<a id=\"contribution-27-end\"></a> and disposable</em> content.<a id=\"contribution-9-end\"></a></p>"
[3] pry(main)> para = Nokogiri.parse markup
#<Nokogiri::XML::Document:0x3fc0f36bab18 name="document" children=[#<Nokogiri::XML::Element:0x3fc0f36b7a58 name="p" children=[#<Nokogiri::XML::Text:0x3fc0f36b7620 "This is ">, #<Nokogiri::XML::Element:0x3fc0f36b7440 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b7350 name="id" value="contribution-27-begin">]>, #<Nokogiri::XML::Element:0x3fc0f36b6a18 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b6928 name="id" value="contribution-9-begin">]>, #<Nokogiri::XML::Text:0x3fc0f36b6004 "obviously ">, #<Nokogiri::XML::Element:0x3fc0f36b3e58 name="em" children=[#<Nokogiri::XML::Text:0x3fc0f36b3a20 "sample">, #<Nokogiri::XML::Element:0x3fc0f36b387c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b37b4 name="id" value="contribution-27-end">]>, #<Nokogiri::XML::Text:0x3fc0f36b2e90 " and disposable">]>, #<Nokogiri::XML::Text:0x3fc0f36b2ad0 " content.">, #<Nokogiri::XML::Element:0x3fc0f36b2940 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b2850 name="id" value="contribution-9-end">]>]>]>
[4] pry(main)> begin_27 = para.search '//a[@id="contribution-27-begin"]'
[#<Nokogiri::XML::Element:0x3fc0f36b7440 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b7350 name="id" value="contribution-27-begin">]>]
[5] pry(main)> end_27 = para.search '//a[@id="contribution-27-end"]'
[#<Nokogiri::XML::Element:0x3fc0f36b387c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b37b4 name="id" value="contribution-27-end">]>]
[6] pry(main)> range = Nokogiri::XML::Range.new begin_27, 0, end_27, -1
#<Nokogiri::XML::Range:0x007f81e6cbd288 @start_container=[#<Nokogiri::XML::Element:0x3fc0f36b7440 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b7350 name="id" value="contribution-27-begin">]>], @start_offset=0, @end_container=[#<Nokogiri::XML::Element:0x3fc0f36b387c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc0f36b37b4 name="id" value="contribution-27-end">]>], @end_offset=-1>
[7] pry(main)> cloned = range.clone_contents
NoMethodError: undefined method `ancestors' for #<Nokogiri::XML::NodeSet:0x007f81e4df69d0>
from /usr/local/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-xml-range-0.1.0/lib/nokogiri/xml/range.rb:133:in `common_ancestor_container'
[8] pry(main)> 

In Nokogiri as presently released, NodeRange instances do not respond to #ancestors, although Node instances do.

We also notice that Nokogiri handles Path queries with fewer surprises than Ox, but that's another rant entirely.

Our options at this point appear to be:

  1. Fix the breakage in nokogiri-xml-range and use that;
  2. Write our own equivalent code;
  3. Write code that slices the raw, single-string HTML from the beginning to ending tag pairs and just grab what's between those as raw markup.

Why isn't Option 3 really an option? We can be reasonably sure that the markup as a whole is valid, well-formed XHTML; any arbitrary range within it is highly likely to need fairly powerful massaging. :mask:

jdickey commented 8 years ago

D'oh! :astonished:

We're making this way More complex than it needs to be. Consider that selections are mad from a UI that presents (formatted) content; in other words, any selection the user makes will be of endpoints where it is "safe" to drop in an <a id="contribution-999-begin"></a> or <a id="contribution-999-end"></a> tag pair, because any selection endpoint will be in or at either end of a text node, where phrasing content is always permitted to be inserted. The <a> tag is treated as phrasing content if it contains only phrasing content. A text string, empty or otherwise, is phrasing content here.

So a more reasonable workflow might be:

  1. Get a copy of the article body content, as HTML;
  2. Insert a dummy endpoint marker tag pair at the selection end, like <a id="dummy-dummy-end"></a>; 3 Insert a dummy endpoint marker tag pair at the selection end;
  3. Parse the combined markup using the parser of choice, failing the proposal if any of our validation conditions are determined to be violated;
  4. Go on from there.

That whole Item 3 under Preconditions just goes away.

jdickey commented 8 years ago

Well, no; Item 3 doesn't "just [go] away"; there are still the questions of whether the HTML converted from the Markdown source (which itself can contain hand-entered HTML) is valid and well-formed, and whether that HTML can be inserted in place of the existing, selected content and the resulting Article Body remains valid and well-formed.

The Markdown-to-HTML conversion introduces its own wrinkle.

Consider a (subset of an) Article Body with the content (reformatted for readability)

<ul>
<li>A first item;</li>
<li>A <em>second</em> item; and
<li>A final item.</li>
</ul>

A Member wishes to propose an Edit Contribution that would change the content A <em>second</em> to simply Another. He engages the UI to select the relevant "text", and enters as proposed replacement content Another.

Following standard Markdown conventions, the entered text Another is rendered into HTML as <p>Another</p>. Replacing the selected content would yield markup as

<ul>
<li>A first item;</li>
<li><p>Another</p> item; and
<li>A final item.</li>
</ul>

which is invalid HTML, and will be flagged as such by a strictly conforming parser (like Ox; Nokogiri will simply hand back tag soup).

"Well, then", you might well say, "we should simply use Nokogiri instead of Ox". The problem with that is that the resulting tag soup, being invalid, is unlikely to be rendered as desired on all browsers, or necessarily consistently between any two, and there is no simple way for the Member to correct it save by selecting the entire list and re-entering a new list.

Should we:

  1. just hold our noses and let Nokogiri hand us back invalid tag soup (that the Ox we use pervasively elsewhere will refuse to parse); or
  2. if the resulting article-body markup is invalid (as above) and the markup is a single "paragraph", strip the enclosing <p> tag pair off and retry validation interpolating the bare text instead; or
  3. play further mind games based on the selected content being part of a list item (and, if so, how?);

And what if they want to add items to or remove items from a list?

How in blazes do other sites handle this crap?

jdickey commented 8 years ago

You know, this would be several hells of a lot easier if we could find a reliable HTML-to-Markdown converter; then simple elements like list items would be simple text-replacements.

And no, we still find the idea that ordinary users should be expected to enter raw HTML a proven recipe for disaster; we're not going there.

jdickey commented 8 years ago

Actually, there is a surprisingly-capable HTML-to-Markdown converter: Pandoc.

There's even a Ruby binding for it, which should help lots. When we say "wrapper", bear in mind that the Gem appears to actually shell out to the pandoc executable, which would increase the load on the server.

"But if it really works…", he says, wistfully.

mitpaladin commented 8 years ago

:P can try Pandoc I'm ok with allowing users to edit HTML, so long as it is validated before allowing it to be submitted

jdickey commented 8 years ago

No need. $DEITY, why didn't we find this/figure it out 2 years ago?!?

Pandoc works nicely.

  1. Install pandoc and pandoc-ruby.
  2. Find a reasonably complex but low-script HTML page; I used a saved copy of http://manuscripts.github.io/markdown-lint as HTML input;
  3. Use pandoc-ruby to convert to Markdown; save that to a file.
  4. Use the command-line Pandoc, or a tool of your choice (I used Typora, my Markdown editor) to read that Markdown file and save it to a new, different HTML file.
  5. Inspect the re-translated and re-generated HTML, comparing the content to the original.

It looks great. We lose the CSS and metadata content, but that's fine; we're only interested in converting fragments, not entire documents. It's certainly good enough for 0.5, and calls into question the need to persist HTML at all; inserting identified anchor tag pairs into Markdown content will Do The Right Thing, as Markdown is a proper superset of HTML.

Win?

mitpaladin commented 8 years ago

:) sounds good

jdickey commented 8 years ago

And no, we can't do everything in Markdown, for the simple reason that the Member is making a selection (and generating endpoints) based on rendered HTML content. But this certainly makes life a lot easier.

jdickey commented 8 years ago

Hmmm. Consider:

$ pry
[1] pry(main)> require 'awesome_print'
false
[2] pry(main)> require 'pandoc-ruby'
true
[3] pry(main)> content = %(<p>This is
[3] pry(main)*   <a id="contribution-27-begin"></a>
[3] pry(main)*   <a id="contribution-9-begin"></a>
[3] pry(main)*   obviously
[3] pry(main)*   <em>sample
[3] pry(main)*     <a id="contribution-27-end"></a>
[3] pry(main)*     and disposable
[3] pry(main)*   </em>
[3] pry(main)*   content.
[3] pry(main)*   <a id="contribution-9-end"></a>
[3] pry(main)* </p>)
"<p>This is\n  <a id=\"contribution-27-begin\"></a>\n  <a id=\"contribution-9-begin\"></a>\n  obviously\n  <em>sample\n    <a id=\"contribution-27-end\"></a>\n    and disposable\n  </em>\n  content.\n  <a id=\"contribution-9-end\"></a>\n</p>"
[4] pry(main)> content = content.lines.map(&:strip).join
"<p>This is<a id=\"contribution-27-begin\"></a><a id=\"contribution-9-begin\"></a>obviously<em>sample<a id=\"contribution-27-end\"></a>and disposable</em>content.<a id=\"contribution-9-end\"></a></p>"
[5] pry(main)> markdown = PandocRuby.convert(content, from: :html, to: :markdown_github)
"This is<span id=\"contribution-27-begin\"></span><span id=\"contribution-9-begin\"></span>obviously*sample<span id=\"contribution-27-end\"></span>and disposable*content.<span id=\"contribution-9-end\"></span>\n"
[6] pry(main)> html = PandocRuby.convert(markdown, from: :markdown_github, to: :html)
"<p>This is<span id=\"contribution-27-begin\"></span><span id=\"contribution-9-begin\"></span>obviously<em>sample<span id=\"contribution-27-end\"></span>and disposable</em>content.<span id=\"contribution-9-end\"></span></p>\n"
[7] pry(main)> ap [content, html]
[
    [0] "<p>This is<a id=\"contribution-27-begin\"></a><a id=\"contribution-9-begin\"></a>obviously<em>sample<a id=\"contribution-27-end\"></a>and disposable</em>content.<a id=\"contribution-9-end\"></a></p>",
    [1] "<p>This is<span id=\"contribution-27-begin\"></span><span id=\"contribution-9-begin\"></span>obviously<em>sample<span id=\"contribution-27-end\"></span>and disposable</em>content.<span id=\"contribution-9-end\"></span></p>\n"
]
nil
[8] pry(main)> exit
$

Pandoc converts our <a id="contribution-27-begin"></a> tag pairs to <span id="contribution-27-begin"></span> tag pairs. While the semantics of the two tags differ, from a functional, parsing perspective, it shouldn't make a difference. Our code that deals with anchor tags "just" has to change to work with spans instead. It beats spending a week trying to implement our own HTML-to-Markdown conversion. :stuck_out_tongue_winking_eye:

jdickey commented 8 years ago

Contrary to the commit essay in Commit 1784f53 (which you should read to understand this), our Prolog::Services::MarkdownToHtml class does indeed convert Markdown with odd embedded HTML correctly.

$ pry
[1] pry(main)> require 'prolog/services/markdown_to_html'
true
[2] pry(main)> markdown = %(- A first item;
[2] pry(main)* - <a id="whatever-begin"></a>A second<a id="whatever-end"></a> item; and
[2] pry(main)* - A final item.)
"- A first item;\n- <a id=\"whatever-begin\"></a>A second<a id=\"whatever-end\"></a> item; and\n- A final item."
[3] pry(main)> html = Prolog::Services::MarkdownToHtml.call(content: markdown)
"<ul><li>A first item;</li><li><a id=\"whatever-begin\"></a>A second<a id=\"whatever-end\"></a> item; and</li><li>A final item.</li></ul>"
[4] pry(main)> puts html;
<ul><li>A first item;</li><li><a id="whatever-begin"></a>A second<a id="whatever-end"></a> item; and</li><li>A final item.</li></ul>
[5] pry(main)> exit
$

The second list item is indeed recognised as a list item. Converting using Pandoc for that stage fails as described in the essay. If we weren't so tired, we'd be deeply, disturbingly confused.

jdickey commented 8 years ago

So now that we have this round-trip capability, being able to convert HTML to Markdown and back again, how to use it?

We want to

  1. take the existing (HTML) content, the endpoints (relative to that HTML content), and the proposed content (in Markdown);
  2. insert marker tag pairs in the HTML at the endpoints specified, producing what we'll call intermediate HTML;
  3. ensure that that intermediate HTML is still valid and well-formed;
  4. convert the intermediate HTML to Markdown as a whole;
  5. replace the content within the marker tag pairs with the proposed Markdown content;
  6. convert that updated Markdown back to HTML, producing candidate HTML;
  7. validate the candidate HTML as being valid and well-formed.

But hey…we're going to need to do this "again" in the Propose Edit Contribution use case, right? So let's extract this into a new service class; call it Prolog::Services::ContentReplacement, with the happy path outlined above. Error conditions, any of which would abort the process, would include

  1. the intermediate HTML fails validation (step 3);
  2. the HTML-to-Markdown conversion step fails;
  3. the Markdown-to-HTML conversion step fails; and
  4. the candidate HTML is not well-formed, valid markup.
jdickey commented 8 years ago

The Wiki page has been updated to show that Replacement Content is treated as HTML and not conver­ted from or to Markdown. Also cleaned up some inconsistencies, particularly whether a Contribution Re­pository is to be a parameter (it isn't; this use case "has no Repository interactions; it operates only on a supplied Article entity" and does not actually interact with the Contribution Repository at all. (Where did that come from?) Also, as far as is presently known, the Authoriser will only be queried as to whether there is a Member presently logged in (in which case its #guest? method would return false); access to a #current_user name should not be necessary at this time.

jdickey commented 8 years ago

Yes, that's a code smell. We should do one of two things:

  1. Use Virtus.model rather than Virtus.value_object for the [form object], and iterate supported attributes for assignment rather than the current mass assignment from **params; or
  2. Rethink our use of form objects, at least in this instance, in favour of something like functional com­mand objects that separate out value objects/entities from validation, and contain no state other than idempotent collaborator instances injected into an otherwise stateless, and therefore immutable, instance.

Option 1 leads to even more ugly code as well. Option 2 is attractive; the next use case we implement that would otherwise use a form object in a similar manner to the ValidateSelection use case's form object, will instead be an experiment implementing this scheme instead. If successful, we'll retrofit existing use cases as time permits — definitely well before The Glorious Future™, and quite likely before a full 1.0 product release.

jdickey commented 8 years ago

Commit 1f59397 rubs our noses in a couple of screw-ups. The form object, by including ActiveModel::Validations, maintains its own error state, which is added to by code such as this. No migration of form-object errors to use-case-instance errors is performed in the use-case object. Pfffft.

If you're in a meta sort of mood, ponder how to notify of an error involving a missing or invalid UI gateway instance for the use case; how to report errors when the error-reporting mechanism itself isn't hooked up? The idea of a "result object", as several functional-approach interactors have used, with separate accessors for #success? and #errors, is starting to look much more attractive; it doesn't care who cares about success/failure or error specifics; it just packages them into an easy-to-kick-around value object. That would, incidentally, punt the need to delegate any #errors or #valid? method from the use-case instance to its form object.