baynezy / Html2Markdown

A library for converting HTML to markdown syntax in C#
Apache License 2.0

Support OneNote html for bold and italic #60

Open idvorkin opened 7 years ago

idvorkin commented 7 years ago

OneNote encodes its HTML pages in a way that's close to what Html2Markdown supports, but OneNote HTML expresses bold, italics, and similar formatting through inline styles, as follows:

| Property | Example |
| --- | --- |
| font-style | style="font-style:italic" (normal or italic only) |
| font-weight | style="font-weight:bold" (normal or bold only) |
| strike-through | style="text-decoration:line-through" |
| text-align | style="text-align:center" (for block elements only) |
| text-decoration | style="text-decoration:underline" (none or underline only) |
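
A minimal sketch of reading these properties out of a `style` attribute value, in case it helps the discussion. It uses only the .NET base class library; the `InlineStyle` class and its method names are hypothetical and not part of Html2Markdown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Html2Markdown.
public static class InlineStyle
{
    // Parses e.g. "font-weight:bold; font-style:italic" into a property -> value map.
    public static Dictionary<string, string> Parse(string style)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        if (string.IsNullOrWhiteSpace(style)) return result;

        foreach (string declaration in style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0) continue; // skip malformed declarations such as "bold"

            string name = declaration.Substring(0, colon).Trim();
            string value = declaration.Substring(colon + 1).Trim();
            if (name.Length > 0) result[name] = value; // a later declaration overrides an earlier one
        }
        return result;
    }

    private static bool Has(string style, string property, string expected) =>
        Parse(style).TryGetValue(property, out string value) &&
        value.Equals(expected, StringComparison.OrdinalIgnoreCase);

    public static bool IsBold(string style) => Has(style, "font-weight", "bold");
    public static bool IsItalic(string style) => Has(style, "font-style", "italic");
    public static bool IsStrikeThrough(string style) => Has(style, "text-decoration", "line-through");
}
```

Parsing into a map, rather than comparing the whole attribute string, tolerates extra whitespace and several declarations in one attribute.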

I'm willing to make the changes if you tell me how you'd like this fixed.

baynezy commented 7 years ago

@idvorkin - let me complete #61 first. This will make it more straightforward to implement.

baynezy commented 7 years ago

@idvorkin - #61 is complete. If you want to support OneNote HTML, you will need to create a new IScheme implementation; you can extend Markdown. Let me know if that doesn't make sense, or if you need help.

idvorkin commented 7 years ago

I was thinking the OneNote HTML representation of font properties would apply to other tools generating HTML, so we should have it be something the default converter understands.

Based on that, I'm thinking we'd implement it by creating a new CustomReplacer.CustomAction, which I'd include in the Markdown._replacers list. Am I on the right track?

idvorkin commented 7 years ago

I was thinking of only implementing these font decorations when they appear in span elements (where I normally observe them). The spec says these styles can also appear in other elements, where it gets trickier to implement.

Thinking out loud, if we want to implement this for non-span elements, we can take a two-pass approach: 1) Add a span element around the original element's content, carrying the style. 2) Run the span replacement.

For example, imagine the following input:

  <h1 style="font-weight:bold">BLAH</h1>

Step 1: Spanify

  <h1><span style="font-weight:bold">BLAH</span></h1>

Step 2: Run span transformer.
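
The first pass could be sketched roughly as below, assuming HtmlAgilityPack is available. `Spanifier` and `Spanify` are hypothetical names, and moving the whole `style` attribute onto the new span (rather than only the font-related declarations) is a simplification made for this sketch.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper illustrating pass 1 ("spanify"); not part of Html2Markdown.
public static class Spanifier
{
    public static string Spanify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the styled non-span elements before mutating the tree.
        var styled = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element
                        && n.Name != "span"
                        && n.Attributes.Contains("style"))
            .ToList();

        foreach (HtmlNode node in styled)
        {
            // Create the wrapper span and give it the element's style.
            HtmlNode span = doc.CreateElement("span");
            span.SetAttributeValue("style", node.GetAttributeValue("style", ""));

            // Move (not copy) the children so nested styled elements stay in the live tree.
            foreach (HtmlNode child in node.ChildNodes.ToList())
            {
                node.RemoveChild(child);
                span.AppendChild(child);
            }

            node.Attributes.Remove("style");
            node.AppendChild(span);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

After this pass every styled run sits inside a span, so the second pass only has to handle span elements.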

baynezy commented 7 years ago

@idvorkin - Please don't modify Markdown; it exists to support the vanilla Markdown spec. To support OneNote, create a OneNote implementation of IScheme that extends Markdown, as outlined. The parsing functions can live either in your new class or in HtmlParser.

idvorkin commented 7 years ago

As they say, weeks of coding can save hours of design :) Happy to sync in chat/voice/video if that's fastest.

I'd love to better understand your design choice. How do you decide when an HTML representation should be part of the core converter vs a different scheme? The <strong> element you mention is an excellent example. I'd expect it to map to bold in markdown.

baynezy commented 7 years ago

https://gitter.im/Html2Markdown/issue-60

idvorkin commented 7 years ago

FYI, for the transform approach I'm thinking something like this:

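using System.Collections.Generic;
using System.Linq;

// Map each inline style declaration to an HTML element the converter already understands.
var styleToElementName = new Dictionary<string, string>()
{
    {"font-weight:bold","b"},
    {"font-style:italic","i"},
};

var onenoteHTML = @"<td style=''><span style='font-weight:bold'>Expected Bold </span></td>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(onenoteHTML);

foreach (var s2e in styleToElementName)
{
    // Exact match only: a span with several declarations in its style attribute is not selected.
    var styledElements = doc.DocumentNode.SelectNodes($"//span[@style='{s2e.Key}']");
    if (styledElements == null) continue; // SelectNodes returns null when nothing matches

    foreach (var element in styledElements)
    {
        // Rename <span> to <b> or <i>, then drop the style attribute that is now redundant.
        element.Name = s2e.Value;
        element.Attributes.Where(a => a.Name == "style" && a.Value == s2e.Key).ToList()
               .ForEach(a => element.Attributes.Remove(a));
    }
}

aloneguid commented 5 years ago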

Here are the OneNote fixes that work for me. I assume that table cells don't contain line breaks; otherwise this needs extra processing (replacing them with a br tag):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using Html2Markdown.Replacement;
using Html2Markdown.Scheme;
using HtmlAgilityPack;

namespace OneSyncTool.Core
{
   class Html2MarkdownScheme : IScheme
   {
      private readonly Markdown _builtIn = new Markdown();
      private readonly List<IReplacer> _replacers;

      public Html2MarkdownScheme()
      {
         _replacers = new List<IReplacer>(_builtIn.Replacers());

         //OneNote block decoration
         _replacers.Add(new PatternReplacer("<div\\s+style\\s*=\\s*\"position:absolute(.+?)>", ""));
         _replacers.Add(new PatternReplacer("</div>", ""));

         //everything else
         _replacers.Add(new OneNoteHapReplacer());
      }

      public IList<IReplacer> Replacers() => _replacers;

      internal class OneNoteHapReplacer : IReplacer
      {
         public string Replace(string html)
         {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            ProcessFontStyles(doc);
            ProcessTables(doc);

            return doc.DocumentNode.OuterHtml;
         }

         private void ProcessFontStyles(HtmlDocument doc)
         {
            HtmlNodeCollection fontStyles = doc.DocumentNode.SelectNodes("//span[@style]");
            if (fontStyles == null) return; // SelectNodes returns null when nothing matches

            foreach (HtmlNode node in fontStyles)
            {
               string style = node.GetAttributeValue("style", null);
               if (style == null) continue;

               // split into individual declarations, e.g. "font-weight:bold"
               string[] styles = style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim()).ToArray();
               var decorations = new List<string>();
               if (styles.Contains("font-style:italic")) decorations.Add("_");
               if (styles.Contains("font-weight:bold")) decorations.Add("**");
               if (styles.Contains("text-decoration:line-through")) decorations.Add("~~");
               // there's no underline in markdown? ignore it for now

               string replacement = Decorate(node.InnerHtml, decorations);

               // replace the span with the decorated text, not the raw inner HTML
               node.ParentNode.ReplaceChild(doc.CreateTextNode(replacement), node);
            }
         }

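         private void ProcessTables(HtmlDocument doc)
         {
            // relative XPath: only tables that are direct children of the document node are selected
            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("table");
            if (tables == null) return; // SelectNodes returns null when nothing matches

            foreach(HtmlNode table in tables)
            {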
               var s = new StringBuilder();
               bool isHeader = true;

               //there are text nodes in children, they are just line breaks and safe to ignore
               foreach(HtmlNode row in table.ChildNodes.Where(n => n.Name == "tr"))
               {
                  int cellCount = 0;
                  s.Append("|");
                  foreach(HtmlNode cell in row.ChildNodes.Where(n => n.Name == "td"))
                  {
                     s.Append(cell.InnerText.Trim());
                     s.Append("|");
                     cellCount++;
                  }
                  s.AppendLine();

                  if(isHeader)
                  {
                     s.Append("|");
                     for(int i = 0; i < cellCount; i++)
                     {
                        s.Append("-|");
                     }
                     s.AppendLine();
                     isHeader = false;
                  }
               }

               table.ParentNode.ReplaceChild(doc.CreateTextNode(s.ToString()), table);
            }
         }

         private string Decorate(string text, IReadOnlyCollection<string> decorations)
         {
            foreach(string dec in decorations)
            {
               text = dec + text + dec; // wrap on both sides, e.g. **text**
            }

            return text + Environment.NewLine; //append new line because it's in a span
         }
      }

      internal class PatternReplacer : IReplacer
      {
         public PatternReplacer(string pattern, string replacement)
         {
            Pattern = pattern;
            Replacement = replacement;
         }

         public string Pattern { get; }

         public string Replacement { get; }

         public string Replace(string html)
         {
            return new Regex(Pattern).Replace(html, Replacement);
         }
      }
   }
}
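
The comment above notes that cells containing line breaks would need extra processing. A minimal sketch of that step, using only the .NET base class library, might look like this; `CellText.ToSingleLine` is a hypothetical helper, and dropping blank lines inside a cell is an assumption made for this sketch.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the scheme above or of Html2Markdown.
public static class CellText
{
    // Collapses a multi-line cell into one line so it fits in a Markdown table row.
    public static string ToSingleLine(string cellText)
    {
        if (string.IsNullOrEmpty(cellText)) return string.Empty;

        var lines = cellText
            .Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0); // drop whitespace-only lines

        return string.Join("<br>", lines);
    }
}
```

In ProcessTables, this could stand in for the plain `cell.InnerText.Trim()` call when appending each cell.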
aloneguid commented 5 years ago

Just to demo it, original onenote page:

image

exported to markdown:

image