PlaidWeb / reblob

Python program and library for extracting a quoted blog reply
MIT License

Better engine than pandoc/gfm? #5

Open fluffy-critter opened 5 years ago

fluffy-critter commented 5 years ago

Pandoc's gfm backend produces markdown like:

> For folks who were following me on Patreon and don’t have an RSS
> reader, here are some alternate ways of following me:
> 
>   - All my stuff gets automatically posted to
>     [Twitter](http://beesbuzz.biz/twitter),
>     [Tumblr](http://beesbuzz.biz/tumblr), and
>     [Mastodon](http://beesbuzz.biz/mastodon), although that’s not
>     ideal because updates are really easy to miss on those places
>   - You can use [IFTTT](http://ifttt.com) or
>     [Blogtrottr](https://blogtrottr.com) to get posts delivered by
>     email (here’s [a tutorial on
>     IFTTT](https://www.chronicle.com/blogs/profhacker/send-an-rss-feed-to-your-email-account/50319))
>   - There’s also the `#site-updates` channel on [my
>     Discord](http://beesbuzz.biz/discord) (which is also a fun place
>     to hang out anyway)

which formats like

For folks who were following me on Patreon and don’t have an RSS reader, here are some alternate ways of following me:

  • All my stuff gets automatically posted to Twitter, Tumblr, and Mastodon, although that’s not ideal because updates are really easy to miss on those places
  • You can use IFTTT or Blogtrottr to get posts delivered by email (here’s a tutorial on IFTTT)
  • There’s also the #site-updates channel on my Discord (which is also a fun place to hang out anyway)

(from this entry).

html2text might be better, but that loses the ability to support other output formats. There might also be some better Pandoc configurations that could be used.

fluffy-critter commented 5 years ago

html2text's output isn't great either:

[fluffy](http://beesbuzz.biz/):
[Reblob!](http://beesbuzz.biz/blog/5385-Reblob):

> [Reblob!](http://publ.beesbuzz.biz/blog/179-Reblob):

>

>> It’s been a while since I’ve worked on IndieWeb stuff, but I finally got
around to releasing an _extremely preliminary_ version of
[reblob](http://publ.beesbuzz.biz/tools/1423-reblob), a little commandline
thingus to make this stuff easier. Eventually I’ll also have a server-based
version here, at least as an example.

>

> Of course this is the first entry I’ve written actually _using_ it. Lots of
rough edges but whatever!

which renders as:

fluffy: Reblob!:

Reblob!:

It’s been a while since I’ve worked on IndieWeb stuff, but I finally got around to releasing an extremely preliminary version of reblob, a little commandline thingus to make this stuff easier. Eventually I’ll also have a server-based version here, at least as an example.

Of course this is the first entry I’ve written actually using it. Lots of rough edges but whatever!

tarleb commented 5 years ago

Found this through your tweet. There might be a way to use one of pandoc's many customization options to fix this. E.g., you could try to remove soft line-breaks by using a pandoc filter:

function SoftBreak ()
  return pandoc.Space() -- replace soft linebreak with a space
end

Use it by running pandoc --lua-filter=path/to/that/filter-file.lua …, or check whether the --wrap=none option already does what you want. Does this help?
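Since reblob reaches pandoc through pypandoc, both suggestions would have to be passed along as extra command-line arguments. A rough sketch of what that might look like, assuming pypandoc is installed and pandoc is on the path; `pandoc_extra_args` and `html_to_gfm` are hypothetical helper names, not anything that exists in reblob:

```python
from typing import List, Optional


def pandoc_extra_args(wrap_none: bool = True,
                      lua_filter: Optional[str] = None) -> List[str]:
    """Build the extra arguments to hand to pandoc (hypothetical helper)."""
    args = []
    if wrap_none:
        # Ask pandoc not to hard-wrap its output
        args.append('--wrap=none')
    if lua_filter:
        # Path to a Lua filter file such as the SoftBreak one above
        args.append('--lua-filter=' + lua_filter)
    return args


def html_to_gfm(html: str, lua_filter: Optional[str] = None) -> str:
    """Convert an HTML fragment to GitHub-flavored Markdown via pypandoc."""
    import pypandoc  # imported lazily so the helper above works without it

    return pypandoc.convert_text(
        html, 'gfm', format='html',
        extra_args=pandoc_extra_args(lua_filter=lua_filter))
```

Whether the result is acceptable for a given entry would still need checking by hand; this only shows where the options would go.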

fluffy-critter commented 5 years ago

@tarleb Not particularly; the way pandoc works through Pypandoc makes that incredibly unwieldy. There's also no need to do that in a Pandoc filter: see the branch https://github.com/PlaidWeb/reblob/tree/feature/5-trim-end-whitespace for a simple fix on the Python side.
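A Python-side cleanup of this kind can be very small. The following is only a guess at the general idea based on the branch name, not the code actually on that branch; `trim_trailing_whitespace` is a hypothetical name:

```python
def trim_trailing_whitespace(markdown: str) -> str:
    """Strip trailing whitespace from every line of a Markdown string.

    Caveat: two trailing spaces are a hard line break in Markdown, so
    this also removes any such breaks.
    """
    return '\n'.join(line.rstrip() for line in markdown.splitlines())
```

For example, a quoted blank line emitted as "> " (with a trailing space) becomes a bare ">".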

But even with that there's a lot of stuff pypandoc does poorly that can't be easily addressed by setting markdown plugins either. The Mastodon version of the thread goes into more about that.

fluffy-critter commented 5 years ago

There are also a bunch of other reasons I want to get off pandoc: the Python bindings make a lot of assumptions about the environment that won't work for one of my intended future use cases, and it's just, like, not very well-controlled in general.

I can also think of a fairly straightforward way to convert HTML to Markdown in a way that will also allow me to put in Publ-markdown extensions. That said, I was hoping reblob would also be able to support things like reStructuredText for folks who use that on their blog engine.
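To give a sense of how small a direct HTML-to-Markdown converter can be, here is a minimal sketch built on the standard library's `html.parser`. It is not the approach described above (which isn't spelled out in this thread), and `MarkdownSketch` and `html_to_markdown` are made-up names. It only handles paragraphs, flat list items, links, and basic inline emphasis, and ignores nesting and blockquotes:

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Convert a small HTML subset (p, li, a, em, strong, code) to Markdown."""

    INLINE = {'em': '_', 'i': '_', 'strong': '**', 'b': '**', 'code': '`'}

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished paragraphs and list items
        self.buf = []     # text fragments of the block being built
        self.hrefs = []   # link targets for currently open <a> tags
        self.prefix = ''  # '- ' while building a list item

    def flush(self):
        # Collapse all runs of whitespace so the output has no hard wraps
        text = ' '.join(''.join(self.buf).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buf = []
        self.prefix = ''

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'li'):
            self.flush()
            self.prefix = '- ' if tag == 'li' else ''
        elif tag == 'a':
            self.hrefs.append(dict(attrs).get('href') or '')
            self.buf.append('[')
        elif tag in self.INLINE:
            self.buf.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in ('p', 'li'):
            self.flush()
        elif tag == 'a' and self.hrefs:
            self.buf.append('](' + self.hrefs.pop() + ')')
        elif tag in self.INLINE:
            self.buf.append(self.INLINE[tag])

    def handle_data(self, data):
        self.buf.append(data)


def html_to_markdown(html):
    """Return Markdown for an HTML fragment, blocks separated by blank lines."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text outside a block element
    return '\n\n'.join(parser.blocks)
```

Because every tag passes through `handle_starttag` and `handle_endtag`, engine-specific output could in principle be added there, and a second handler class could target another format; neither is attempted here.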