TryGhost / Ghost

Independent technology for modern publishing, memberships, subscriptions and newsletters.
https://ghost.org
MIT License

Improvements to the {{excerpt}} helper #5060

Closed ErisDS closed 7 years ago

ErisDS commented 9 years ago

It's unquestionable, given the vast number of issues, PRs, forum posts, support requests and other mentions of our {{excerpt}} helper, that it leaves a lot to be desired.

Yet, as demonstrated by the wide range of different ideas on how it should be improved, it's hard to find consensus on what 'better' actually looks like. Having looked through the issues, discussions, PRs and what themes are currently using, there are two broad categories into which the concerns fall: the first is improving the excerpt that Ghost generates from the content, and the second is adding features for custom excerpts.

Custom excerpts are a niche requirement, and we want to focus our efforts on making apps a possibility so that it is possible to add an excerpt field, or a subtitle field, or a standfirst field or whatever custom field suits your use-case, rather than adding these things to core. Therefore this issue exists purely to address the former - improving the generated excerpts.

There are two key things which make the current {{excerpt}} helper's output quite undesirable:

  1. It cuts off mid-sentence (ugly)
  2. It strips all formatting (confusing)

The formatting problem has meant many themes and Ghost users are using {{content}} instead of the {{excerpt}} helper. This is not ideal: it outputs images and other media that don't really make sense in an excerpt, and it creates the need for some sort of appending feature to make it possible to display read more links. It also still doesn't solve the cut-off problem.

To solve the cut off problem, we want to introduce {{excerpt paragraphs="X"}}, which will return the first X text paragraphs from the content.

To solve the formatting problem, it makes sense to change the helper so that it leaves valuable links and text formatting in place, rather than stripping all HTML.

The combination of these two things working in tandem should lead to a better excerpt. The media, script and all other non-formatting tags should be stripped from the content first, and then from the remaining text content we can return as many paragraphs as the {{excerpt}} helper requires.

Moving forward, I think the {{excerpt}} helper's default should also be changed from words="50" to paragraphs="1".

When processing HTML in this way, it's important to do it in such a way that bad HTML doesn't trip up the code. At the moment we make heavy use of the very smart downsize library for truncating HTML. However, the excerpt helper does its own brute-force stripping of HTML. Therefore this is likely to need a bit of a rethink.

The following is a list of elements that will be permitted in excerpts:

a, abbr, b, bdi, bdo, blockquote, br, cite, code, data, dd, del, dfn, dl, dt, em, i, ins, kbd, li, mark, ol, p, pre, q, rp, rt, rtc, ruby, s, samp, small, span, strong, sub, sup, time, u, ul, var, wbr

This list was generated from https://developer.mozilla.org/en/docs/Web/HTML/Element, and includes all block and inline text formatting elements. Once all other elements are removed, the first X paragraphs should then be returned, not including any empty paragraphs.
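The two steps described here (filter by allowlist, then take the first X non-empty paragraphs) could be sketched roughly as below. This is only an illustration: `stripDisallowedTags` and `firstParagraphs` are hypothetical names, not Ghost or downsize functions, and the regex-based matching assumes well-formed HTML, which is exactly what a real implementation should not rely on. It also only drops the tags themselves, so the content of elements such as `<script>` would need separate handling.

```javascript
// Allowlist copied from the issue text above.
const ALLOWED = new Set(
  ("a abbr b bdi bdo blockquote br cite code data dd del dfn dl dt em i ins kbd li " +
    "mark ol p pre q rp rt rtc ruby s samp small span strong sub sup time u ul var wbr"
  ).split(" ")
);

// Hypothetical helper: remove every tag whose name is not on the allowlist,
// keeping the text between the tags. Regex-based for brevity only.
function stripDisallowedTags(html) {
  return html.replace(/<\/?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g, (tag, name) =>
    ALLOWED.has(name.toLowerCase()) ? tag : ""
  );
}

// Hypothetical helper: return the first `count` <p> blocks that still
// contain text once disallowed tags have been removed.
function firstParagraphs(html, count) {
  const paragraphs =
    stripDisallowedTags(html).match(/<p\b[^>]*>[\s\S]*?<\/p>/gi) || [];
  return paragraphs
    .filter((p) => p.replace(/<[^>]*>/g, "").trim() !== "") // skip empty paragraphs
    .slice(0, count)
    .join("");
}

const post =
  '<p><img src="a.png"></p><p>Hello <strong>world</strong></p>' +
  "<div><p>Second</p></div><p>Third</p>";
console.log(firstParagraphs(post, 2));
// → <p>Hello <strong>world</strong></p><p>Second</p>
```

The image-only paragraph is emptied by the allowlist pass and therefore does not count towards the paragraph quota.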

In the long term, the excerpt tag allowlist will become extensible via a filter, so that extensions to the editor can also declare additional elements that should appear in excerpts (I'm thinking of things like MathML here).

In the short term, the next step here is to review the downsize library and determine whether these features can be added or whether this needs a bit of a re-think.

novaugust commented 9 years ago

> not including any empty paragraphs

That means paragraphs that were stripped empty, i.e. <p><img src="..."></p> -(remove disallowed tags)-> <p></p> wouldn't count towards the paragraph quota

^ meant as a question, but not well worded as such ;)

ErisDS commented 9 years ago

Exactly that :+1:

adam-zethraeus commented 9 years ago

It's cool to see this task become more formalized. I'm not going to try to be involved in any implementation, but I'm familiar with downsize and worked on the earlier attempt at this, and I'd be happy to talk to anyone who picks this up.

Some food for thought that I think could be useful:

ErisDS commented 9 years ago

> what type of 'bad html' could make it to the excerpt helper post markdown encoding? is there any direction w/r/t this aside from making the system robust?

I mentioned 'bad html' purely as an indicator that replacing downsize with a regex would be undesirable; I'd like to keep the existing functionality we have in terms of handling that, at least a little bit.

> does anything other than a <p> count as a paragraph? i.e. does the markdown parser output <blockquote> or <code> blocks that are not <p> wrapped.

This is a really good question, and one that needs a little bit of experimentation. <blockquote> definitely counts as a paragraph - showdown handles this by putting a <p> inside it, which is somewhat odd. I'd suggest a simple rule of thumb: any of the permitted excerpt tags (which include <blockquote> and <code>) should be considered a paragraph if it exists at the paragraph level.
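One way to read that rule of thumb is "every top-level element counts as one paragraph". A minimal sketch under that assumption follows; `topLevelBlocks` and `firstBlocks` are made-up names, and the depth counting only works for well-formed markup, so it is not a substitute for the tolerant parsing downsize does.

```javascript
// Void elements have no closing tag, so they must not change nesting depth.
// Only the void elements likely to appear in post HTML are listed here.
const VOID_ELEMENTS = new Set(["br", "wbr", "img", "hr"]);

// Hypothetical helper: split an HTML fragment into its top-level elements
// by tracking how deeply nested the current tag is.
function topLevelBlocks(html) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
  const blocks = [];
  let depth = 0;
  let start = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    if (VOID_ELEMENTS.has(name)) continue;
    if (!isClosing) {
      if (depth === 0) start = match.index; // a new top-level element begins
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      if (depth === 0) blocks.push(html.slice(start, tagPattern.lastIndex));
    }
  }
  return blocks;
}

// Hypothetical helper: the first `count` top-level blocks, joined back together.
function firstBlocks(html, count) {
  return topLevelBlocks(html).slice(0, count).join("");
}

const rendered =
  "<blockquote><p>Quoted</p></blockquote><p>One<br>two</p>" +
  "<pre><code>x = 1</code></pre><p>Three</p>";
console.log(firstBlocks(rendered, 2));
// → <blockquote><p>Quoted</p></blockquote><p>One<br>two</p>
```

Here the <blockquote> with its inner <p> counts once, rather than being split or counted twice.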

> there's a good chance that downsize could be extended to support what's described here, or would make a good starting point, but if this task only involves counting <p> tags, it might just make the task heavier.

I think it definitely makes a really great starting point, and I really hope we can continue to extend it to support more advanced cases. I think the task at hand is sufficiently more than counting <p> tags as to make it worthwhile?

galori commented 9 years ago

@ErisDS Is there any alternative, recommended way to control or curate what is displayed on an index page? Whether you call it an excerpt, summary or a standfirst.

For example, is there a way to add a standfirst field that is populated separately and is displayed instead of an excerpt?

FWIW from my experience, in many larger organizations (publications), there is a need for tighter curation and control over appearance. Having a post break at an automated point to provide a preview would not fly, and typically the desire is either to place the "break" manually or, to take it a step further, to write the copy separately. For larger organizations, this might even be a separate editor who writes the preview paragraph.

In my case, I'm migrating to Ghost for my personal blog and am planning on writing both extensive technical posts that appeal only to developers and opinions about the tech industry which appeal to a larger audience. I would prefer to break the more technical posts before any of the tech speak or code starts and just provide a small intro, to allow any non-interested reader to skip over it and to prevent it from alienating non-techies.

Even though I'm just checking out Ghost for my personal blog, I have worked in tech / development for large content publishers for many years and this seems to be pretty consistent.

(Perhaps the decision is that Ghost is tailored to the more individual blogger, amateur publications and small orgs? But from everything I'm seeing about Ghost, that's not a formal decision.)

ErisDS commented 9 years ago

Hi @galori, appreciate your feedback. As I have attempted to explain in the issue here, the problem with the concept of custom excerpts is that everyone wants or means something different when they think about how it might work. That makes it difficult to implement a 'one size fits all' solution. We have a wishlist where feature requests can be posted and voted on - both a custom field and a more-tag style cutoff have been proposed and neither has much traction.

It seems you've already opted to patch Ghost with the more-tag style option. Other options are to add an HTML block with a class to the top of your post and use JS or CSS to show/hide that part of your content as and when you like, or to tag your technical posts with a particular tag and show a much shorter excerpt for posts with that tag.

mixonic commented 8 years ago

For those who want manual control over excerpts, the wishlist item can be found here:

dubeg commented 8 years ago

@ErisDS Shouldn't it be a list of tags to strip, instead of a list of tags to allow? Are there tags other than img and hr that you'd like to see removed?

Even then, I still think it is weird for the theme to dictate what an excerpt should be made of. Like "x" number of paragraphs, or "y" number of words. I feel that writers should be able to decide what goes in it and what doesn't. Be it by specifying a special tag in the post after which the content is stripped, or perhaps by adopting a rule like "everything before the first heading goes in the excerpt".

I agree that most of the time excerpts are only the first paragraph of the post, but specifying "1" paragraphs at the theme level is not flexible for the end user.

Edit: I didn't read your first post well enough; I see that eventually you'd like to add 3rd-party support for manipulating excerpts. But we could still add additional options to downsize, like the possibility to truncate on reaching a specific tag.

Right now, I've been able to implement paragraph counting, but I'm not sure how to handle stripping tags. When stripping inline elements, we probably want to keep the inner text, and when reaching block elements, we probably want to strip the inner text as well. Right?

ErisDS commented 8 years ago

This issue exists purely as a collection of the information I gathered on how our existing excerpt feature should be improved. Custom excerpts are a completely separate feature & you can vote for them on the wishlist.

As for whether or not we should allowlist or blocklist tags, the issue specifies a clear allowlist which includes only tags that would produce sensible markup inside a small block area, which is what an excerpt is. There are many, many tags that do not make it into the list, including everything to do with tables, forms and structure.

Edit, just saw the added question:

> When stripping inline elements, we probably want to keep the inner text, and when reaching block elements, we probably want to strip the inner text as well. Right?

I believe only the tags should be stripped, not the content, with the exception of tags which indicate the content is not relevant, which is anything in the multimedia, embeds & scripting sections here + <iframe> and <style>.
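To illustrate the distinction, here is a rough sketch that drops some elements together with their content and unwraps everything else that is not allowed. The `DROP_WITH_CONTENT` list is an assumed, partial reading of the multimedia, embed and scripting sections plus <iframe> and <style>, and `stripForExcerpt` is a hypothetical name. It uses regexes purely for brevity and does not cope with bad HTML.

```javascript
// Elements whose content is irrelevant to an excerpt and is removed entirely.
// Assumed, partial list; void elements such as <img> need no entry because
// they have no content and are removed by the unwrap pass below.
const DROP_WITH_CONTENT = [
  "script", "style", "iframe", "video", "audio", "canvas", "object", "noscript",
];

// Hypothetical helper: `allowed` is a Set of lowercase tag names to keep.
function stripForExcerpt(html, allowed) {
  const dropPattern = new RegExp(
    "<(" + DROP_WITH_CONTENT.join("|") + ")\\b[^>]*>[\\s\\S]*?</\\1\\s*>",
    "gi"
  );
  return html
    // Pass 1: remove irrelevant elements including everything inside them.
    .replace(dropPattern, "")
    // Pass 2: remove any other disallowed tag but keep its inner text.
    .replace(/<\/?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g, (tag, name) =>
      allowed.has(name.toLowerCase()) ? tag : ""
    );
}

const input =
  "<div><p>Keep <em>this</em></p><script>alert(1)</script>" +
  "<table><tr><td>cell</td></tr></table></div>";
console.log(stripForExcerpt(input, new Set(["p", "em"])));
// → <p>Keep <em>this</em></p>cell
```

The script body disappears completely, while the table markup is removed but its text survives.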

devsnek commented 8 years ago

In PR 6706 I am addressing as much of the above as I can.

devsnek commented 8 years ago

@ErisDS I have just added the HTML stripping feature that you requested in PR 6706 and it works on my personal blog. Travis is building it now.

Also, I think that we should definitely keep downsize. It works well, and it keeps the HTML safe. Plus, it already has paragraph rounding. It could be forked and edited to include sentence rounding.

I think the biggest concern with writing something else from scratch is load time. Most Ghost themes heavily use {{excerpt}} or {{content}}, and putting bad loops or something like that in would seriously increase load time. I'm going to fork downsize (guscaplan/downsize) and try to implement paragraphs and sentence rounding. I should have something by Monday.

devsnek commented 8 years ago

Hello! All of the things that are mentioned here (and more!!!) are in my latest pull request. This PR adds HTML splitting, append options, and sentence rounding!!!! Right now sentence rounding only works if you specify the length in characters, not words, e.g. {{excerpt characters="140" round="true"}} will work but {{excerpt words="50" sentenceRounding="true"}} will just skip sentence rounding. Also, {{excerpt sentences="5"}} is a thing.
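The PR's actual rounding logic isn't shown in this thread, so the following is only a guess at what "sentence rounding" over a character limit could mean: cut at the last sentence end that fits within the limit, and fall back to a hard cut when no full sentence fits. `truncateToSentence` is a hypothetical name, it works on plain text rather than HTML, and the "sentence end" rule (a `.`, `!` or `?` followed by whitespace or the end of the text) is a simplifying assumption.

```javascript
// Hypothetical helper: truncate plain text to at most `maxChars` characters,
// rounding down to the last complete sentence that fits.
function truncateToSentence(text, maxChars) {
  if (text.length <= maxChars) return text;
  let cut = -1;
  // A terminator only counts when followed by whitespace or the end of the
  // full text, so "3.14" is not treated as a sentence boundary.
  for (const m of text.matchAll(/[.!?](?=\s|$)/g)) {
    if (m.index + 1 > maxChars) break;
    cut = m.index + 1;
  }
  // No complete sentence fits: fall back to a plain character cut.
  return cut === -1 ? text.slice(0, maxChars) : text.slice(0, cut);
}

const sample = "First sentence. Second one is longer! Third?";
console.log(truncateToSentence(sample, 20)); // → First sentence.
console.log(truncateToSentence(sample, 5));  // → First
```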

dubeg commented 8 years ago

@GusCaplan I think the biggest feature to be desired here is truncation by number of paragraphs. I tried starting with the original downsize, but I found the changes to be non-trivial and the code wasn't very clean in any case. So, in my attempt, I practically kept nothing from the original codebase, but if you want, you can take a look at what I did and perhaps you'll find a better way to do the same thing.

eexit commented 8 years ago

Hi there,

I got an idea but it would require some extra work and probably a lot of tests:

  1. Add a new optional custom Markdown syntax to generate a <summary> HTML tag
  2. The excerpt helper would pick up this tag (why not use cheerio?) wherever it is and whatever its content length is
  3. Also use cheerio to remove unwanted tags (<style>, <script>, etc.)
  4. Return the sanitized <summary> as is

Doing this would remove any magic introduced by the theme implementation: when users write a post, they deliberately choose what goes in the <summary>, and they are aware of which tags get stripped (from either the Ghost docs or the MD help modal).

Cutting at the first, second or whichever paragraph sounds good, but let's be honest here: it's very error-prone, and it will be a living hell for the code maintainers as much as for users who get unexpected results.

filipecatraia commented 7 years ago

@eexit:

> The HTML <summary> element is used as a summary, caption, or legend for the content of a <details> element.

shaps80 commented 7 years ago

I'm currently updating my website to use Ghost and I've built a small Ghost-App that allows me to specify the excerpt with greater control per post:

This paragraph will **NOT** appear in the excerpt.

<extract>
The `extract` tags will be removed from the rendered HTML and a stripped version of this block will be used as the post excerpt.
</extract>

The rest of my post would go here and **not** appear in the excerpt.

Then in my post.hbs for example I can simply use {{extract}} to access the stripped excerpt.

It also supports a single <extract /> tag in which case it just behaves more like a tag.

I used the extract tag to ensure I don't mess with any existing HTML tags, thus not breaking semantics and allowing me to remove it from the rendered HTML before presentation.

In the event a post has not used these tags, a fallback based on the current {{excerpt}} implementation will be used. Defaulting to words=26 but easily configurable via the app.
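For readers curious how such an approach might hang together, here is a minimal sketch of the behaviour described above. It is not the ghost-app-extract source: `splitExtract` is a hypothetical function, it operates on rendered HTML with simple regexes, it only handles the paired <extract>...</extract> form (not the single <extract /> tag), and the word-based fallback is a simplification of the existing {{excerpt}} behaviour.

```javascript
// Hypothetical sketch: pull the excerpt out of an <extract> block, remove the
// wrapper tags from the content, and fall back to the first N words otherwise.
function splitExtract(html, fallbackWords = 26) {
  const pattern = /<extract>([\s\S]*?)<\/extract>/i;
  const stripTags = (s) =>
    s.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();

  const match = html.match(pattern);
  if (match) {
    return {
      excerpt: stripTags(match[1]),
      // Only the <extract> wrapper is removed; its inner HTML stays in the post.
      content: html.replace(pattern, "$1"),
    };
  }

  // No <extract> block: use the first `fallbackWords` words of the plain text.
  const words = stripTags(html).split(" ").filter(Boolean);
  return { excerpt: words.slice(0, fallbackWords).join(" "), content: html };
}

const result = splitExtract(
  "<p>Intro.</p><extract><p>The <em>summary</em> text.</p></extract><p>Rest.</p>"
);
console.log(result.excerpt); // → The summary text.
console.log(result.content);
// → <p>Intro.</p><p>The <em>summary</em> text.</p><p>Rest.</p>
```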

I'm not a Node, JavaScript or Web developer at all. But it satisfies my needs. If anyone else is interested in using this in the meantime, feel free to use it.

https://github.com/shaps80/ghost-app-extract

ErisDS commented 7 years ago

Closing this issue in favour of the new custom excerpt #8793. We can revisit improving the automatic excerpt some other time if there is more demand for it.