backdrop / backdrop-issues

Issue tracker for Backdrop core.
144 stars 40 forks source link

[D8] Trim summary on word boundary #599

Closed ghost closed 1 year ago

ghost commented 9 years ago

As per https://github.com/backdrop/backdrop-issues/issues/309, can we also add in this functionality: Field Word Boundary? This module allows trimming to word boundaries, not just sentences or paragraphs. Seems like a logical addition to core...

(I think the problem this solves is most often seen when the trim length is quite short (e.g. 100 characters) and the first sentence is longer than this - this means that nothing gets output).

ghost commented 5 years ago

Made a PR that copies the essence of Field Word Boundary into core: https://github.com/backdrop/backdrop/pull/2708

stpaultim commented 4 years ago

@BWPanda - Please correct me if I am wrong, but it seems like the right solution, but to the wrong problem.

As far as I can see, the summary or trimmed option on a display mode currently only allows the user to select a specific number of characters. It breaks at that exact number of characters, even if it's in the middle of a word. It does not provide an ellipse. It does not appear to give any option to break at sentences or paragraphs.

So, I think we need this, because breaking in the middle of a word isn't very helpful.

This text "Trim to the nearest word, rather than the nearest paragraph, line break, or sentence" is confusing to the user (me), because I'm not seeing the supposed ability to break at nearest paragraph, line break, or sentence. I'm only seeing the ability to break at the exact character.

Post__Teaser_display___PR_2708_backdrop_backdrop_testing_site

In my tests, the PR did break at the end of a word (as promised) and provide the option for an ellipse.

I would say "yes" to the feature, just fix the wording (unless I'm misunderstanding something). It makes me wonder if we should provide the ability to trim at character, word, or sentence?

Is that a follow-up issue?

I think leaving the sentence option to contrib would be fine, but we should have some other option than just trimming at a specific character in core.

klonos commented 4 years ago

I'm also in favor of implementing this feature in core 👍 ...this basically uses the already existing truncate_utf8() function in core, and exposes its settings in the UI.

ghost commented 4 years ago

It breaks at that exact number of characters, even if it's in the middle of a word.

@stpaultim Not always. Try this:

This is because it can't find the end of the paragraph or a line break in the first 120 characters. The next thing it looks for is a full stop, which it finds at the beginning, so that's where it trims to.

Now turn on word boundary and see how it trims much closer to the specified 120 characters.

klonos commented 4 years ago

FTR, our Drupal brethren have this issue to add word boundary trimming in D8 core (also tagged for backporting to D7): https://www.drupal.org/project/drupal/issues/1482178

Last comment at the time of writing this, is by @wimleers

We should simply remove the automatic trimming and summary features from Drupal 8. They are broken by design.

This issue clearly demonstrates how trimming is broken for technical reasons. But it's also broken at a more fundamental level: trimming can never be smart enough (until we have strong AI in Drupal 8 core, which will never ever make sense).

klonos commented 4 years ago

This is because it can't find the end of the paragraph or a line break in the first 120 characters. The next thing it looks for is a full stop, which it finds at the beginning, so that's where it trims to.

There's an issue for that as well, against D8 core: https://www.drupal.org/project/drupal/issues/1620104

stpaultim commented 4 years ago

Ok, I misunderstood what was happening, because I started testing with really long sentences and expected it to trim the entire sentence.

But, as you say, it seems to be currently set up to trim at the nearest paragraph (assuming at least one paragraph), trim at first line break (if less than one paragraph), or trim at sentence (if no line breaks available), or to trim at exact character count if less than one sentence.

I think having word break as an option makes sense.


1 - I believe that there is a problem with the current PR. If I turn on "Trim on word boundary," I don't think it will ever display more than one paragraph. I tested several times with several short paragraphs (one sentence each) and a large character limit and still never got more than the first paragraph/sentence.


2 - As usual, I think we need to improve help text. I think without improved help text above, the word boundary option is harder to understand.

WAS: If the summary is not set, the trimmed Body field will be shorter than this character limit.

PROPOSED: If the summary is not set, the Body field will be trimmed to the nearest character, full sentence, or full paragraph (not to exceed the character limit).

Post__Teaser_display___PR_2708_backdrop_backdrop_testing_site


FINAL NOTE: I believe that there are other/better ways of improving this feature, but we would run into backwards compatibility problems. The idea behind this PR is good, because it adds another useful option without effecting the current functionality for those who don't select it.

ghost commented 4 years ago

We should simply remove the automatic trimming and summary features from Drupal 8. They are broken by design.

I don't agree with this for Backdrop. I don't see any issues with this, just ways we can improve it.

If I turn on "Trim on word boundary," I don't think it will ever display more than one paragraph.

It you turn on 'word boundary', then you're bypassing the whole 'look for a paragraph first, then look for line breaks, etc.'. It's one or the other currently. I was thinking it'd be good to merge the two systems so that it tries to break at a paragraph, then line break, then word boundary, then full stop, etc., but it's hard to work out which should take priority... E.g. is it better to break at word level near the character limit, or after a paragraph away from the word limit? I think this is what D8 mentioned about needing an AI.

But I don't think we need to get that involved. I think just let users choose whether to try to find the best place to trim (the current system) or choose to always trim to word boundary (this PR). As for fixing the documentation around the current system, I think that deserves a seperate issue.

stpaultim commented 4 years ago

It you turn on 'word boundary', then you're bypassing the whole 'look for a paragraph first, then look for line breaks, etc.'. It's one or the other currently.

I think you misunderstood my point. In my testing, once I turned on "trim to word boundary" it always cut off after one paragraph, even if the limit was 500 characters and the first paragraph is only one short sentence. I don't think that was your intention. Even if I had three short paragraphs, once sentence each, it still cut off after the first one. (EDIT: in further testing, it does not do this if the total number of characters is less than the limit, but it does in the total is greater than the limit).

See how the current node: "/posts/test" is behaving (in PR sandbox) with "trim to word boundary" turned on. It's set to show 600+ characters, but only one very short sentence is showing up.

ghost commented 4 years ago

In my testing, once I turned on "trim to word boundary" it always cut off after one paragraph, even if the limit was 500 characters and the first paragraph is only one short sentence.

Hmm, interesting. Will investigate...

ghost commented 4 years ago

Ok, so it seems that either we can't use truncate_utf8() here, or else it has a bug that needs fixing...

The problem is that truncate_utf8() has regex that basically looks like this: (.{1,599}). This looks for 1-599 ({1,599} - 599 being the 600 character limit I was using minus the one-character ellipsis) instances of any character (.). However the . looks for any character except newlines. So the trimmed text will never be longer than the position of the first linebreak.

In your case @stpaultim, you had a linebreak after the first sentence, so that's where the trimming stopped.

Not sure if this is a bug with truncate_utf8() or just how it's supposed to work. If the latter, we need a different solution here...

stpaultim commented 2 years ago

Someone is reporting a problem in the forum that I assume is related to this issue: https://forum.backdropcms.org/forum/how-enlarge-site-teaserintroduction

ghost commented 2 years ago

Thanks for bringing my attention back to this @stpaultim. I've updated the PR to fix the issue with trimming at the first line break even when the character limit is longer than that. I did so by adding a new optional (for backwards compatibility) parameter to truncate_utf8() that adds the dotall modifier if set, thereby allowing line breaks.

ghost commented 2 years ago

Also, I'm happy to advocate for this, so adding the milestone candidate tag.

klonos commented 2 years ago

Thanks @stpaultim and @BWPanda 🙏🏼 ...I think that this would be a very nice addition to core. Let's try to get it into 1.21.0.

stpaultim commented 2 years ago

@BWPanda - Thanks for working on this issue. I tried to break the main functionality, but I couldn't. I would mark it "works for me" except for the following two points:

1) It is breaking in the middle of words. It's not clear to me if that is your intention, but it is suggested that it might be difficult to break on words and so maybe breaking at specific character is fine as along as the "ellipses" option is there. However, I don't think we should promise to break at the nearest word, if that is not what is happening. Should it be"Trim at specific character" or "Precise trim" instead?

2) I think that this is probably acceptable, but I want to be sure that we are aware that it is counting html as well. If I bold a word and then refresh the page, I will see fewer characters before the ellipses because of the extra charaters for the html tags that have been added.

I don't think that #2 should block this issue, but I wanted to point it out in case @BWPanda had something different in mind.

ghost commented 2 years ago

Thanks for the feedback/review @stpaultim!

  1. You found a bug! I had added the new $include_newlines parameter into the wrong place in the function call (it was missing the value for $min_wordsafe_length), so once I fixed that in my PR it now seems to trim at word boundaries correctly. Can you confirm that now?
  2. Yep, there's a separate issue for that (which is actually the issue that spawned this one): https://github.com/backdrop/backdrop-issues/issues/309 I agree that that functionality is needed, however I think it's another can of worms entirely...

I also added documentation for the new parameter as @klonos pointed out. PR updated.

stpaultim commented 2 years ago

I'll mark this as "Works for Me". I was unable to break it beyond the known limitations that seem to be outside the scope of this issue. This looks like a nice improvement that certainly would make life easier for this user: https://forum.backdropcms.org/forum/how-enlarge-size-teaserintroduction

Thanks @BWPanda !

herbdool commented 2 years ago

Code looks fine. Though I think a test should be added for this variation here: https://github.com/backdrop/backdrop/blob/1.x/core/modules/field/modules/text/tests/text.test

ghost commented 2 years ago

PR updated with tests. Please let me know if I did (or didn't do) it right...

stpaultim commented 2 years ago

I'm removing the previous "Works for Me" - since "Needs Testing" has been added.

stpaultim commented 2 years ago

@BWPanda - Sorry, but I accidentally stumbled across a weird glitch.

The option to add an elllipse does not work if the break point falls on a link.

Steps to recreate:

Here is some sample text (with html markup). Set the word trim to 600 or 625 or 635 with ellipse and there is no ellipse, increase the word count to 645 and the ellipse will appear on the words following the link, as expected.

<p>Zebras&nbsp;are primarily grazers and can subsist on lower-quality vegetation. They are preyed on mainly by lions and typically flee when threatened but also bite and kick. Zebra species differ in social behaviour, with plains and mountain zebra living in stable harems consisting of an adult male or stallion, several adult females or mares, and their young or foals; while Grévy's zebra live alone or in loosely associated herds. In harem-holding species, adult females mate only with their harem stallion.&nbsp;</p>

<p>Zebras are primarily <a href="https://en.wikipedia.org/wiki/Grazing_(behaviour)" title="Grazing (behaviour)">grazers</a>&nbsp;and can subsist on lower-quality vegetation. They are preyed on mainly by&nbsp;<a href="https://en.wikipedia.org/wiki/Lions" title="Lions">lions</a>&nbsp;and typically flee when threatened but also bite and kick. Zebra species differ in&nbsp;<a href="https://en.wikipedia.org/wiki/Sociality" title="Sociality">social behaviour</a>, with plains and mountain zebra living in stable&nbsp;<a href="https://en.wikipedia.org/wiki/Harem_(zoology)" title="Harem (zoology)">harems</a>&nbsp;consisting of an adult male or&nbsp;<a href="https://en.wikipedia.org/wiki/Stallion" title="Stallion">stallion</a>, several adult females or&nbsp;<a href="https://en.wikipedia.org/wiki/Mares" title="Mares">mares</a>, and their young or&nbsp;<a href="https://en.wikipedia.org/wiki/Foals" title="Foals">foals</a>; while Grévy's zebra live alone or in loosely associated herds. In harem-holding species, adult females mate only with their harem stallion, while male Grévy's zebras establish&nbsp;<a href="https://en.wikipedia.org/wiki/Territory_(animal)" title="Territory (animal)">territories</a>&nbsp;which attract females and the species is&nbsp;<a href="https://en.wikipedia.org/wiki/Promiscuity" title="Promiscuity">promiscuous</a>. Zebras communicate with various vocalisations, body postures and facial expressions.&nbsp;<a href="https://en.wikipedia.org/wiki/Social_grooming" title="Social grooming">Social grooming</a>&nbsp;strengthens social bonds in plains and mountain zebras.</p>

I stumbled across this by randomly cutting and pasting some wikipedia text into the body field, which contains a lot of links. This is probably a fringe case, so I don't know if it's a serious problem or not?

I left this test node in the sandbox to experiment with.

Other than this problem, it seems to be working.

ghost commented 2 years ago

Sorry, but I accidentally stumbled across a weird glitch.

Then you should watch where you're walking... 😜

Seriously though, thanks for finding this. I'll take a look.

ghost commented 2 years ago

Hmm, weird. I can't reproduce this in the first paragraph:

image

So maybe it only happens in the second paragraph...? In which case maybe it's something to do with the $include_newlines thing I added...

stpaultim commented 2 years ago

@BWPanda - not sure if you saying that you are unable to recreate it.

I have an example set up in the sandbox. There should be an ellipse right after "primary":

image

ghost commented 2 years ago

@stpaultim Yep, but I can only reproduce that behaviour if the link/ellipsis is in the second paragraph. Can you reproduce this without a paragraph break?

stpaultim commented 2 years ago

Yes, I have reproduced it with one paragraph. It's all set up in the sandbox.

image

image

ghost commented 2 years ago

Thanks @stpaultim, I see what you mean.

I'm going to tentatively say this isn't a bug with this PR/change, but an existing problem... For example, try running the following code on a regular Backdrop site from /admin/devel/php:

$string = '<p>Zebras are primarily grazers and can subsist on lower-quality vegetation. They are preyed on mainly by lions and typically flee when threatened but also bite and kick. Zebra species differ in social behaviour, with plains and mountain zebra living in stable harems consisting of an adult male or stallion, several adult females or mares, and their young or foals; while Grévy`s zebra live alone or in loosely associated herds. In harem-holding species, adult females mate only with their harem stallion. Zebras are primarily <a href="https://en.wikipedia.org/wiki/Grazing_(behaviour)" title="Grazing (behaviour)">grazers</a> and can subsist on lower-quality vegetation.</p>';

dpm(_filter_htmlcorrector(truncate_utf8($string, 600, TRUE, TRUE, 1)));

You'll get the same problem. What happens is that truncate_utf8() truncates the text at the appropriate spot (in this case it's in the middle of an opening <a> tag), then adds the ellipsis. Then _filter_htmlcorrector() runs and tries to fix the odd HTML (it finishes the opening <a> tag and adds a closing <a> tag). Since there's no text between the opening and closing <a> tags, the link doesn't appear. And the ellipsis in somewhere in the <a> tag's HTML, so it doesn't show either.

I think fixing this behaviour requires a separate issue...

stpaultim commented 2 years ago

"I think fixing this behaviour requires a separate issue..."

Sounds good. If this is the case, then "Works For Me"

I opened this follow-up issue: https://github.com/backdrop/backdrop-issues/issues/5406

ghost commented 2 years ago

I'm having second thoughts about this approach... I'm starting to work on https://github.com/backdrop/backdrop-issues/issues/309, and I feel like both of these issues are going to have very similar solutions. Such that maybe it's better to combine them... Indeed, Drupal's Field Word Boundary module (on which this issue/PR is based) is now deprecated in favour of Filed HTML Trim.

So don't review/commit yet. I'll see how I feel about this once I have a PR for https://github.com/backdrop/backdrop-issues/issues/309 ready.

klonos commented 2 years ago

@BWPanda during today's dev meeting, @jenlampton mentioned that she's been having better results with https://github.com/backdrop-contrib/smart_trim, so it might be worth checking the implementation they are using.

ghost commented 2 years ago

The way Smart Trim strips HTML markup before trimming the content (https://github.com/backdrop-contrib/smart_trim/blob/67b3c4b1ad4306252c0d83eb37f27a990d2f5d6f/smart_trim.module#L63) is based on Alex Verbruggen's code (https://www.drupal.org/project/drupal/issues/221257#comment-7994743) which he used in his contrib module, Field HTML Trim.

The problem is that he eventually realised this streamlined version doesn't account for alphanumeric characters properly and so doesn't provide consistent results (https://www.drupal.org/project/field_html_trim/issues/2655540#comment-10804866). He ended up removing his streamlined code from Field HTML Trim and instead now uses much longer (but more stable) code based on CakePHP's implementation (https://git.drupalcode.org/project/field_html_trim/-/blob/7.x-1.x/field_html_trim.module#L202).

So I don't think we should put the Smart Trim implementation into core...

ghost commented 2 years ago

Bumping this to the next milestone. Don't want to rush this...

stpaultim commented 2 years ago

@BWPanda - Just checking to make sure you saw the discussion in the dev meeting on Thursday. The general feeling was that maybe we should go ahead and commit your current fix, which is a step in the right direction and deal with the other pre-existing issues later.

However, I'll leave it to you to decide which option is best. If you think we can get this fixed soon, it might be better to wait and get it right, but if we're going to risk another year long delay, I would be in favor of the small step forward ASAP.

We have had a report in the forum of this effecting folks upgrading from Drupal.

ghost commented 2 years ago

I saw, thanks. I want to re-evaluate this, even if we do just action this issue without worrying about HTML trimming. I'll try to get to this soon and detail the situation and my ideas. Not for this release though.

ghost commented 2 years ago

So, here it is:

There exists a function (field_html_trim_truncate()) in the Field HTML Trim D7 module (that is based on a function in CakePHP). As described above this seems to be the most robust way to truncate text - taking into account word boundaries, HTML trimming, ellipsis, etc.

Since this issue is just about trimming to word boundaries, we could try and pull that part of the code out of that function. But if we also want to add the ability to trim without counting HTML characters (https://github.com/backdrop/backdrop-issues/issues/309), then we'll likely need to add most of that function back in later anyway... And vice-versa if https://github.com/backdrop/backdrop-issues/issues/309 gets in before this issue does.

So ideally, we'd just add in that function in its entirety and cover both issues at once. I therefore recommend opening a new issue for putting D7's Field HTML Trim module into Backdrop core (or specifically, the functionality it provides), and then closing this and https://github.com/backdrop/backdrop-issues/issues/309 as duplicates of the new issue.

Thoughts?

ghost commented 2 years ago

Removing myself as advocate as I don't think I'll be getting back to this anytime soon. Happy for anyone else to take over.

ghost commented 1 year ago

I'm closing this issue in favour of https://github.com/backdrop/backdrop-issues/issues/6001