adobe / helix-html-pipeline

A library for rendering the html response in Helix3.
https://www.hlx.live/
Apache License 2.0

Improve SEO description extracted from documents. #705

Open buuhuu opened 2 weeks ago

buuhuu commented 2 weeks ago

Is your feature request related to a problem? Please describe.

Currently, if a page's metadata does not include a description, the description is taken from the first paragraph that has more than 10 words and is not a link:

https://github.com/adobe/helix-html-pipeline/blob/3d3e5dc5e39a601a5f9b225d6409f2cfbdefc138/src/steps/extract-metadata.js#L129-L148
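
For illustration, the condition as described in the sentence above can be modelled roughly like this. It is a minimal sketch, not the code behind the permalink: the flat `blocks` array, the `isLink` flag, and both function names are hypothetical stand-ins for the document tree the pipeline actually walks.

```javascript
// Hypothetical block model (an assumption for this sketch):
//   { type: 'paragraph' | 'heading', text: string, isLink?: boolean }

// Count whitespace-separated words in a string.
function wordCount(text) {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Return the first paragraph that is not a link and has more than 10 words,
// or an empty string if no paragraph qualifies.
function firstParagraphDescription(blocks) {
  const hit = blocks.find(
    (block) => block.type === 'paragraph' && !block.isLink && wordCount(block.text) > 10,
  );
  return hit ? hit.text.trim() : '';
}
```

Under this reading, headings never contribute and short paragraphs are skipped individually, even when together they would make a good description.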

That condition is not precise enough, as it for example

Describe the solution you'd like

I would propose considering multiple paragraphs, and including headings, if the 10-word criterion is not met. For example, a page starting with

## Pronađite raspoložive BMW automobile sa lagera.

Odaberite onaj koji najbolje odgovara vašim potrebama.

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Stock Locator Model Overview                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| ## Pogledajte detalje                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| [/content/dam/metafox/rs/sr/disclaimer-pool/stocklocator/stocklocator-info-icon](/assets/rs/sr/disclaimer-pool/stocklocator/stocklocator-info-icon)   |
|                                                                                                                                                       |
| ### {count} od {count} vozila                                                                                                                         |
|                                                                                                                                                       |
| [/content/dam/metafox/rs/sr/disclaimer-pool/stocklocator/stocklocator-disclaimer](/assets/rs/sr/disclaimer-pool/stocklocator/stocklocator-disclaimer) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| NEMA PRONAĐENIH VOZILA                                                                                                                                |
|                                                                                                                                                       |
| Nažalost, nisu pronađena vozila koja odgovaraju Vašim kriterijumima. Molimo Vas da resetujete filtere i napravite drugačiji izbor.                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+

should have the description "Pronađite raspoložive BMW automobile sa lagera. Odaberite onaj koji najbolje odgovara vašim potrebama."

In any case, an SEO description should probably start with an alphanumeric character.
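
One possible shape of this proposal, as a rough sketch only: headings and paragraphs are accumulated in document order until the combined text exceeds the word threshold, and blocks that are links or do not start with a letter or digit are skipped. The `blocks` model, the `isLink` flag, the function names, and the choice to skip such blocks (rather than trim them) are all assumptions made for illustration.

```javascript
// Hypothetical block model (an assumption for this sketch):
//   { type: 'paragraph' | 'heading', text: string, isLink?: boolean }

// Matches text whose first character is a Unicode letter or digit.
const STARTS_ALPHANUMERIC = /^[\p{L}\p{N}]/u;

// Count whitespace-separated words in a string.
function countWords(text) {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Join headings and paragraphs until more than `minWords` words are collected.
function accumulatedDescription(blocks, minWords = 10) {
  const parts = [];
  let words = 0;
  for (const block of blocks) {
    const text = block.text.trim();
    const eligible = (block.type === 'heading' || block.type === 'paragraph')
      && !block.isLink
      && STARTS_ALPHANUMERIC.test(text);
    if (!eligible) continue; // skip links and text such as "/content/dam/..."
    parts.push(text);
    words += countWords(text);
    if (words > minWords) break; // enough text collected
  }
  return parts.join(' ');
}

// The first two blocks of the example page above (6 + 7 words).
const page = [
  { type: 'heading', text: 'Pronađite raspoložive BMW automobile sa lagera.' },
  { type: 'paragraph', text: 'Odaberite onaj koji najbolje odgovara vašim potrebama.' },
  { type: 'paragraph', text: 'Stock Locator Model Overview' },
];
console.log(accumulatedDescription(page));
// prints "Pronađite raspoložive BMW automobile sa lagera. Odaberite onaj koji najbolje odgovara vašim potrebama."
```

If the threshold is never reached, this sketch returns whatever text it collected, which may be shorter than 10 words; returning an empty string instead would be an equally defensible choice.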

Describe alternatives you've considered

Updating the descriptions, but that requires the content team and takes time.

Additional context

Slack conversation ff

tripodsan commented 2 weeks ago

> Updating the descriptions, but that requires the content team and takes time.

I think most customers are very conscious of SEO and always provide a tailored description via the page metadata. It's probably rarely the case that the first paragraph is a good description. IMO we should remove that feature...

davidnuescheler commented 2 weeks ago

i agree with @tripodsan that a lot of customers set descriptions explicitly, and the automatic description (along with the automatic og:image) is often problematic. personally, i don't like the heuristic approach we use here, and think this is something we could move to a more declarative approach with templated metadata.

i would argue that, especially in BYOM, the automation of the description can easily be more project-specific and intelligent, without a regression risk for existing sites.

on a tangent, why do we support URLs that start with a / in the first place? it seems counter to https://www.aem.live/docs/davidsmodel#rule-4-fully-qualified-urls-only, especially as there is no way to produce those in word or gdoc. maybe we should look into limiting things a little more tightly to make sure that BYOM content can easily be edited in all authoring environments, to allow users to transition between and mix authoring environments.

tripodsan commented 1 week ago

@buuhuu can you change the links to absolute URLs?

buuhuu commented 1 week ago

Not easily. The authors select content using a picker and don't paste URLs as they would in Word authoring. We consider that fair use of the capabilities that AEM as a CMS offers and are reluctant to change the authoring experience. And before we get back into a discussion about whether we should only allow them to author absolute links: there are use cases where they author references that are not previewed/published, like launches.

Making the links absolute programmatically isn't straightforward either, as at the time some content is published we may not know the host yet, and changing it later requires republishing everything.
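
To make the dependency on the host concrete, here is a small sketch using the standard `URL` constructor. The `toAbsolute` helper and the `https://example.com` host are hypothetical; the point is only that the resolved link has the host baked in, so it cannot be produced before the host is known and must be regenerated if the host changes.

```javascript
// Hypothetical helper: resolve a root-relative link against a host.
// If the host is not known yet, the link can only be left as it is.
function toAbsolute(link, host) {
  if (!host) return link;
  return new URL(link, host).href;
}

console.log(toAbsolute('/assets/rs/sr/disclaimer-pool/stocklocator/stocklocator-disclaimer', 'https://example.com'));
// prints "https://example.com/assets/rs/sr/disclaimer-pool/stocklocator/stocklocator-disclaimer"
```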

Having said that, it is actually not about the links but about the link text. This is a poor implementation by the partner. They should have authored a link text for these.