flarum / framework

Simple forum software for building great communities.
http://flarum.org/

Google indexes a discussion page using content from other pages #3370

Open handymenny opened 2 years ago

handymenny commented 2 years ago

Bug Report

Current Behavior

Google's crawler indexes page X of a discussion (discussion/?page=X) using text/images from other pages as well. This means that what Google's crawler sees doesn't always match what users see once they open that page.

Example

Google "ok, ma io ho fatto questi test a scopo informativo/illustrativo site:forum.fibra.click" (Italian: "ok, but I did these tests for informational/illustrative purposes")

[Screenshot of the search results] The second result is page 12, but that post belongs to page 10.

Expected Behavior

Google's crawler should only look at posts on a specific page; i.e., in the above example Google should link to page 10.

Environment

Possible Solutions

  1. (Hacky) Disable JavaScript for all crawlers (limited to discussions)
  2. Disable posts auto-loading for crawlers, adding a "load more" button that points to the next page
  3. Disable posts auto-loading for all users, adding a "load more" button that enables posts auto-loading

matteocontrini commented 2 years ago

I also notice that in the second result in the screenshot the number of posts shown is wrong. It says 20 but that's not the total number of posts in the discussion.

askvortsov1 commented 2 years ago

(2) seems like a reasonable solution. The button already exists, so we'd just need to add a userAgent check to determine whether a given user is a bot before triggering autoload.
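
For illustration, a minimal sketch of that check, assuming a regex-based userAgent test (the regex and function names here are illustrative, not Flarum internals):

```ts
// Hypothetical crawler check gating the post-stream autoload.
const CRAWLER_UA = /bot|crawler|spider|googlebot|bingbot|slurp/i;

function isCrawler(): boolean {
  return CRAWLER_UA.test(navigator.userAgent);
}

// Called when the user scrolls near the end of the loaded posts.
function onNearEnd(loadNextPage: () => void): void {
  if (isCrawler()) {
    // Crawlers keep the existing "Load more" button, which links to the
    // next ?page=N URL, so each indexed page contains only its own posts.
    return;
  }
  loadNextPage();
}
```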

Another solution could be updating the canonical URL as the page scrolls, but that comes with caveats of its own.
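
A sketch of what that could look like, purely as an assumption about the approach (the helper name is hypothetical, and whether crawlers would ever observe a canonical rewritten on scroll is an open question):

```ts
// Rewrite the canonical link as the visible page changes while scrolling.
function updateCanonical(page: number): void {
  const url = new URL(window.location.href);
  url.searchParams.set('page', String(page));

  let link = document.querySelector<HTMLLinkElement>('link[rel="canonical"]');
  if (!link) {
    link = document.createElement('link');
    link.rel = 'canonical';
    document.head.appendChild(link);
  }
  link.href = url.toString();
}
```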

I'll also note that there's a bit more duplication than I'd like, as it seems like search engines have both the old post number-based links and the new page-based links stored. But that should get resolved naturally over time.

> I also notice that in the second result in the screenshot the number of posts shown is wrong. It says 20 but that's not the total number of posts in the discussion.

This one has me a bit stumped. I'm not sure where exactly Google is getting this information; it doesn't seem to show up in search results for NodeBB or Discourse communities, and from a few quick Google searches I can't tell whether there's a meta tag we could use to provide accurate information on this. @Hari-Bonda @jaspervriends or anyone else knowledgeable in SEO, any ideas as to where this is coming from?

davwheat commented 2 years ago

Not sure if the SEO extension embeds JSON-LD, but that might be it?

matteocontrini commented 2 years ago

> Not sure if the SEO extension embeds JSON-LD, but that might be it?

The extension has an option for listing the posts in JSON-LD, but I haven't enabled it for performance reasons (I'm the owner of the forum in the example above). Also, I'm pretty sure that the post count and the other metadata were there even before the extension existed, so I guess it's just Google magic.
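
For reference, if structured data were the source, a post count would typically surface through schema.org markup such as commentCount. A purely hypothetical sketch of what such an embed could look like (this is not what the extension emits):

```ts
// Illustrative JSON-LD carrying a post count via schema.org's commentCount.
const discussionJsonLd = {
  '@context': 'https://schema.org',
  '@type': 'DiscussionForumPosting',
  headline: 'Example discussion title',
  commentCount: 20,
};

const script = document.createElement('script');
script.type = 'application/ld+json';
script.textContent = JSON.stringify(discussionJsonLd);
document.head.appendChild(script);
```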

Hari-Bonda commented 2 years ago

I have been aware of this issue since Aug 2021. We had to disable indexing of subpages (posts) until this issue gets resolved, so we went with the Discussion Canonical URL extension as a temporary solution: https://github.com/SychO9/flarum-discussion-canonical-url

Coming to the solution:

When Flarum breaks discussions into pages, for example:

?page=1, ?page=2, ?page=3

you should change the page title too, e.g. "test discussion - page 1 of 2" and "test discussion - page 2 of 2".

This title handling was implemented in Flarum v1.2, and using Linguist I have made a few tweaks to get the page title output I need.
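
A sketch of that title pattern, with a hypothetical helper name (this is not a Flarum or Linguist API):

```ts
// Build a per-page discussion title like "test discussion - page 2 of 2".
function buildPageTitle(discussion: string, page: number, totalPages: number): string {
  return totalPages > 1
    ? `${discussion} - page ${page} of ${totalPages}`
    : discussion;
}

// buildPageTitle('test discussion', 2, 2) === 'test discussion - page 2 of 2'
```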

The second most important thing: when a group of posts is served as its own page (?page=2), for that particular page Flarum should use the content of the page's first post (the 21st post, in the case of the second page) as the page description.

Let us say the second page displays the 20th post as its first post; you should then use the 20th post's content as the meta description for that page. If you fail to do this, you will end up with duplicate titles and descriptions across all pages, which is a huge SEO mistake: Google will consider it an attempt to fool the spider with duplicate content and will get confused about how to rank the main discussion page.
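
As a sketch of that rule, assuming 20 posts per page and hypothetical names throughout:

```ts
// Derive a per-page meta description from the first post shown on that page.
const POSTS_PER_PAGE = 20;

function pageDescription(posts: string[], page: number): string {
  const first = posts[(page - 1) * POSTS_PER_PAGE] ?? '';
  return first.slice(0, 160); // keep within a typical snippet length
}
```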

I don't know how the JSON or JS side works, but maintain a separate DB or something to get the data the way I have described.

I am not an expert, but if you observe WordPress pages you can easily notice this.