elastic / docs


Malformed document page when URL query contains slash #2214

Open aplhk opened 3 years ago

aplhk commented 3 years ago

I came across a few links from Google search and found that the presence of a slash (/) in the URL query string leads to a malformed, unresponsive document page.

Example of malformed page: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/a

I believe the root cause is in the TOC fetching script: https://github.com/elastic/docs/blob/5b6ac7928c141d9eebeb13d078501a5e77d64d13/resources/web/docs_js/index.js#L253-L260

In this case location.href is https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/a. After the string replacement, the script fetches and appends https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/toc.html, which causes an infinite loop and an unresponsive page.
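A rough sketch of the behavior described above, not a copy of the linked lines: it assumes the script builds the TOC URL by replacing everything after the last slash of the full href. The function names `naiveTocUrl` and `safeTocUrl` are hypothetical, and the second function is only one possible fix, resolving `toc.html` against the page URL with the standard `URL` constructor so the query string is ignored.

```javascript
// Assumed reconstruction of the bug: the replacement runs on the full href,
// so a slash inside the query string is treated as the last path separator.
function naiveTocUrl(href) {
  return href.replace(/[^/]*$/, "toc.html");
}

// Possible fix (an assumption, not the project's actual patch):
// relative URL resolution uses only the base path and drops query and fragment.
function safeTocUrl(href) {
  return new URL("toc.html", href).href;
}

const href =
  "https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/a";

console.log(naiveTocUrl(href)); // ends in ".html?example.com/toc.html"
console.log(safeTocUrl(href)); // ends in "/current/toc.html"
```

With the naive version the request still hits the same document page, because only the query string changed, which matches the looping behavior reported here.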

gtback commented 3 years ago

Thanks, @aplhk! 🙇🏻 I can reproduce the error you're seeing.

Are you able to share the Google search and/or pages that directed you to that URL? They seem malformed in the first place, so in addition to fixing the behavior, I'd like to fix the URLs at the source if that's something we have control of.

aplhk commented 3 years ago

I think this Google dork covers some of the URLs: site:www.elastic.co/guide inurl:ref

https://www.google.com/search?q=site%3Awww.elastic.co%2Fguide+inurl%3Aref

gtback commented 3 years ago

Thanks again, @aplhk!

@AnneB-SEO Do you know where these URLs might be coming from? I don't think we use the ?ref= query parameter anywhere within the docs. Are we able to tell Google not to index these sorts of URLs? I can work on the underlying code that's causing the infinite loop.

AnneB-SEO commented 3 years ago

Do you know where these URLs might be coming from?

I'll need to look into it, but at a quick glance it looks like the links could be coming from 3rd-party sites, like hackermoon.co and driverlayer.com.

I don't think we use the ?ref= query parameter anywhere within the docs.

Likely not

Are we able to tell Google not to index these sorts of URLs?

Yes, but only when we are the ones adding the parameters. If they are coming from a 3rd party, then we can't instruct Google to ignore them.

Let me look into it and also loop in @brianjolly for good measure : )

brianjolly commented 3 years ago

It looks like Google's URL Parameters tool might be able to help.

https://support.google.com/webmasters/answer/6080548

It says the requirements for using the tool are:

Would you say this issue falls in that category?

gtback commented 3 years ago

Thanks, @brianjolly , that looks promising. I'd want to first confirm that the equivalent pages are getting indexed without the ?ref parameter, but if so, I think we can tell it to ignore any pages with a ref query param.

AnneB-SEO commented 2 years ago

@brianjolly & @gtback - The parameter exclusion only applies to pages we create, not to pages created by others. Even so, I added the ref parameter exclusion on 9/14.

[Image: Ref-parameter-exclusion-added-09-14-2021]

AnneB-SEO commented 2 years ago

This problem is more extensive than it first appeared, and it is expanding. When this was originally raised there were ~7 URLs from 2 different sites (hackermoon.co and driverlayer.com). Today there are over 80, and more than just the docs are being targeted, including Elasticon.

We'll need to file a DMCA takedown notice with Google thru Legal based on:

[Image: Ref-parameter-SERPs-hackermoon-09-22-2021]

Thanks for finding and raising this, @aplhk. Let's leave this one open until we file. Thanks all!!!