OpenLiberty / openliberty.io

Open Liberty website
https://openliberty.io

Redirect javadoc URLs for the package iframe to the full javadoc page #2665

Open kinueng opened 2 years ago

kinueng commented 2 years ago

Problem

Some of the iframe content from the javadocs is being indexed by search engines.

Recreation steps

A subset of URLs that should not be indexed by search engines.

Possible Solution 1

Possible Solution 2

Redirect these URLs to the appropriate page. A decision needs to be made about which page is considered "appropriate". Here are some choices:

  1. https://openliberty.io/docs/latest/reference/javadoc/liberty-jakartaee9.1-javadoc.html?package=jakarta/enterprise/inject/package-frame.html&class=overview-summary.html
    • Drawback: only the bottom-left iframe changes to jakarta.enterprise.inject.

image

  2. https://openliberty.io/docs/latest/reference/javadoc/liberty-jakartaee9.1-javadoc.html?class=jakarta/enterprise/inject/package-summary.html&package=allclasses-frame.html
    • Drawback: only the right iframe changes to jakarta.enterprise.inject.

image

  3. https://openliberty.io/docs/latest/reference/javadoc/liberty-jakartaee9.1-javadoc.html?package=jakarta/enterprise/inject/package-frame.html&class=jakarta/enterprise/inject/package-summary.html
    • This looks like the best result: all frames are in context for the jakarta.enterprise.inject package.

image
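
As a rough illustration of the third choice above, the mapping from a package-frame.html URL to the full javadoc page could look like the Python sketch below. This is not the site's actual redirect code. The function name `redirect_package_frame` and the constants are hypothetical, and the input path layout is assumed from the `/docs/modules/reference/...` URLs quoted in this issue.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed URL layout, taken from the example URLs in this issue.
SITE = "https://openliberty.io"
MODULES_PREFIX = "/docs/modules/reference/"
FRAME_FILE = "/package-frame.html"


def redirect_package_frame(url: str) -> Optional[str]:
    """Return the full javadoc page URL for a package-frame.html URL, else None."""
    path = urlsplit(url).path
    if not path.startswith(MODULES_PREFIX) or not path.endswith(FRAME_FILE):
        return None
    # Split "<javadoc-name>/<package/dir>/package-frame.html".
    javadoc_name, _, rest = path[len(MODULES_PREFIX):].partition("/")
    package_dir = rest[: -len(FRAME_FILE)]
    if not javadoc_name or not package_dir:
        return None
    # Point both iframes at the same package so all frames are in context.
    return (
        f"{SITE}/docs/latest/reference/javadoc/{javadoc_name}.html"
        f"?package={package_dir}/package-frame.html"
        f"&class={package_dir}/package-summary.html"
    )
```

Called with `https://openliberty.io/docs/modules/reference/liberty-jakartaee9.1-javadoc/jakarta/enterprise/inject/package-frame.html`, the sketch produces the URL shown in the third choice.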

kinueng commented 2 years ago

Found similar URLs that users should not be using, but they are showing up in search results.

Search example: https://www.google.com/search?q=Package+jakarta.batch.runtime.context+-+Open+Liberty

kinueng commented 2 years ago

We think the code that takes a URL like https://openliberty.io/docs/modules/reference/liberty-jakartaee9.1-javadoc/jakarta/batch/api/AbstractBatchlet.html and redirects it to https://openliberty.io/docs/latest/reference/javadoc/liberty-jakartaee9.1-javadoc.html?package=jakarta/batch/api/package-frame.html&class=jakarta/batch/api/AbstractBatchlet.html is unable to handle the two broken URL examples in the issue description.

Start by looking at the code that transforms the URLs and performs the redirect.
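
For reference while reading that code, here is a hypothetical Python sketch of the class-page transformation described above. It is only a model of the behavior in this comment, not the site's real implementation. The name `redirect_class_page` and the `FRAME_SUFFIXES` tuple are assumptions made for illustration. The sketch deliberately returns `None` for frame files to show the kind of gap being discussed.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed URL layout, taken from the example URLs in this issue.
SITE = "https://openliberty.io"
MODULES_PREFIX = "/docs/modules/reference/"
# Hypothetical: file name endings that mark iframe-only pages.
FRAME_SUFFIXES = ("-frame.html", "-noframe.html")


def redirect_class_page(url: str) -> Optional[str]:
    """Map a standalone class page to the framed javadoc page, else None."""
    path = urlsplit(url).path
    if not path.startswith(MODULES_PREFIX) or not path.endswith(".html"):
        return None
    # "<javadoc-name>/<package/dir>/<Class>.html"
    javadoc_name, _, page = path[len(MODULES_PREFIX):].partition("/")
    package_dir, sep, filename = page.rpartition("/")
    # Top-level files (no package directory) and frame files are not handled.
    if not sep or filename.endswith(FRAME_SUFFIXES):
        return None
    return (
        f"{SITE}/docs/latest/reference/javadoc/{javadoc_name}.html"
        f"?package={package_dir}/package-frame.html"
        f"&class={page}"
    )
```

With the AbstractBatchlet URL above, the sketch returns the redirect target quoted in this comment, while a URL ending in overview-frame.html or package-frame.html falls through unhandled.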

Sreejith-Websphere commented 2 years ago

Hi @kinueng, I noticed that for this issue, none of the Jakarta EE, Java EE, or MicroProfile javadoc URLs that end in frame.html are redirecting, e.g.:

https://openliberty.io/docs/modules/reference/liberty-javaee8-javadoc/overview-frame.html
https://openliberty.io/docs/modules/reference/liberty-javaee8-javadoc/allclasses-frame.html
https://openliberty.io/docs/modules/reference/liberty-javaee8-javadoc/allclasses-noframe.html
https://openliberty.io/docs/modules/reference/liberty-javaee8-javadoc/javax/annotation/package-frame.html

https://openliberty.io/docs/modules/reference/microprofile-5.0-javadoc/overview-frame.html
https://openliberty.io/docs/modules/reference/microprofile-5.0-javadoc/allclasses-frame.html

https://openliberty.io/docs/modules/reference/liberty-jakartaee9.1-javadoc/overview-frame.html
https://openliberty.io/docs/modules/reference/liberty-jakartaee9.1-javadoc/jakarta/activation/package-frame.html

kinueng commented 1 year ago

The last remaining piece is to decide how to mark the javadoc files used for iframes as noindex so that search engines do not index them. Examples of the iframe files are in comment https://github.com/OpenLiberty/openliberty.io/issues/2665#issuecomment-1170319592. We cannot add a noindex robots meta tag to the files directly because they are generated by the javadoc command.

natalie-bernhard commented 4 months ago

We may be able to use rules in our robots.txt to keep crawlers away from the iframes by adding rules specific to the files that are still causing issues (overview-frame.html, package-frame.html, allclasses-frame.html). If using the * user agent prevents the iframes from loading on the site, we can instead target only the bot that indexes for Google Search.

More info here: https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt#create_rules
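
To make the idea concrete, here is a small Python sketch that holds a candidate set of rules and checks example paths against them. The rules are an untested assumption, not the site's current robots.txt. They rely on the `*` wildcard and `$` end anchor that Google documents for robots.txt paths. The helper names are hypothetical, and the matcher is simplified: it ignores user-agent grouping and Allow rules and only evaluates Disallow patterns.

```python
import re

# Candidate rules (assumption): block only the three iframe files named above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /docs/modules/reference/*/overview-frame.html$
Disallow: /docs/modules/reference/*/allclasses-frame.html$
Disallow: /docs/modules/reference/*/package-frame.html$
"""


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a Google-style robots.txt path pattern into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # "*" matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def disallow_patterns(robots_txt: str) -> list:
    """Collect the non-empty Disallow values, ignoring user-agent groups."""
    values = []
    for line in robots_txt.splitlines():
        if line.lower().startswith("disallow:"):
            value = line.split(":", 1)[1].strip()
            if value:
                values.append(value)
    return values


def is_blocked(path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """True if any Disallow pattern matches the path from its start."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns(robots_txt))
```

Under these assumed rules, `is_blocked("/docs/modules/reference/liberty-javaee8-javadoc/overview-frame.html")` is true, while a class page such as `/docs/modules/reference/liberty-jakartaee9.1-javadoc/jakarta/batch/api/AbstractBatchlet.html` and the full javadoc page under `/docs/latest/reference/javadoc/` are left crawlable.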