KhronosGroup / OpenCL-Registry

OpenCL API and Extension Registry.

Many pages in OpenCL SDK 3.0 with duplicate content #147

Open outofcontrol opened 6 months ago

outofcontrol commented 6 months ago

There are 247 pages in the Khronos OpenCL Registry that do not have unique content. The URL is different but content identical. For example:

https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_event.html
https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_kernel.html
https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_command_queue.html
https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_mem.html
https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_platform_id.html

Search engines tend not to index, or to treat unfavorably, pages with duplicate content unless they carry a canonical tag pointing to the source.

Is there anything that can be done to make the content different on these pages? The following list contains all pages that share a title and possibly have identical page content. The page title is in the second field.
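One way to confirm which pages really are identical, rather than only sharing a title, is to hash each response body and group URLs by digest. The following is a minimal sketch, not the method used to produce the list; `group_by_content` and its `pages` input (a mapping of URL to body bytes, for example collected by a crawl) are hypothetical names introduced for illustration.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages):
    """Group URLs whose response bodies are byte-for-byte identical.

    `pages` maps URL -> body bytes (hypothetical input, e.g. from a crawl).
    Returns a sorted list of URL groups, keeping only groups with more
    than one URL.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        # Identical bodies produce identical SHA-256 digests.
        groups[hashlib.sha256(body).hexdigest()].append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)
```

Pages that differ only in whitespace or a timestamp would land in separate groups with this approach, so a real check might need to normalize the HTML first.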

bashbaug commented 6 months ago

Thanks, this is super helpful!

I believe this is because we redirect a lot of individual pages in the online reference pages (like the math built-in functions) or types (like cl_context) to one common page. This is intentional, but perhaps we need to consider a different solution instead?

CC @oddhack

oddhack commented 6 months ago

I don't know what could realistically be done without massive, massive restructuring of the specifications. Most of the OpenCL C functions, and some of the OpenCL APIs as well, are described in gigantic tables, one row/function or API. It is natural to generate refpages that contain those tables, and redirects from the individual functions to the refpage with the table it's in.

I suppose the scripts could be enhanced to be clever enough to extract out each separate row from a table and combine it with surrounding descriptions, generating a distinct, extremely short HTML page for every function, but that would also be a great deal of work (edit to add: and would be very fragile as it would necessarily make assumptions about how table markup is structured that would periodically break when someone does something new).

ATM all the functions in the table get redirected to the refpage containing them via the .htaccess generated at refpage build time. I think on balance that's the right answer. When I search for a random function like get_local_size in Google or DDG the right thing mostly seems to result, modulo the fact that man.opencl.org is currently not redirecting properly per https://github.com/KhronosGroup/OpenCL-Docs/issues/1053. Can someone explain what was done, and how much control we have over changing what was done, with those redirections?
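For readers unfamiliar with the .htaccess approach mentioned above, here is a rough sketch of how a build step could emit one rule per function name. It is not the actual refpage script: the `htaccess_rules` function, the input mapping, and the choice of mod_rewrite `RewriteRule` directives (rather than some other directive) are all assumptions made for illustration.

```python
import re


def htaccess_rules(table_pages):
    """Build .htaccess text mapping each alias page to its table refpage.

    `table_pages` maps a refpage name to the function names described in
    it (hypothetical input). Assumes Apache mod_rewrite is available.
    """
    lines = ["RewriteEngine On"]
    for target, aliases in sorted(table_pages.items()):
        for alias in sorted(aliases):
            # Escape the dot so the pattern matches the literal file name.
            pattern = re.escape(alias + ".html")
            lines.append(f"RewriteRule ^{pattern}$ {target}.html [L]")
    return "\n".join(lines) + "\n"


print(htaccess_rules({"workItemFunctions": ["get_global_offset"]}))
```

A rule without an `R` flag rewrites internally, so the browser URL stays the same while the content comes from the target page, which would be consistent with the "different URL, identical content" observation; whether the real rules work this way is not confirmed here.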

@outofcontrol can you expand on what "have a canonical tag pointing to the source" would translate to in terms of the generated HTML?

outofcontrol commented 6 months ago

> the fact that man.opencl.org is currently not redirecting properly per https://github.com/KhronosGroup/OpenCL-Docs/issues/1053.

The site is owned by Khronos member Stream High Speed Computing. We had requested a redirect which is what we have now. We will attempt to get the redirect updated.

> @outofcontrol can you expand on what "have a canonical tag pointing to the source" would translate to in terms of the generated HTML?

Keeping in mind that I am not familiar with each file's contents or how the pages are built, my thought was to have every identical page carry a canonical tag pointing to a single file. For example, all the files with title 'workItemFunctions(3)' are in theory identical (not verified). If we pick one file, get_global_offset.html, and make it the canonical source, we would then add to all of the workItemFunctions(3) files:

<link rel="canonical" href="https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/get_global_offset.html">

This would show search engines that all of the pages with title 'workItemFunctions(3)' are identical and the search engine should only index get_global_offset.html.
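If the tag were added as a post-processing step over the generated HTML, it might look roughly like the sketch below. This assumes each page has a literal `</head>` to insert before; `add_canonical` is a hypothetical helper, not part of the existing build, and the base URL is taken from the example above.

```python
import html
import re

BASE = "https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/"


def add_canonical(page_html, canonical_name, base=BASE):
    """Return page_html with a canonical <link> inserted before </head>."""
    # Leave pages that already declare a canonical URL untouched.
    if re.search(r'<link[^>]+rel=["\']canonical["\']', page_html, re.IGNORECASE):
        return page_html
    href = html.escape(base + canonical_name, quote=True)
    tag = f'<link rel="canonical" href="{href}">'
    new_html, count = re.subn(
        r"</head>",
        lambda m: tag + "\n" + m.group(0),
        page_html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("no </head> found; cannot insert canonical tag")
    return new_html
```

Calling it twice on the same page is a no-op the second time, which keeps the step safe to rerun on every build.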

Worth mentioning this could also be done manually in the sitemap file, which may be easier than altering the build.
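As a sketch of the sitemap route, the idea would be to list only the chosen canonical URL for each group of identical pages and omit the rest. Search engines are generally understood to treat sitemap inclusion as a hint rather than a directive, so this is weaker than a canonical tag. `build_sitemap` is a hypothetical helper; only the `urlset`/`url`/`loc` elements of the sitemap protocol are used.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_urls):
    """Return sitemap XML listing each canonical URL exactly once."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # Deduplicate and sort so the output is stable between builds.
    for url in sorted(set(canonical_urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```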

oddhack commented 6 months ago

From my PoV, there are not multiple workItemFunctions files - there is a single file, and a large number of .htaccess redirects to it for the individual functions described in that file. If the sitemap file is something that is directory-specific then potentially we could construct it as part of the refpage build along with .htaccess. Or we could make the canonical file the actual filename, so that workGroupFunctions.html has a canonical tag pointing to workGroupFunctions.html, as do all the redirects to that file under other names. Is there a reasonable way to test this stuff and see what Google will do with it, before committing ourselves?
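A local check cannot show what Google will do, but it can at least confirm that every alias URL serves a page declaring the expected canonical before anything is deployed. The sketch below, using only the standard library, extracts canonical links from a page's HTML; `canonical_of` is a hypothetical name, and fetching the pages is left out.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.hrefs.append(attributes.get("href"))


def canonical_of(page_html):
    """Return the list of canonical hrefs declared in page_html."""
    finder = CanonicalFinder()
    finder.feed(page_html)
    return finder.hrefs
```

A page returning an empty list, or more than one entry, would be worth flagging, since either case leaves the canonical choice up to the search engine.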

N.b. the Vulkan refpages also have some redirects like this, although not many compared to OpenCL.