Open jfrank-nih opened 2 years ago
@zhuomingao, @blairl-nih thought this might be a Nutch config issue based on the web app content type pages being blank and not getting included.
I'm not sure if that's the case as (after talking with Blair) I did find the other two dictionaries in the search results, albeit with "Untitled" as the title.
This ticket is low priority as the user can get to the dictionary via search, just not directly. Also, if you'd rather have it over in sitewide-search-app let me know.
The reason that www.cancer.gov/publications/dictionaries/cancer-terms isn't returned in the top results is that the title tag is missing. Nutch can't find a title tag in the HTML head, so the indexed document is missing the title field, which is the most important field in calculating ranking. Pages that are missing a title tag also show up as "Untitled" when displayed in search results. The solution is to add a title tag to these application module pages so they can be returned as top results.
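To make the point above concrete, here is a minimal sketch of checking whether the static HTML (what a non-JavaScript crawler sees) contains a title. This is not Nutch's parser; it uses Python's standard-library `html.parser`, and the names `TitleFinder` and `static_title` are made up for illustration.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text of the first <title> element found in the markup."""

    def __init__(self):
        super().__init__()
        self.found = False      # True once a <title> start tag is seen
        self.parts = []         # text chunks inside the first <title>
        self._in_title = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True
            self.found = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True   # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def static_title(html):
    """Return the title text in the raw HTML, or None if no <title> exists."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    if not finder.found:
        return None
    return "".join(finder.parts).strip()
```

Running `static_title` on the page source as served (before any JavaScript runs) would return `None` for a page whose title is only set client-side, which is the situation a crawler without JavaScript execution would be in.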
Huh. That's odd. The window has a title so it must be getting set in JavaScript by the react app itself. Thanks @zhuomingao. I'll find the correct location to move this ticket out to then.
I wonder if this is the case for all our react apps...
@jfrank-nih That makes it a platform issue. Looking at the page source, there's no <title> element, but looking in the CMS, both the "Page Title" and "Browser Title" fields are set.
Issue description
When searching for the dictionary of cancer terms (or the other dictionaries), the page that actually contains the dictionary is not at the top (or very near the top) of the search results. In the case of the dictionary of cancer terms, it's not even showing up in the first 5 pages.
Per @zhuomingao the reason for this is that the <title> tag is missing from the page's HTML head.
The reason the <title> tag is missing is that it is listed in removeHeadElements inside the drupalConfig settings on the react app page. And the reason for that is that having the title tag present was causing Googlebot to incorrectly penalize the indexed SPA pages as duplicative results. See NCIOCPL/cgov-digital-platform#2929.
In theory per Google including a
A solution that works for both Google and Nutch would be a prerendering service (e.g., prerender.io), which would allow us to serve fully formed HTML pages to crawlers rather than the SPA.
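As a rough sketch of the routing decision a prerendering setup implies: requests from crawlers get the prerendered HTML, and everyone else gets the SPA. This is not prerender.io's API or configuration; the token list and the function names (`CRAWLER_TOKENS`, `is_crawler`, `choose_variant`) are assumptions for illustration, and the actual user-agent string for our Nutch crawler would need to be confirmed from its config.

```python
# Hypothetical list of lowercase substrings identifying crawler user agents.
# "nutch" is an assumption; the real agent name is set in the Nutch config.
CRAWLER_TOKENS = ("googlebot", "nutch")


def is_crawler(user_agent):
    """Return True if the User-Agent header looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def choose_variant(user_agent):
    """Pick which response to serve: prerendered HTML or the SPA shell."""
    return "prerendered" if is_crawler(user_agent) else "spa"
```

For example, `choose_variant("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")` selects the prerendered variant, while an ordinary browser user agent falls through to the SPA.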
Steps to reproduce the issue
What's the expected result?
What's the actual result?
Additional details / screenshot