NCIOCPL / cgov-digital-platform

The Cancer.gov Digital Communications Platform
GNU General Public License v2.0

React app pages that remove title are heavily penalized in search results #3261

Open jfrank-nih opened 2 years ago

jfrank-nih commented 2 years ago

Issue description

When searching for the dictionary of cancer terms (or the other dictionaries), the page that actually contains the dictionary is not at, or even near, the top of the search results. The dictionary of cancer terms does not show up anywhere in the first 5 pages of results.

Per @zhuomingao, the reason for this is that the <code>&lt;title&gt;</code> tag is missing, and the title is the most important thing for Nutch to work from when crawling.</p> <p>The reason the <code>&lt;title&gt;</code> tag is missing is that it is removed using <code>removeHeadElements</code> inside the <code>drupalConfig</code> settings on the React app page. It is removed because having the title tag present was causing Googlebot to incorrectly penalize the indexed SPA pages as duplicate results. See NCIOCPL/cgov-digital-platform#2929. </p> <p>In <em>theory</em>, <a rel="noreferrer nofollow" target="_blank" href="https://developers.google.com/search/docs/advanced/javascript/javascript-seo-basics">per Google</a>, including a <code>&lt;title&gt;</code> tag and then updating it isn't a problem, but practice proved to be different.</p> <p>A solution for both Google and Nutch simultaneously would be a prerendering service (e.g. prerender.io), which would allow us to serve fully formed HTML pages to crawlers rather than the SPA. </p> <blockquote> <p><strong>ESTIMATE</strong> TBD</p> </blockquote> <h3>Steps to reproduce the issue</h3> <ol> <li>Go to www.cancer.gov</li> <li>Search for "dictionary of cancer terms" (<a rel="noreferrer nofollow" target="_blank" href="https://www.cancer.gov/search/results?swKeyword=dictionary+of+cancer+terms">https://www.cancer.gov/search/results?swKeyword=dictionary+of+cancer+terms</a>) <ul> <li>Note that the top result is the widget (<a rel="noreferrer nofollow" target="_blank" href="https://www.cancer.gov/widgets/termdictionarywidgetenglish">https://www.cancer.gov/widgets/termdictionarywidgetenglish</a>)</li> <li>Note that the second result is the dictionaries <em>landing</em> page (<a rel="noreferrer nofollow" target="_blank" href="https://www.cancer.gov/publications/dictionaries">https://www.cancer.gov/publications/dictionaries</a>)</li> </ul></li> <li>Look for a link to the actual dictionary itself (<a rel="noreferrer nofollow" target="_blank" 
href="https://www.cancer.gov/publications/dictionaries/cancer-terms">https://www.cancer.gov/publications/dictionaries/cancer-terms</a>)</li> </ol> <h3>What's the expected result?</h3> <ul> <li>A link to the actual page that houses the cancer terms dictionary will be in the first page of results</li> </ul> <h3>What's the actual result?</h3> <ul> <li>There is no link to the actual page in the first 5 pages of results, at least</li> <li>I did find links to the genetics-dictionary and cancer-drug dictionary on the 3rd page of the results and they had "Untitled" as the link title</li> </ul> <h3>Additional details / screenshot</h3> <ul> <li><img referrerpolicy="no-referrer" src="https://user-images.githubusercontent.com/93668289/144292198-47096393-9019-4d27-9f3e-636ad430413c.png" alt="Screen Shot 2021-12-01 at 1 26 56 PM" /></li> <li><img referrerpolicy="no-referrer" src="https://user-images.githubusercontent.com/93668289/144292200-22a99621-2974-4b6c-b2fb-65f95cace5b0.png" alt="Screen Shot 2021-12-01 at 1 26 37 PM" /></li> </ul> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/jfrank-nih"><img src="https://avatars.githubusercontent.com/u/93668289?v=4" />jfrank-nih</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>@zhuomingao, @blairl-nih thought this might be a Nutch config issue based on the web app content type pages being blank and not getting included.</p> <p>I'm not sure if that's the case as (after talking with Blair) I did find the other two dictionaries in the search results, albeit with "Untitled" as the title.</p> <p>This ticket is low priority as the user <em>can</em> get to the dictionary via search, just not directly. 
Also, if you'd rather have it over in sitewide-search-app, let me know.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/zhuomingao"><img src="https://avatars.githubusercontent.com/u/22478221?v=4" />zhuomingao</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>The reason that www.cancer.gov/publications/dictionaries/cancer-terms doesn't get returned in the top results is that the title tag is missing. Nutch can't find the title tag in the HTML head, so the indexed result is missing the title field, which is the most important field in calculating ranking. Pages that are missing the title tag are also shown as "Untitled" when they appear in search results. The solution is to add a title tag to these application module pages so they can be returned as top results.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/jfrank-nih"><img src="https://avatars.githubusercontent.com/u/93668289?v=4" />jfrank-nih</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>Huh. That's odd. The window has a title, so it must be getting set in JavaScript by the React app itself. Thanks @zhuomingao. I'll find the correct location to move this ticket to, then.</p> <p>I wonder if this is the case for all our React apps...</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/blairlearn"><img src="https://avatars.githubusercontent.com/u/21063180?v=4" />blairlearn</a> commented <strong> 2 years ago</strong> </div> <div class="markdown-body"> <p>@jfrank-nih That makes it a platform issue. 
Looking at the page source, there's no <code>&lt;title&gt;</code> element, but looking in the CMS, both the "Page Title" and "Browser Title" fields are set.</p> </div> </div> </body> </html>