SEO: resolve potential duplicate content

marekcierny commented 8 years ago

Several examples of potential duplicate content exist:

1. switching between language versions: https://anatom.cz/en/ - https://practiceanatomy.com/
2. user registration: anatom.cz/view/LE/?sessionid= - anatom.cz/view/LE/
3. view of a particular image: https://anatom.cz/view/04/?context=svaly-krkusvg - https://anatom.cz/view/04/

Duplicate content should be a) avoided if possible, b) resolved by a 301 redirect, or c) resolved by <link rel="canonical"> (https://support.google.com/webmasters/answer/139066).

slaweet commented 8 years ago

1. Has been resolved by a 301 redirect for some time, but Google still hasn't reindexed it. More recently I tried to disallow the /en/ and /cs/ URLs in robots.txt.
2. Has been resolved for some time as well; it's just still hanging in the Google index.
3. Is a TODO.

marekcierny commented 8 years ago

I would argue against disallowing the /en/ and /cs/ URLs in robots.txt, as any link to a URL that robots cannot access leads to a loss of page rank (and the disallow rule might prevent the crawler from seeing the redirect).
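To make the concern concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are an assumed reconstruction of the disallow lines under discussion, not the site's actual robots.txt, and `crawler_may_fetch` is a hypothetical helper name. A compliant crawler that may not fetch /en/ never requests it, so it has no chance to observe a 301 response there.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, reconstructed from the discussion (not the real robots.txt).
ASSUMED_RULES = [
    "User-agent: *",
    "Disallow: /en/",
    "Disallow: /cs/",
]


def crawler_may_fetch(url, rules=ASSUMED_RULES, agent="*"):
    """Return True if a robots.txt-compliant crawler may request `url`."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse rules given as an iterable of lines
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("https://anatom.cz/en/", "https://anatom.cz/practice/"):
        print(url, crawler_may_fetch(url))
```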

Ultimately, I think c) <link rel="canonical"> should be added to every page to resolve any potential duplicate content we might miss (e.g. tracking campaigns and traffic sources).

marekcierny commented 8 years ago

I wrote a simple PHP function that rewrites any URL into a "canonical URL": canonical.TXT

If "echo get_canonical_meta($url)" can be added to every page, it can help us explain our duplicate content to search engines.

papousek commented 8 years ago

Unfortunately, the application is written in Python, so we cannot include your script in every page view directly. On the other hand, I assume we are able to rewrite it in Python (@slaweet?)

slaweet commented 8 years ago

I added canonical URLs (https://github.com/adaptive-learning/anatomy/commit/f9a74540f5909696570687e7e6145c312b413bd1). I'm just stripping the query string (everything after ?). Because of that, I changed /overview/?tab=location to /overview/tab/location. I didn't implement the part that changes the domain in the canonical URL, because it would never get executed: the 301 redirect is executed first, and by then we are already on the correct domain.
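For readers who don't want to open the commit, the idea can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function names are hypothetical, and dropping the fragment along with the query string is an added assumption.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Return `url` without its query string (everything after '?').

    The fragment is dropped as well; that part is an assumption of this sketch.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url):
    """Render a <link rel="canonical"> tag for the page at `url`."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


if __name__ == "__main__":
    print(canonical_link_tag("https://anatom.cz/view/04/?context=svaly-krkusvg"))
```

With this rule, both anatom.cz/view/LE/?sessionid= and anatom.cz/view/LE/ map to the same canonical address.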

slaweet commented 8 years ago

As for disallowing /en/ and /cs/, I removed it from robots.txt, but I don't see why it should influence the page rank of any pages other than the ones with /en/ and /cs/, which we don't want in search results anyway. And IMO we don't want Google to see the redirect, but rather to reach the alternative language version directly through <link rel="alternate" ...

marekcierny commented 8 years ago

OK. As for disallowing /en/ and /cs/ in robots.txt: http://webmasters.stackexchange.com/questions/54240/is-it-safe-to-block-redirected-but-still-linked-urls-with-robots-txt (In general, my understanding is that disallowing robots from any URL we link to within our site is not good.)

The canonical form of the URL is also related to the <link rel="alternate" ...> sitemap: only canonical forms of URLs should be linked as another language version. For example, on https://anatom.cz/practice//, the canonical URL is https://anatom.cz/practice/, and the alternate language versions should also end with only one /.

slaweet commented 8 years ago

I've updated <link rel="alternate" (https://github.com/adaptive-learning/anatomy/commit/1d33303af3718a526b0f67a16b8def5436faafcf), even though I don't think it matters what is on the non-canonical pages, as Google is only going to look at (index) the canonical ones. I've also added '//' -> '/' replacement to canonical url.
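As a rough illustration of the two changes (not the code from the commit), the sketch below collapses repeated slashes and builds alternate links from the cleaned path. The language-to-domain mapping is an assumption taken from the URLs quoted in this thread, the helper names are hypothetical, and the regex collapses any run of slashes rather than doing a single '//' -> '/' replacement.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed language-to-domain mapping, based on the URLs quoted in this thread.
ASSUMED_DOMAINS = {"cs": "anatom.cz", "en": "practiceanatomy.com"}


def canonical_path(path):
    """Collapse runs of slashes, e.g. '/practice//' -> '/practice/'."""
    return re.sub(r"/{2,}", "/", path)


def alternate_links(url, domains=ASSUMED_DOMAINS):
    """Build <link rel="alternate"> tags that point at canonical paths only."""
    path = canonical_path(urlsplit(url).path)
    return [
        '<link rel="alternate" hreflang="%s" href="%s">'
        % (lang, urlunsplit(("https", domain, path, "", "")))
        for lang, domain in sorted(domains.items())
    ]


if __name__ == "__main__":
    for tag in alternate_links("https://anatom.cz/practice//"):
        print(tag)
```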

marekcierny commented 8 years ago

Thank you, Víťo. Do you use www.google.com/webmasters/tools/ to check for SEO warnings/errors? (I think it's a great tool, especially as we want to add more languages and content in the future.) I've just noticed that when I'm logged in, view-source:https://anatom.cz/ shows the canonical address "https://anatom.cz/overview/". But when I'm logged out, it's correct.

marekcierny commented 8 years ago

I might be too picky, but other potential duplicate content is

4. URL with "/" and without "/" at the end (e.g. https://anatom.cz/practice/A [chapter selected with a tick] and https://anatom.cz/practice/A/ [chapter selected with a click on an arrow])
5. selection of chapters for practice (e.g. https://anatom.cz/practice/09/LE and https://anatom.cz/practice/LE/09 [the second URL is accessible from anatom.cz/view/LE/ via "vybrat podkapitolu" (select a subchapter)])

slaweet commented 8 years ago

view-source:https://anatom.cz/ for logged in users actually redirects to view-source:https://anatom.cz/overview (notice address bar). Hopefully, search engines cannot log in :-)

I use www.google.com/webmasters/tools/ every now and then, and I haven't noticed any SEO warnings or errors there. I've linked Webmaster Tools with GA, so it probably displays the errors in GA as well.

Re 4 and 5: I see the problem. I'll have to think about how to solve it technically.
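One possible direction for item 4, sketched under the assumption that the slash-terminated form is the preferred one (nothing in this thread settles that): always emit the canonical URL with a trailing slash. The function name is hypothetical. Item 5 is not covered here, because it depends on how the route interprets its segments.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Return `url` with its path ending in exactly one added '/' if missing."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Both variants from item 4 map to the same address.
    print(with_trailing_slash("https://anatom.cz/practice/A"))
    print(with_trailing_slash("https://anatom.cz/practice/A/"))
```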

marekcierny commented 8 years ago

Although there is no link to such a page, I'm not sure whether this could be a problem for search engines or users/brand/security: https://anatom.cz/overview/V%C3%ADt%C3%A1%20v%C3%A1s%20blbe%C4%8Dek https://anatom.cz/view/02/V%C3%ADt%C3%A1%20v%C3%A1s%20blbe%C4%8Dek (a random URL parameter is accepted as canonical, and the random text is displayed in the heading)

slaweet commented 8 years ago

Re https://github.com/adaptive-learning/anatomy/issues/19#issuecomment-171920436: Good catch. That URL is actually a link to view knowledge of a user, e.g. https://anatom.cz/overview/slaweet https://anatom.cz/overview/cierny.m

The problem is that we don't check whether the given string is a valid username. If it isn't, the page should return an error.
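The missing check could look roughly like the sketch below. It is framework-agnostic, and the in-memory set is a hypothetical stand-in for a real user lookup; the function name and the choice of 404 as the error status are assumptions too.

```python
from http import HTTPStatus
from urllib.parse import unquote

# Hypothetical stand-in for the user store; a real check would query the database.
KNOWN_USERNAMES = {"slaweet", "cierny.m"}


def overview_status(raw_segment, known=KNOWN_USERNAMES):
    """Return the HTTP status for /overview/<raw_segment>.

    Unknown usernames get 404 instead of a page that echoes the segment.
    """
    username = unquote(raw_segment)  # decode percent-encoded characters
    if username in known:
        return HTTPStatus.OK
    return HTTPStatus.NOT_FOUND


if __name__ == "__main__":
    print(overview_status("slaweet"))
    print(overview_status("V%C3%ADt%C3%A1%20v%C3%A1s%20blbe%C4%8Dek"))
```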

marekcierny commented 8 years ago

Víťo, when I suggested making a separate URL for /overview/?tab=location so that the crawler could see our main content tree, I didn't know that Google can understand AJAX. Now I think it wasn't a good idea from the start, and we might be better off without it. I am sorry for making it complicated.

slaweet commented 8 years ago

Marku, I don't think Google's AJAX crawling scheme is applicable here. Anything we want to appear in search results (like /overview/?tab=location) has to be on a separate URL.

slaweet commented 8 years ago

And FYI, your example with "Vítá vás blbeček" has been indexed by Google, because Google crawled our GitHub :-) FYI no. 2: the problem with SEO in GA was just a reporting issue, caused by the http -> https migration in December. Our impressions moved to the https version of anatom.cz, and those were not listed.

marekcierny commented 8 years ago

First, I am concerned that we have very similar content (and identical ) when a user views an image under different chapters/body parts (e.g. practiceanatomy.com/view/UE/image/casti-lidskeho-telasvg and practiceanatomy.com/view/LE/image/casti-lidskeho-telasvg). Can we change the URL to practiceanatomy.com/view/LE/#image/casti-lidskeho-telasvg or practiceanatomy.com/view/LE/#image/5 ?

Second, I've found a simple SEO guide, and there are several things we do not do yet:

The description tag is the same across the whole site. Can we make it unique for the pages we would like users to land on?
useful (custom) 404 page (suggested text here)
HTML sitemap and #17 XML sitemap.
