investigate why Google Dataset Search is not indexing datasets

andrewsu commented 3 years ago

It appears that of the 39 datasets currently at https://discovery.biothings.io/dataset, Google Dataset Search is only indexing the Wellderly dataset: https://datasetsearch.research.google.com/search?query=site%3Adiscovery.biothings.io

Not sure if it's the only reason, but the Rich Results Test on https://discovery.biothings.io/dataset/da4905854c18028d gives a "Page partially loaded" error...

marcodarko commented 3 years ago

I tried the older Structured Data Testing Tool to check whether the metadata is being loaded: https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fdiscovery.biothings.io%2Fdataset%2Fda4905854c18028d That seems to work, so I'm wondering if the newer testing tool doesn't wait for dynamically embedded content...

andrewsu commented 3 years ago

Interesting, I was under the impression that Rich Results Tool would work better with dynamic content. Regardless, it was my understanding that Rich Results Tool better approximates what their crawler does. (Seemed to be confirmed by this blog post: https://webmasters.googleblog.com/2020/07/rich-results-test-out-of-beta.html...)

marcodarko commented 3 years ago

"It handles dynamically loaded structured data markup more effectively" hmm, I'll do more research on this there's gotta be a similar case somewhere.

andrewsu commented 3 years ago

@marcodarko any update on this ticket? I wonder, for example, if you put the rendered HTML in a static file, would Rich Results parse it properly? That might tell us whether the problem is something in our JSON-LD versus the dynamic loading...
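A rough way to run the static-versus-dynamic comparison locally is to pull the JSON-LD out of both the raw HTML (as served, before any JavaScript runs) and the rendered HTML (saved from the browser). This is only a sketch using the Python standard library; `raw_html` and `rendered_html` are made-up stand-ins for the two versions of the page, and the tag being `<script type="application/ld+json">` is an assumption about how the metadata is embedded.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents can arrive in several chunks, so buffer them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return a list with one parsed JSON value per JSON-LD script tag."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical stand-in for the page as served (metadata injected later by JS).
raw_html = "<html><head><title>Dataset</title></head><body><div id='app'></div></body></html>"

# Hypothetical stand-in for the page after rendering in a browser.
rendered_html = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "http://schema.org/", "@type": "Dataset", "name": "Example"}'
    "</script></head><body></body></html>"
)

print(len(extract_jsonld(raw_html)))              # 0: nothing for a non-rendering client
print(extract_jsonld(rendered_html)[0]["@type"])  # Dataset
```

If the raw HTML yields nothing while the rendered HTML yields valid JSON-LD, the metadata itself is probably fine and the question becomes whether the crawler finishes rendering the page.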

marcodarko commented 3 years ago

@andrewsu yes, so one possibility is that the page is requesting a lot of resources and this tool doesn't wait for them, so you get the "Page partially loaded" error and ultimately the tool doesn't detect any metadata. I'm gonna revisit this page, offload as much stuff as possible, and make it as light as possible. Google does have a URL Inspection tool and a PageSpeed Insights tool, so I'm gonna monitor their output and try again. I don't think our site is bloated with resources, but it's worth a shot, and from reading similar issues that seems to be the cause in those cases.

marcodarko commented 3 years ago

Just an update, not necessarily good news:

Hmm, so I made the page as light as possible, removed all the loading issues we had previously, and improved the page performance, and I still get the same error.

One weird thing I found is that this dataset, https://discovery.biothings.io/dataset/83dc3401f86819de, works for rich results but none of the others do. I compared the outbreak ones to this one and noticed the @type for Wellderly had no prefix, so I removed the prefix from all of them and tested again, but that was not the issue. It would be weird for that to be the problem, but I wanted to try it anyway.

So basically, still looking into why these are not showing as having rich results. Now that the page is clear I hope I can find the reason faster...

marcodarko commented 3 years ago

@andrewsu @newgene Ok, so turns out the rich results tool doesn't like:

- prefixes on @types
- anything other than: "@context": "http://schema.org/"

I think that means I may have to modify both the @type and the @context? Obviously that wouldn't be 100% true to its origins, so that's the con.
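For illustration, the modification could look roughly like the sketch below: replace whatever @context is there with the plain schema.org one and strip CURIE-style prefixes from every @type. The `outbreak:` prefix, its namespace URL, and the sample record are made up for the example, and the function names are hypothetical, not anything in the app.

```python
SCHEMA_ORG_CONTEXT = "http://schema.org/"


def strip_prefix(type_value):
    """Drop a CURIE-style prefix, e.g. 'outbreak:Dataset' -> 'Dataset'."""
    if isinstance(type_value, list):
        return [strip_prefix(t) for t in type_value]
    # Leave full IRIs such as 'http://schema.org/Dataset' untouched.
    if isinstance(type_value, str) and "://" not in type_value:
        return type_value.split(":", 1)[-1]
    return type_value


def _walk(value):
    """Rebuild the structure, dropping @context and unprefixing @type."""
    if isinstance(value, dict):
        return {
            key: strip_prefix(val) if key == "@type" else _walk(val)
            for key, val in value.items()
            if key != "@context"
        }
    if isinstance(value, list):
        return [_walk(item) for item in value]
    return value


def normalize_for_rich_results(node):
    """Return a new dict with a plain schema.org @context and unprefixed types."""
    result = {"@context": SCHEMA_ORG_CONTEXT}
    result.update(_walk(node))
    return result


# Hypothetical record; the prefix and namespace URL are assumptions.
original = {
    "@context": {
        "schema": "http://schema.org/",
        "outbreak": "https://example.org/outbreak/",
    },
    "@type": "outbreak:Dataset",
    "name": "Example dataset",
    "author": [{"@type": "schema:Person", "name": "Jane Doe"}],
}

normalized = normalize_for_rich_results(original)
print(normalized["@type"])               # Dataset
print(normalized["author"][0]["@type"])  # Person
```

The sketch only touches @type values, not prefixed property names, and it leaves the original record unchanged, so the prefixed version can still be kept for anything that needs to know where the type came from.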

I compared the same metadata above and that confirms the test I did. I'll also include the screenshots below:

Passing Wellderly dataset

Failing Outbreak dataset

Passing Modified Outbreak Dataset

I used the same tool, but pasted in the script tag containing the JSON-LD and selected the 'code' option instead of 'url' so that I could modify it and re-run.

marcodarko commented 3 years ago

To confirm that it expects the context to be schema.org:

marcodarko commented 3 years ago

After fixing all loading issues and making all datasets appear to be schema:Dataset-derived, we submitted the dataset sitemap for indexing on Oct 28th, and it was marked as successful with 44 discovered URLs on the same day.

However, it appears that those 44 had already been indexed before, so they don't show up as being indexed recently (Oct 28). With no changes detected, they may have been skipped??

andrewsu commented 3 years ago

I'll just note that when I created this ticket, Dataset Search had one dataset, yesterday it had three, and today it has four. So trending in the right direction... 🤞 https://datasetsearch.research.google.com/search?query=site%3Adiscovery.biothings.io

andrewsu commented 3 years ago

... and back down to two indexed datasets... :(

andrewsu commented 3 years ago

Overall, we're still stuck at ~4 indexed datasets~ (EDIT 2021-01-19) 10 indexed datasets (https://datasetsearch.research.google.com/search?query=site%3Adiscovery.biothings.io) out of 58 currently available. Here's a deep dive on one particular dataset (https://discovery.biothings.io/dataset/da4905854c18028d), the one mentioned in the first comment in this issue.

Observation 1: this dataset is not indexed in Google Dataset Search (https://datasetsearch.research.google.com/search?query=site%3Adiscovery.biothings.io&docid=s9UmBdKO4VvS4gDUAAAAAA%3D%3D)
Observation 2: this URL is successfully crawled by the Google crawler
Observation 3: The rich results tool has a problem parsing our structured metadata, likely due to the dynamic injection of metadata? See the vuex.js loading error in the screenshot below, and note that it's different from the schema-related issues that @marcodarko posted above. Bottom line, I think something is still going on with Google's handling of our dynamic content...
Observation 4: Plugging the dynamically-rendered HTML into the Rich Results tool led to two warnings and one error. Focusing on the error, I don't quite get the error message. It says the citation shouldn't be of type ScholarlyArticle, but https://schema.org/citation says it can be either a CreativeWork or Text, and https://schema.org/ScholarlyArticle is clearly a subclass of CreativeWork.
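
To spell out the reasoning in Observation 4, here is a small sketch of the subclass check using a hand-copied excerpt of the schema.org hierarchy. It only models what schema.org itself says; it is an assumption that the Rich Results tool applies the same rule, and the error suggests it may accept a narrower set of types. The `PARENT` table and function names are made up for the example.

```python
# Hand-copied excerpt of the schema.org type hierarchy (child -> parent).
PARENT = {
    "ScholarlyArticle": "Article",
    "Article": "CreativeWork",
    "Dataset": "CreativeWork",
    "CreativeWork": "Thing",
}

# Expected value types for `citation`, per https://schema.org/citation.
CITATION_RANGE = {"CreativeWork", "Text"}


def is_subtype(type_name, ancestor):
    """Walk up the hierarchy from type_name, looking for ancestor."""
    current = type_name
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def allowed_as_citation(type_name):
    """True if type_name is, or inherits from, one of the expected types."""
    return any(is_subtype(type_name, expected) for expected in CITATION_RANGE)


print(allowed_as_citation("ScholarlyArticle"))  # True
print(allowed_as_citation("Person"))            # False (not in this excerpt)
```

By schema.org's own definitions a ScholarlyArticle citation passes, which is why the error message is surprising.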

marcodarko commented 3 years ago

Small update: I updated that page and removed all libraries that could be replaced with vanilla JS, e.g. Vuex and Axios. That removes the loading issues, but I still get the other warnings and the error for @types from Observation 4.

biothings / discovery-app

investigate why Google Dataset Search is not indexing datasets #20