commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Use json-ld for document description #38

Open Tpt opened 8 years ago

Tpt commented 8 years ago

I'm not sure if it is relevant, but it may be interesting to use JSON-LD [1] for document description, maybe even with the schema.org [2] vocabulary.

For example, https://en.wikipedia.org/wiki/Jean-François_Champollion could be represented by (in an amazing future when Common Search is able to guess page topics):

{
  "@context": "http://schema.org",
  "@type": "WebPage",
  "@id": "https://en.wikipedia.org/wiki/Jean-François_Champollion",
  "name": "Jean-François Champollion - Wikipedia, the free encyclopedia",
  "description": "Jean-François Champollion (a.k.a. Champollion le jeune; 23 December 1790 – 4 March 1832) was a French scholar, philologist and orientalist",
  "inLanguage": "en",
  "image": "//upload.wikimedia.org/wikipedia/commons/thumb/6/6c/Leon_Cogniet_-_Jean-Francois_Champollion.jpg/220px-Leon_Cogniet_-_Jean-Francois_Champollion.jpg",
  "dateModified": "2016-03-01T14:47:00",
  "fileFormat": "text/html",
  "mainEntity":{
    "@type": "Person",
    "@id": "http://www.wikidata.org/entity/Q260",
    "name": "Jean-François Champollion"
  }
}

[1] http://json-ld.org
[2] http://schema.org

sylvinus commented 8 years ago

Thanks @Tpt!

Where in the pipeline do you think such a description would make sense? As output of an API?

Tpt commented 8 years ago

Yes, it would be very nice as API output, but it could also be used as the document format inside Elasticsearch.

sylvinus commented 8 years ago

Ok, let's keep it in mind for the API!

For Elasticsearch, I don't think there would be any benefit. We should optimize for Elasticsearch storage, and that will look very different from JSON-LD.
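To illustrate the split discussed above (compact storage in the index, JSON-LD only at the API boundary), here is a minimal sketch in Python. The flat record and all of its field names (`url`, `title`, `summary`, `lang`, `modified`, `mime`, `entity_id`, `entity_type`, `entity_name`) are hypothetical and are not taken from the actual cosr-back index schema; only the schema.org property names come from the example earlier in this thread.

```python
import json

# Hypothetical mapping from flat storage field names to schema.org properties.
OPTIONAL_FIELDS = {
    "summary": "description",
    "lang": "inLanguage",
    "modified": "dateModified",
    "mime": "fileFormat",
}


def to_jsonld(record):
    """Convert a flat index record (assumed field names) to a schema.org WebPage."""
    doc = {
        "@context": "http://schema.org",
        "@type": "WebPage",
        "@id": record["url"],
        "name": record["title"],
    }
    # Copy optional fields only when the record actually has a value for them.
    for source_key, jsonld_key in OPTIONAL_FIELDS.items():
        if record.get(source_key):
            doc[jsonld_key] = record[source_key]
    # Attach the main entity only if a topic was identified for the page.
    if record.get("entity_id"):
        entity = {
            "@type": record.get("entity_type", "Thing"),
            "@id": record["entity_id"],
        }
        if record.get("entity_name"):
            entity["name"] = record["entity_name"]
        doc["mainEntity"] = entity
    return doc


if __name__ == "__main__":
    record = {
        "url": "https://en.wikipedia.org/wiki/Jean-François_Champollion",
        "title": "Jean-François Champollion - Wikipedia, the free encyclopedia",
        "lang": "en",
        "mime": "text/html",
        "entity_id": "http://www.wikidata.org/entity/Q260",
        "entity_type": "Person",
        "entity_name": "Jean-François Champollion",
    }
    print(json.dumps(to_jsonld(record), ensure_ascii=False, indent=2))
```

With this kind of conversion, the stored document can stay as small as Elasticsearch needs it to be, and the JSON-LD shape is only a presentation concern of the API.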

dalf commented 7 years ago

There is this Google tool, which extracts more than JSON-LD: https://search.google.com/structured-data/testing-tool

For example: https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fwww.tripadvisor.fr%2FAttraction_Review-g187103-d4744042-Reviews-Centre_Historique_de_Rennes-Rennes_Ille_et_Vilaine_Brittany.html

https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Flinuxfr.org%2Fnews%2Fse-passer-de-google-facebook-et-autres-big-brothers-2-0-1-les-moteurs-de-recherche

Some documentation about it: https://developers.google.com/search/docs/guides/enhance-site
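For the JSON-LD part only, pages usually embed it in `<script type="application/ld+json">` elements, so a crawler could pick it up without an external tool. The sketch below uses only the Python standard library; the class and function names (`JsonLdExtractor`, `extract_jsonld`) are made up for illustration, and it does not cover the other formats the Google tool reads, such as Microdata or RDFa.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The attribute value can be None (e.g. <script type>), hence the "or".
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except ValueError:
                # Skip blocks that are not valid JSON instead of failing the page.
                pass


def extract_jsonld(html):
    """Return a list of JSON-LD values found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    page = """
    <html><head>
      <script type="application/ld+json">
        {"@context": "http://schema.org", "@type": "WebPage", "name": "Demo"}
      </script>
    </head><body></body></html>
    """
    for item in extract_jsonld(page):
        print(item["@type"], item.get("name"))
```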