Add Schema.org or Croissant metadata to header of Dataset view page

ekraffmiller commented 5 months ago

Currently the JSF Dataset page has schema.org info embedded in the header, which in the future may be replaced with Croissant. The SPA version of the page has to replicate this. Here is what it looks like in the JSF Header:

<script type="application/ld+json">{"@context":"http://schema.org","@type":"Dataset","@id":"https://doi.org/10.5072/FK2/SCYB0O","identifier":"https://doi.org/10.5072/FK2/SCYB0O","name":"Testing embargo","creator":[{"@type":"Person","givenName":"Guillermo","familyName":"Portas","name":"Portas, Guillermo"}],"author":[{"@type":"Person","givenName":"Guillermo","familyName":"Portas","name":"Portas, Guillermo"}],"datePublished":"2024-03-14","dateModified":"2024-03-14","version":"1","description":"test","keywords":["Business and Management"],"license":"http://creativecommons.org/publicdomain/zero/1.0","includedInDataCatalog":{"@type":"DataCatalog","name":"Root","url":"https://beta.dataverse.org"},"publisher":{"@type":"Organization","name":"Root"},"provider":{"@type":"Organization","name":"Root"},"distribution":[{"@type":"DataDownload","name":"dataverse_files (2).zip","encodingFormat":"application/zip","contentSize":4540,"contentUrl":"https://beta.dataverse.org/api/access/datafile/26133"},{"@type":"DataDownload","name":"FilesIT.java","encodingFormat":"text/x-java-source","contentSize":154657,"contentUrl":"https://beta.dataverse.org/api/access/datafile/26132"}]}</script>

The Dataverse API exposes this metadata through the dataset export endpoint.

For Schema.org: https://beta.dataverse.org/api/datasets/export?exporter=schema.org&persistentId=doi:10.5072/FK2/SCYB0O

For Croissant: https://beta.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi:10.5072/FK2/SCYB0O
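A minimal sketch of how the SPA might build and call these export URLs. The function names `buildExportUrl` and `fetchDatasetJsonLd` are hypothetical and not part of the frontend codebase; only the endpoint path and the two exporter names come from the URLs above.

```typescript
// Exporter names as they appear in the export URLs in this thread.
type Exporter = 'schema.org' | 'croissant'

// Build the export URL for a dataset (hypothetical helper).
// `baseUrl` is the root of the Dataverse installation, e.g. https://beta.dataverse.org
function buildExportUrl(baseUrl: string, exporter: Exporter, persistentId: string): string {
  const url = new URL('/api/datasets/export', baseUrl)
  url.searchParams.set('exporter', exporter)
  url.searchParams.set('persistentId', persistentId)
  return url.toString()
}

// Fetch the exported metadata as text, ready to be placed in a JSON-LD script tag.
async function fetchDatasetJsonLd(
  baseUrl: string,
  exporter: Exporter,
  persistentId: string
): Promise<string> {
  const response = await fetch(buildExportUrl(baseUrl, exporter, persistentId))
  if (!response.ok) {
    throw new Error(`Export request failed with status ${response.status}`)
  }
  return response.text()
}
```

Note that `searchParams.set` percent-encodes the colon and slashes in the DOI, so the generated URL will not match the hand-written ones character for character, although it carries the same parameter values.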

To test rich results, use Google's Rich Results Test.

pdurbin commented 4 months ago

One concern we have is what to do when the schema.org or croissant files get large, such as 7 MB for a dataset with 25k files. These issues are related:

Also, in JSF we show the schema.org version unless the croissant jar file is present:

https://github.com/IQSS/dataverse/pull/10382
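The same preference could be mirrored in the SPA roughly as follows. This is only a sketch: `pickExporter` is a hypothetical helper, and how the SPA would learn which exporters an installation offers is not settled in this thread, so the list is simply passed in.

```typescript
// Prefer Croissant when the installation offers it, otherwise fall back to schema.org.
// `availableExporters` is an assumed input; obtaining it is outside this sketch.
function pickExporter(availableExporters: readonly string[]): 'croissant' | 'schema.org' {
  return availableExporters.includes('croissant') ? 'croissant' : 'schema.org'
}
```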

I wrote some docs about this in an (open) pull request:

g-saracca commented 3 months ago

For a quick proof of concept, it would be ideal to insert the expected script (hardcoded, with type="application/ld+json") into the head of the single index.html that serves the SPA. We could do this from the home (Collection) page, in a useEffect that runs only once. That simulates how the script would really be inserted into the head once the SPA JavaScript has loaded, and lets us confirm with Google's Rich Results Test whether the script is detected.

As a second step, once we know that the script is detected, we should read the persistentId from the URL of the Dataset page, fetch the endpoint mentioned above with that persistentId, and insert the result into a script of type "application/ld+json" in the head of the HTML. When the user navigates away from the page, the cleanup function returned by the useEffect, which runs when this component/page is unmounted, should delete the script. (This applies only if we are not on a mobile device, which for now could be detected very simply through the screen width.)

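useEffect(() => {
  let script = null
  let cancelled = false
  // fetchToLoadScript is a placeholder that resolves to the JSON-LD string
  fetchToLoadScript().then((content) => {
    if (cancelled) return
    script = document.createElement('script')
    script.type = 'application/ld+json'
    script.text = content
    document.head.appendChild(script)
  })

  return () => {
    // Remove the script from the head when the component unmounts
    cancelled = true
    script?.remove()
  }
}, [])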
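The two checks described above (reading the persistentId from the page URL and skipping mobile devices) might look like the sketch below. Both helpers are hypothetical. It assumes the SPA Dataset URL carries the identifier as a `persistentId` query parameter, and the 768 px breakpoint is an arbitrary illustrative value, since the thread only says "screen width".

```typescript
// Assumed breakpoint; the thread does not specify a value.
const MOBILE_MAX_WIDTH_PX = 768

// Read the persistentId query parameter from a Dataset page URL, or null if absent.
function getPersistentIdFromUrl(pageUrl: string): string | null {
  return new URL(pageUrl).searchParams.get('persistentId')
}

// Decide whether to inject the JSON-LD script based on screen width alone.
function shouldInjectJsonLd(screenWidthPx: number): boolean {
  return screenWidthPx > MOBILE_MAX_WIDTH_PX
}
```

In the browser these would be called with `window.location.href` and `window.innerWidth` before starting the fetch inside the effect.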

ekraffmiller commented 2 months ago

beta.dataverse.org has been updated with a robots.txt that allows all, so https://beta.dataverse.org is now being crawled successfully, but individual dataset pages are not being indexed by Google. See this page for the Rich Results test: https://search.google.com/test/rich-results/result?id=XS1bhHFD7CEtXP5vHMIxog. Putting it back in This Sprint for further investigation, since it's a lower priority for Q2.

g-saracca commented 2 months ago

Moving it to the backlog due to a problem with the server configuration for SPA redirection. Currently, when a SPA URL other than the main /spa/ is requested directly, the server returns the index.html document but with a 404 status. This is because the web.xml located in the frontend repo under deployments/payara/ treats URLs that don't correspond to an actual file or folder as an error page and returns index.html with a 404 Not Found status, which makes those pages not crawlable.

  <error-page>
    <error-code>404</error-code>
    <location>/index.html</location>
  </error-page>

This problem must be solved in order to return to this issue.

cmbz commented 1 month ago

2024/07/10

Removing the On Hold status and moving back to SPA classification

IQSS / dataverse-frontend

Add Schema.org or Croissant metadata to header of Dataset view page #350