IQSS / dataverse

Open source research data repository software
http://dataverse.org

Html "&quot;" in TOU breaks SchemaDotOrg export; and causes a 500 in dataset page. #8224

Open landreev opened 3 years ago

landreev commented 3 years ago

There are 2 issues in one:

  1. (simple, but important) The method getJsonLd() in DatasetPage.java needs to catch exceptions from the export, so as NOT to kill the page with a 500 when it fails, for whatever reason.
  2. The specific issue of an html entity &quot; resulting in a failure to export and cache the jsonLd/SchemaDotOrg format. (Will add the details in the next comment, to keep the description compact).
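For the first point, a minimal sketch of such a guard. This is not the actual DatasetPage code: the class JsonLdGuard and the method safeJsonLd are hypothetical names, and the export is passed in as a Supplier only to keep the example self-contained. The idea is that any failure in the export is logged and the page gets an empty string instead of an exception.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class JsonLdGuard {
    private static final Logger logger = Logger.getLogger(JsonLdGuard.class.getName());

    /**
     * Runs the (hypothetical) export and returns its output.
     * If the export throws or returns null, logs the problem and returns
     * an empty string so that the page can still render.
     */
    public static String safeJsonLd(Supplier<String> exporter) {
        try {
            String jsonLd = exporter.get();
            return jsonLd != null ? jsonLd : "";
        } catch (RuntimeException ex) {
            logger.log(Level.WARNING, "JSON-LD export failed; omitting it from the page", ex);
            return "";
        }
    }
}
```

Whether an empty string, a cached older export, or no header metadata at all is the right fallback is a separate decision; the sketch only shows that the failure stays contained.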

To reproduce - screenshot from @pdurbin:

[Image: Screen Shot 2021-11-08 at 11 55 31 AM]

landreev commented 3 years ago

Where the export error happens:

SchemaDotOrgExporter.java calls version.getJsonLd(); to produce the json string. That part works. However, before caching this output in a file, exportDataset() attempts to parse the string - to validate it, presumably? - and that's where it fails.

What happens at the end of version.getJsonLd() is

jsonLd = job.build().toString();

//Most fields above should be stripped/sanitized but, since this is output in the dataset page as header metadata, do a final sanitize step to make sure
jsonLd = MarkupChecker.stripAllTags(jsonLd);

return jsonLd;

MarkupChecker.stripAllTags() uses jsoup methods to sanitize the result, and that's where all the &quot; entities are turned into unescaped double quotes, thus invalidating the json.

A quick fix would be to add a regex to change &quot; into escaped double quotes (\"), before calling stripAllTags().
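A rough sketch of that quick fix, under stated assumptions: decodeQuotEntities below is a hypothetical stand-in that only mimics the entity decoding described above (it does not call jsoup or the real MarkupChecker), and escapeQuotEntities is an illustrative name. A literal String.replace is used instead of a regex, since the entity is a fixed string.

```java
public class QuotEntityFix {
    /** Hypothetical stand-in for the sanitize step: only mimics the decoding of &quot; entities. */
    public static String decodeQuotEntities(String s) {
        return s.replace("&quot;", "\"");
    }

    /** Quick fix: turn each &quot; entity into a JSON-escaped quote (\") before sanitizing. */
    public static String escapeQuotEntities(String json) {
        return json.replace("&quot;", "\\\"");
    }

    public static void main(String[] args) {
        // A JSON string whose value contains the html entity, as the builder would emit it.
        String jsonLd = "{\"description\":\"Put &quot;ditto&quot; marks around it.\"}";

        // Without the fix, the decoded quotes end the JSON string early:
        // {"description":"Put "ditto" marks around it."}
        System.out.println(decodeQuotEntities(jsonLd));

        // With the fix, the quotes stay escaped and the JSON stays valid:
        // {"description":"Put \"ditto\" marks around it."}
        System.out.println(decodeQuotEntities(escapeQuotEntities(jsonLd)));
    }
}
```

This only covers the &quot; entity; other ways of encoding a double quote that the sanitizer might decode would need the same treatment.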

Calling stripAllTags() on the generated json string is inherently problematic though. A cleaner way would be to apply it to the individual fields used to build the json.
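To illustrate the per-field approach, here is a self-contained sketch. Everything in it is an assumption made for the example: stripAllTags is a simplified stand-in (a regex for tags plus &quot; decoding, not the jsoup-based MarkupChecker), and jsonEscape and buildJson are hand-rolled helpers standing in for a real JSON builder, which would do the escaping itself. The point is the ordering: sanitize each value first, escape it second, so quotes produced by the sanitizer can no longer break the JSON.

```java
import java.util.Map;
import java.util.StringJoiner;

public class PerFieldSanitize {
    /** Simplified stand-in for the sanitizer: drops tags and decodes &quot;. */
    public static String stripAllTags(String s) {
        return s.replaceAll("<[^>]*>", "").replace("&quot;", "\"");
    }

    /** Minimal JSON string escaping (quote, backslash, control characters). */
    public static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Sanitizes each field value first, then escapes it while assembling the JSON object. */
    public static String buildJson(Map<String, String> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String clean = stripAllTags(field.getValue());
            json.add("\"" + jsonEscape(field.getKey()) + "\":\"" + jsonEscape(clean) + "\"");
        }
        return json.toString();
    }
}
```

For example, a description of `<b>Put</b> &quot;ditto&quot;` becomes `{"description":"Put \"ditto\""}`, with no sanitizing pass over the finished JSON string.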

From @pdurbin:

I think what doesn’t quite sit right with me is that we have this line: jsonLd = job.build().toString(); … which should always create properly escaped JSON. But then later we do this: jsonLd = MarkupChecker.stripAllTags(jsonLd); … which has the potential to munge the JSON until it’s invalid.

But, seeing how this is the first issue of this kind we've run into - is it worth it? - tbd.

pdurbin commented 2 years ago

Highly related, if not a duplicate:

landreev commented 1 year ago

This struck again, and it cost us some time talking about it on Slack. I'll push to have this prioritized in the next sprint and make a quick PR fixing it. Note that it doesn't need to be in TOU; just insert &quot; into the description, or any other field, to reproduce.

jggautier commented 1 year ago

In Harvard Dataverse, v5.13, I was able to publish a dataset that had the text 'Put "ditto" marks around it.' in several metadata fields (title, description, notes, and terms of use). The dataset was published and I was able to view the Schema.org export through the UI.

Does that mean this issue was fixed? I'm not sure if the first of the two issues that @landreev wrote about, about the method getJsonLd() in DatasetPage.java needing to catch an exception from the export, has been resolved.