Open landreev opened 3 years ago
Where the export error happens:
`SchemaDotOrgExporter.java` calls `version.getJsonLd();` to produce the JSON string. That part works. However, before caching this output in a file, `exportDataset()` attempts to parse the string - to validate it, presumably? - and that's where it fails.
What happens at the end of `version.getJsonLd()` is:
jsonLd = job.build().toString();
//Most fields above should be stripped/sanitized but, since this is output in the dataset page as header metadata, do a final sanitize step to make sure
jsonLd = MarkupChecker.stripAllTags(jsonLd);
return jsonLd;
`MarkupChecker.stripAllTags()` uses jsoup methods to sanitize the result, and that's where all the `&quot;` entities are turned into unescaped double quotes, thus invalidating the JSON.
A quick fix would be to add a regex to change `&quot;` into escaped double quotes (`\"`) before calling `stripAllTags()`.
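The quick fix can be sketched roughly as follows. This is a minimal, self-contained illustration, not the Dataverse code: `stripAllTagsStandIn()` is a hypothetical stand-in that only mimics the entity decoding described above (the real `MarkupChecker.stripAllTags()` is jsoup-based), and `escapeQuotEntities()` is an assumed helper name. A literal `String.replace` is used here because the pattern contains no regex metacharacters.

```java
class QuotEntityQuickFix {

    // Hypothetical stand-in for MarkupChecker.stripAllTags(): it only decodes
    // &quot; into a double quote, which is the behavior that breaks the JSON.
    static String stripAllTagsStandIn(String s) {
        return s.replace("&quot;", "\"");
    }

    // Assumed helper: turn each &quot; entity into a JSON-escaped quote
    // (backslash + quote) so nothing is left for the sanitizer to decode.
    static String escapeQuotEntities(String json) {
        return json.replace("&quot;", "\\\"");
    }

    public static void main(String[] args) {
        String raw = "{\"description\":\"Put &quot;ditto&quot; marks around it.\"}";

        // Without the fix the quotes come out unescaped and the JSON is invalid:
        // {"description":"Put "ditto" marks around it."}
        System.out.println(stripAllTagsStandIn(raw));

        // With the fix applied first, the output stays parseable:
        // {"description":"Put \"ditto\" marks around it."}
        System.out.println(stripAllTagsStandIn(escapeQuotEntities(raw)));
    }
}
```

Whether the escaped form survives the real jsoup-based sanitizer unchanged would still need to be checked against the actual method.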
Calling `stripAllTags()` on the generated JSON string is inherently problematic, though. A cleaner way would be to apply it to the individual fields used to cook the JSON.
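A sketch of the per-field approach, under stated assumptions: `stripAllTagsStandIn()` is again a hypothetical stand-in for the jsoup-based `MarkupChecker.stripAllTags()`, and the small `jsonString()` escaper stands in for whatever JSON builder produces the output in Dataverse. The point is only the ordering: sanitize each value first, serialize afterwards, and never pass the finished JSON through the sanitizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PerFieldSanitizing {

    // Hypothetical stand-in for MarkupChecker.stripAllTags(): drops anything
    // that looks like a tag and decodes &quot; into a double quote.
    static String stripAllTagsStandIn(String s) {
        return s.replaceAll("<[^>]*>", "").replace("&quot;", "\"");
    }

    // Minimal JSON string escaping, standing in for what a JSON builder does.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Sanitize every value, then serialize. Escaping happens last, so quotes
    // produced by the sanitizer are escaped like any other quote.
    static String buildJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append(jsonString(e.getKey())).append(':')
              .append(jsonString(stripAllTagsStandIn(e.getValue())));
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "<b>My dataset</b>");
        fields.put("description", "Put &quot;ditto&quot; marks around it.");
        // prints {"name":"My dataset","description":"Put \"ditto\" marks around it."}
        System.out.println(buildJson(fields));
    }
}
```

With this ordering no special handling of `&quot;` is needed, at the cost of touching every place where a field value is added to the JSON.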
From @pdurbin:
I think what doesn’t quite sit right with me is that we have this line: jsonLd = job.build().toString(); … which should always create properly escaped JSON. But then later we do this: jsonLd = MarkupChecker.stripAllTags(jsonLd); … which has the potential to munge the JSON until it’s invalid.
But, seeing how this is the first issue of this kind we've run into - is it worth it? - tbd.
Highly related, if not a duplicate:
This struck again, and it cost us some time talking about it on Slack.
I'll push to have this prioritized in the next sprint and make a quick PR fixing it.
Note that it doesn't need to be in TOU; just insert `&quot;` into the description, or any other field, to reproduce.
In Harvard Dataverse, v5.13, I was able to publish a dataset that had `Put "ditto" marks around it.` in several metadata fields (title, description, notes, and terms of use). The dataset was published and I was able to view the Schema.org export through the UI.
Does that mean this issue was fixed? I'm not sure if the first of the two issues that @landreev wrote about, about the method getJsonLd() in DatasetPage.java needing to catch an exception from the export, has been resolved.
There are 2 issues in one:
`&quot;` resulting in a failure to export and cache the jsonLd/SchemaDotOrg format. (Will add the details in the next comment, to keep the description compact.)
To reproduce - screenshot from @pdurbin: