MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Bioschemas JSON-LD not properly encoded/escaped #316

Status: Closed (closed by sneumann 1 year ago)

sneumann commented 2 years ago

Hi, we got a report from @AlasdairGray that:

In preparation for the BioHackathon next week, we have been harvesting data from as many sites as possible. Whilst harvesting the pages from MassBank, we found 10,326 pages with invalid JSON-LD on them. From the page that I inspected, this was due to the use of quotation marks within a text field with the quotation mark not being properly encoded. For example, you can see the error at the following link to the Schema.org syntax validator https://validator.schema.org/#url=https%3A%2F%2Fmassbank.eu%2FMassBank%2FRecordDisplay%3Fid%3DMSJ00172

A fix probably requires proper encoding of strings in https://github.com/MassBank/MassBank-web/blob/12f119605ac63d1574bffa4b670a46ce9b9ab564/MassBank-Project/MassBank-lib/src/main/java/massbank/Record.java#L663

Yours, Steffen

sneumann commented 2 years ago

A lightweight option could be the approach in https://stackoverflow.com/a/22756976/2974851, but note the comment there about slashes (we have both URLs and InChIs containing slashes): https://stackoverflow.com/questions/3020094/how-should-i-escape-strings-in-json/11610833#comment86039244_22756976
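For illustration, a dependency-free escaper along those lines might look like the sketch below. It is not the code from the linked answer, nor what ended up in Record.java; the class name `JsonStrings` and the method `quote` are made up for this example. It escapes only what the JSON grammar requires (the double quote, the backslash, and control characters below U+0020) and leaves `/` untouched, since JSON permits but does not require escaping it.

```java
public final class JsonStrings {
    private JsonStrings() {}

    /**
     * Hypothetical helper: returns s wrapped in double quotes with all
     * characters escaped that JSON requires to be escaped inside a string.
     */
    public static String quote(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 2);
        sb.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;  // quotation mark
                case '\\': sb.append("\\\\"); break;  // backslash, e.g. in SMILES
                case '\b': sb.append("\\b"); break;
                case '\f': sb.append("\\f"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters: four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        // Everything else, including '/', is copied as-is.
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }
}
```

With this sketch, `JsonStrings.quote("say \"hi\"")` returns `"say \"hi\""`, and a URL such as `https://massbank.eu/` passes through with its slashes unchanged. One caveat to keep in mind: because the JSON-LD sits inside an HTML `<script>` element, a value containing `</script` would still need separate handling.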

tsufz commented 2 years ago

Ah, good to know. I checked the crawlers, and they complain as well:

Parsing error: Missing ',' or '}'

Example: [image: screenshot of the crawler report]

tsufz commented 2 years ago

The second error is a bad escape sequence in the SMILES string:

[image: screenshot of the error]
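To make the second error concrete: SMILES uses `\` to mark bond direction, and in JSON a backslash must be followed by one of `"`, `\`, `/`, `b`, `f`, `n`, `r`, `t`, or `u` plus four hex digits. A SMILES string copied verbatim into a JSON string can therefore produce something like `\C`, which a parser rejects. The following is a rough sketch of a check for that, not part of MassBank; the class `EscapeCheck` and the method `firstBadEscape` are hypothetical names.

```java
public final class EscapeCheck {
    private EscapeCheck() {}

    private static boolean isHex(char ch) {
        return (ch >= '0' && ch <= '9')
            || (ch >= 'a' && ch <= 'f')
            || (ch >= 'A' && ch <= 'F');
    }

    /**
     * Hypothetical helper: scans the body of a JSON string (the text between
     * the surrounding quotes) and returns the index of the first backslash
     * that does not start a valid JSON escape, or -1 if there is none.
     */
    public static int firstBadEscape(String body) {
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) != '\\') {
                continue;
            }
            if (i + 1 >= body.length()) {
                return i; // trailing backslash with nothing after it
            }
            char next = body.charAt(i + 1);
            if ("\"\\/bfnrt".indexOf(next) >= 0) {
                i++; // skip the escaped character
            } else if (next == 'u'
                    && i + 6 <= body.length()
                    && isHex(body.charAt(i + 2))
                    && isHex(body.charAt(i + 3))
                    && isHex(body.charAt(i + 4))
                    && isHex(body.charAt(i + 5))) {
                i += 5; // skip 'u' and the four hex digits
            } else {
                return i;
            }
        }
        return -1;
    }
}
```

For the raw text `C/C=C\C` this returns 5, the position of the backslash; once the backslash is doubled (`C/C=C\\C`), it returns -1.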

sneumann commented 1 year ago

With the proper serialisation this can now be closed. Thanks Rene, yours, Steffen