Portal site-specific styling being passed through to JKAN site

JackGilmore commented 2 years ago

Describe the bug A clear and concise description of what the bug is.

To Reproduce Our dataset pipelines take raw HTML from the descriptions of some datasets, which means the descriptions can be littered with tags that break the styling when rendered on opendata.scot (e.g. <h1>, or any tag with a style property). This can produce unexpected results, such as oversized text from header tags.

Expected behavior Some of these styles or tags could be simplified (e.g. we could convert all header tags to plain bold, underlined text).
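As a rough illustration of that simplification, here is a minimal sketch using only Python's standard-library `html.parser`. The names `DescriptionSimplifier` and `simplify_description` are hypothetical and not part of the existing pipeline, and wrapping the replaced heading in a `<p>` is an assumption made to keep it on its own line. It only addresses styling; it is not a sanitizer and does nothing about `<script>` tags.

```python
from html import escape
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class DescriptionSimplifier(HTMLParser):
    """Hypothetical helper: headings become bold+underline, style attributes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    @staticmethod
    def _render_attrs(attrs):
        # Keep every attribute except inline styles.
        kept = [(k, v) for k, v in attrs if k != "style"]
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.parts.append("<p><strong><u>")
        else:
            self.parts.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through as written.
        self.parts.append(f"<{tag}{self._render_attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.parts.append("</u></strong></p>")
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.parts.append(escape(data, quote=False))


def simplify_description(html):
    parser = DescriptionSimplifier()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

For example, `simplify_description('<h1 style="color:red">Title</h1>')` would yield a bold, underlined "Title" with no inline style.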

Screenshots Example from https://opendata.scot/datasets/dundee+city+council-housing+available+now/

Hardware and software used N/A

Additional context Whilst unlikely to happen, I have concerns that this could leave us vulnerable to XSS (cross-site scripting) attacks if we ended up loading JavaScript from <script> tags in the descriptions of datasets we pull from other websites. See this relevant article, where someone registered an XSS attack payload as a company name on Companies House, which had the knock-on effect of XSSing websites that consumed data from the Companies House API: https://www.theregister.com/2020/10/30/companies_house_xss_silliness/

nutcracker22 commented 2 years ago

@JackGilmore: what do you think of implementing the code from the top answer to the following Stack Overflow post: Strip HTML from strings in Python? It seems to remove the XSS vulnerability, but it would also remove our ability to replace certain tags the way you mentioned.
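For reference, a tag-stripping approach along those lines could look roughly like the sketch below. It is written from scratch on top of the standard-library `html.parser` rather than copied from the linked answer, and the names `TagStripper` and `strip_tags` are illustrative. Skipping the contents of `<script>` and `<style>` is an added assumption, so that their bodies do not end up in the description as plain text.

```python
from html.parser import HTMLParser
from io import StringIO

SKIPPED = {"script", "style"}


class TagStripper(HTMLParser):
    """Illustrative helper: collects text content and discards all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text.write(data)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any buffered trailing text
    return stripper.text.getvalue()
```

Because every tag is discarded, headings, links and lists all collapse into running text, which is the trade-off mentioned above.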

JackGilmore commented 2 years ago

@nutcracker22 That could be a good shout. I would say that dealing with the styling issues being passed through is more important as there's a fairly low risk of us getting XSSed by public sector websites.

I was just thinking about this during the week and was wondering if we could use Markdownify and just convert the HTML to markdown? Our website uses Jekyll anyway, so rendering markdown should be supported out of the box (I say should because I haven't tested this).

It should be easy enough to implement by converting the description HTML to markdown during the export2jkan.py step of the pipeline.

OpenDataScotland / jkan

Portal site-specific styling being passed through to JKAN site #20