danielthepope / trntxt

:steam_locomotive::train::train: A data-friendly UK train times site
https://trntxt.uk
MIT License
36 stars 9 forks source link

NRCC messages sometimes appear in their own <P> tag #26

Closed danielthepope closed 9 years ago

danielthepope commented 9 years ago

image

Generated HTML is currently

<p class="error">NRCC message:
  <P>Journeys to and from London Waterloo are being disrupted. More information can be found in
    <A href="http://nationalrail.co.uk/service_disruptions/103677.aspx">latest travel news.</A>
  </P>
</p>

Full HTML response on the wiki, for testing

<P> inside <p> :point_right: :fire:

Regex to find <p> tags: /<\/{0,}P>/ig. However we should really be removing all tags except <a> because they're actually useful. I need to up my regex game.

danielthepope commented 9 years ago

/<\/?[^a\/][^>]*>/ig is close, but retains tags beginning with 'a', i.e. <abbr>

danielthepope commented 9 years ago

If for whatever reason, the NRCC message returns a <abbr> tag, I don't want it included in trntxt. I have a failing test that I'd like to pass.

To do: modify the regular expression in removeHtmlTagsExceptA, src/nationalrail.js:172.

In short, <a href="#">pass</a> should return <a href="#">pass</a> (i.e. remain the same), <p>pass</p> should return pass, and <abbr>text</abbr> should return text.

Test with npm test. You don't need a National Rail API key for this one.

benfoxall commented 9 years ago

Would sanitize-html be an option? It seems a bit of an overhead, though I guess it's safer and might be easier to extend down the line.

(obligatory stack overflow link)

benfoxall commented 9 years ago

I guess you would lose line breaks at the end of <p>, if they're important you could map the tags to spans and achieve that with some styling:

clean = sanitizeHtml(dirty, {
  transformTags: {
    'p': sanitizeHtml.simpleTransform('span', {class: 'fake-p'})
  }
});

…would lose some of the semantic niceness, but might let you present it better to a user

danielthepope commented 9 years ago

Hey @benfoxall, thanks for taking a look :smile: And thanks for pointing me in the direction of that library. I'm not going to use it now, but that might well come in handy some point soon.

I've given the problem some thought this evening and I've written a regex that passes my test! Learned quite a bit about regexes in doing so! :smiley: It might well be able to go shorter, but meh.

/<\/?((([^\/a>]|a[^> ])[^>]*)|)>/ig

Regexr is wonderful

image