Q726kbXuN / nytxw_puz

Turn NY Times crosswords into Across Lite files
The Unlicense

Decide how to handle HTML in clues #15

Closed jkboyce closed 2 years ago

jkboyce commented 2 years ago

Today's (Oct. 10, 2021) Sunday puzzle contains clues like:

4 Across: The universe has an estimated 10<sup>82 </sup> of them
23 Across: <i>Field of Dreams</i>

As of now, any HTML contained in clues is passed through unchanged. How this is handled on the user side depends on the client. Black Ink doesn't process any HTML tags, and shows them verbatim to the user. downforacross.com shows the <sup> tag verbatim but applies the <i></i> tag to the rendered clue, yielding italicized text.

How to handle this is a matter of debate. The worst solution IMHO is removing all HTML, because that can render clues meaningless; see how 4 Across is handled here. It may be possible to convert certain tags to sensible Latin-1 equivalents; for example, the above clue could be rendered as "The universe has an estimated 10^82 of them", although whether that is clearer to the average person is debatable. A final option is to leave it as-is, which seems reasonable as long as HTML doesn't appear too often in clues.
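To make the Latin-1 constraint concrete, here is a minimal sketch, assuming clue text has to be encodable as Latin-1 (as the comment above implies). The helper name `fits_latin1` is made up for illustration and is not part of nytxw_puz. It shows why a caret is a safer target than Unicode superscript digits: Latin-1 only includes superscript 1, 2 and 3.

```python
def fits_latin1(text):
    """Return True if every character in text can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# The caret form uses only ASCII, so it always fits.
print(fits_latin1("10^82"))
# Superscript eight (U+2078) is outside Latin-1, so this form does not fit.
print(fits_latin1("10\u2078\u00b2"))
```

So an exponent like 82 cannot be written with real superscript characters under this assumption, which leaves `10^82` or the raw markup as the practical choices.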

Q726kbXuN commented 2 years ago

Good question. Looking over the history, here are all the occurrences of < in a clue (the first number is the total number of times a given pattern was seen):

 546 2007-11-22 <i>See diagram</i>
  25 1995-08-22 KNO<sub>3</sub>
  23 1994-03-17 Presider over the 103<sup>rd</sup> Congress
  14 2018-06-03-variety <br />YEL
  12 2014-11-10 15, for any row, column or diagonal here:<br>
  11 2014-06-08 <s>Symbols of happiness</s> Transmissions with colons, dashes and parentheses?
  10 2016-02-19 <strong>&amp;</strong> <strong>18</strong>&nbsp;Italian-born composer
  10 2015-03-16 <span id="yui_3_17_2_4_1426084153879_1046" class="ya-q-full-text">?</span>
   8 2014-07-09 <em>Words on a birth announcement</em>
   6 2015-04-12-variety A B C D <b>&exist;</b> F G
   1 2019-08-24-mini It shares a key with @<!--EndFragment--><!--EndFragment-->
   1 2019-05-05-variety <p style="border: 1px solid black; padding: 1px;">Collection</p>

  12 2001-09-16 <--
   1 2006-03-09 :-<, in a chat room
   1 2012-06-17 With 34-Across, what "<" means
   1 2016-05-17-mini ___ than (what < means)
   1 2019-10-27 "<<" button: Abbr.

I'd suggest the following regexp rules:

"<i>(.*?)</i>" -> "_\\1_"
"<sub>(.*?)</sub>" -> "\\1"
"<sup>([0-9]+)</sup>" -> "^\\1"
"<sup>(.*?)</sup>" -> "\\1" # After the numeric <sup>
"<br( /|)>" -> " / "

And maybe run through the others and drop them (but leave the non-HTML uses of < alone). This should handle most of the useful cases. Unless there's any pushback, I'll add these rules in a day or so.
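The rules above could be applied roughly like this in Python. This is only a sketch of the proposal, not the code that ended up in the repository. The names `CLUE_RULES`, `LEFTOVER_TAG` and `clean_clue` are invented here, and the catch-all pattern for "the others" is an assumption about what counts as a tag: a `<`, then a letter, `/letter` or an HTML comment, closed by `>`.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the proposed rules.
# The numeric <sup> rule has to run before the generic <sup> rule.
CLUE_RULES = [
    (r"<i>(.*?)</i>", r"_\1_"),
    (r"<sub>(.*?)</sub>", r"\1"),
    (r"<sup>([0-9]+)</sup>", r"^\1"),
    (r"<sup>(.*?)</sup>", r"\1"),
    (r"<br( /|)>", " / "),
]

# Assumed catch-all for any remaining markup: an opening or closing tag
# that starts with a letter, or an HTML comment. A bare "<" with no
# matching ">" (as in '"<<" button') is left untouched.
LEFTOVER_TAG = re.compile(r"<(?:/?[A-Za-z][^<>]*|!--.*?--)>")


def clean_clue(clue):
    """Apply the substitution rules in order, then drop leftover tags."""
    for pattern, replacement in CLUE_RULES:
        clue = re.sub(pattern, replacement, clue)
    return LEFTOVER_TAG.sub("", clue)


print(clean_clue("<i>See diagram</i>"))
print(clean_clue("KNO<sub>3</sub>"))
print(clean_clue("Presider over the 103<sup>rd</sup> Congress"))
print(clean_clue('"<<" button: Abbr.'))
```

One thing to watch: the numeric `<sup>` rule only matches bare digits. If a clue really contains whitespace inside the tag, as in the `10<sup>82 </sup>` example quoted at the top of this issue, it falls through to the generic rule and the caret is lost. Allowing optional whitespace in that pattern would be one way around it, if that turns out to matter in practice.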

jkboyce commented 2 years ago

Sounds like a good approach! It's interesting to see the HTML usage in puzzles across time. Great idea to separate out the numeric and non-numeric cases of <sup>.

Regarding strikethrough <s>, I notice the NYT's .puz version of the June 8, 2014 puzzle (titled "Strike One") renders them as, e.g.:

23 Across: [*cross out* Symbols of happiness] Transmissions with colons, dashes and parentheses?

which is maybe the most sensible way to handle it. Ignoring the markup, or swallowing the entire contents of the <s> tag, would render a less understandable clue IMHO. Granted, this case is very rare!
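A rule in that spirit might look like the following. This is a sketch only: the function name `render_strikethrough` is hypothetical, and the bracketed wording is copied from the NYT .puz rendering quoted above rather than from any agreed convention.

```python
import re


def render_strikethrough(clue):
    """Keep struck-out text visible by wrapping it in a bracketed note."""
    return re.sub(r"<s>(.*?)</s>", r"[*cross out* \1]", clue)


clue = "<s>Symbols of happiness</s> Transmissions with colons, dashes and parentheses?"
print(render_strikethrough(clue))
```

This keeps both halves of the clue readable, which matters for a puzzle whose theme depends on the crossed-out text.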

edsantiago commented 2 years ago

LGTM! Minor suggestion, since the whitespace in <br /> is optional:

- "<br( /|)>" -> " / "
+ "<br[ /]*>" -> " / "

And, although <em> isn't equivalent to italics, I've never seen it rendered otherwise. Seems safe to add a rule similar to <i>.
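For anyone comparing the two `<br>` patterns, here is a quick sketch of the difference, plus one possible way to fold `<em>` into the italics rule. The names `BR_ORIGINAL`, `BR_RELAXED` and `italicize` are illustrative only, and the combined `(i|em)` pattern is an assumption about how the extra rule might be written.

```python
import re

BR_ORIGINAL = re.compile(r"<br( /|)>")   # matches <br> and <br /> only
BR_RELAXED = re.compile(r"<br[ /]*>")    # also matches <br/> and similar

# Show which spellings of the tag each pattern accepts.
for tag in ["<br>", "<br />", "<br/>"]:
    print(tag, bool(BR_ORIGINAL.fullmatch(tag)), bool(BR_RELAXED.fullmatch(tag)))


def italicize(clue):
    """Treat <i> and <em> alike; the backreference \\1 makes the closing tag match."""
    return re.sub(r"<(i|em)>(.*?)</\1>", r"_\2_", clue)


print(italicize("<em>Words on a birth announcement</em>"))
```

The relaxed pattern is slightly more permissive than strictly needed (it would accept `<br //>` too), which seems harmless for clue text.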

Thank you for taking this on so quickly after yesterday's puzzle!

Q726kbXuN commented 2 years ago

Thanks for the feedback. 40dbf7c4ea21bea4242ba48c60e8d080d7a4fbf6 should implement this behavior.