Q726kbXuN / nytxw_puz

Turn NY Times crosswords into Across Lite files
The Unlicense

New endpoint handling broke HTML clue formatting? #22

Open tkoft opened 2 years ago

tkoft commented 2 years ago

In the puzzle JSON, clues with any formatting (italics is the one I see most) have a "formatted" field in addition to "plain" under their "text" attribute in the "clues" list.

E.g. 21A from December 26, 2021:

{
   "cells":[
      36,
      37,
      38,
      39,
      40,
      41
   ],
   "direction":"Across",
   "label":"21",
   "text":[
      {
         "formatted":"<i>Malice, more formally</i>",
         "plain":"Malice, more formally"
      }
   ]
}

Only the "plain" field is ever used, even though the latin1ify function seems to handle HTML tags for exactly this purpose.

https://github.com/Q726kbXuN/nytxw_puz/commit/ac4c7a72c94c0580302b009b97c1b415da8e9ac7#diff-489afda12299c7df1e4831871e50efb4251e75dc0b31d4c662ba56f0c806ba3eR427
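
For illustration, here is a minimal sketch of what preferring "formatted" could look like. The helper name clue_text is hypothetical (it is not from the repository), and concatenating multiple "text" entries is an assumption, since the examples here only ever show one entry per clue.

```python
def clue_text(clue):
    """Join a clue's text entries, preferring "formatted" over "plain"."""
    parts = []
    for entry in clue.get("text", []):
        # Fall back to "plain" when "formatted" is missing or empty.
        parts.append(entry.get("formatted") or entry.get("plain", ""))
    # Assumption: multiple entries are simply concatenated.
    return "".join(parts)


# 21A from December 26, 2021, trimmed to the fields used here.
clue = {
    "direction": "Across",
    "label": "21",
    "text": [
        {
            "formatted": "<i>Malice, more formally</i>",
            "plain": "Malice, more formally",
        }
    ],
}
print(clue_text(clue))
```

The returned string would still contain HTML, so it would presumably need to go through latin1ify (or something like it) before being written out.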

tkoft commented 2 years ago

Also worth noting: I found one old puzzle where the "plain" clue was uppercased for some reason. I'm not sure how often this occurs, but it's another reason to use the "formatted" field instead of "plain".

38A from July 11, 2019:

{
   "cells":[
      109,
      110,
      111,
      112,
      113,
      114,
      115
   ],
   "direction":"Across",
   "label":"38",
   "text":[
      {
         "formatted":"<i>Diaper</i>",
         "plain":"DIAPER"
      }
   ]
}
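
To get a sense of how often this happens, something like the following rough sketch could flag entries where "plain" and "formatted" differ only by letter case once markup is removed. The function names are hypothetical, and the regex-based tag stripping is a simplification, not how the repository handles HTML.

```python
import html
import re

# Simplified: treats anything between "<" and ">" as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Remove tags, then decode entities such as &amp; or &#34;."""
    return html.unescape(TAG_RE.sub("", text))


def differs_only_by_case(entry):
    """True if "plain" and "formatted" match except for letter case."""
    if "plain" not in entry or "formatted" not in entry:
        return False
    plain = strip_markup(entry["plain"])
    formatted = strip_markup(entry["formatted"])
    return plain != formatted and plain.casefold() == formatted.casefold()


print(differs_only_by_case({"formatted": "<i>Diaper</i>", "plain": "DIAPER"}))
```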

Q726kbXuN commented 2 years ago

latin1ify handles HTML because "plain" itself contains HTML a surprising number of times, even when both plain and formatted are present:

8727 2019-10-21 &#34;    &#34;Monday Night Football&#34; airer
5695 2019-10-21 &#39;    Neither&#39;s partner
 629 2007-11-22 <i>      <i>See diagram</i>
  82 2020-02-21 &amp;    Recipient of a lot of #@&amp;! money
  27 1994-03-17 <sup>    Presider over the 103<sup>rd</sup> Congress
  26 1995-08-22 <sub>    KNO<sub>3</sub>
  19 2016-05-13 &nbsp;   What 😠 &nbsp;means in a text
  12 2014-11-10 <br>     15, for any row, column or diagonal here:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;2<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;7<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;6
  11 2014-06-08 <s>      <s>Symbols of happiness</s> Transmissions with colons, dashes and parentheses?
   8 2014-07-09 <em>     <em>Words on a birth announcement</em>
   7 2015-03-16 </span>  <span id="yui_3_17_2_4_1426084153879_1046" class="ya-q-full-text">★</span>
   6 2015-04-12 <b>      A B C D <b>&exist;</b> F G
   5 2015-04-07 &deg;    90&deg; from oeste
   5 2016-02-19 <strong> <strong>&amp;</strong> <strong>18</strong>&nbsp;Italian-born composer
   3 2020-02-26 &gt;     If A&gt;B and B&gt;C, then A&gt;C, e.g.
   2 2020-08-05 &lt;     &lt;&lt;&lt; button: Abbr.
   1 2011-02-21 &rarr;   &rarr; or &larr;
   1 2015-07-12 &eacute; Cond&eacute; ___: Vogue publisher
   1 2015-12-28 &cent;   &cent;
   1 2016-05-21 &bull;   With 9-Across, [&bull;] [&bull;] at a casino
   1 2018-04-03 &rdquo;  <span style="color: #222222; font-family: Roboto, arial, sans-serif; font-size: 16px; text-indent: 0px;">&rdquo;</span>
   1 2018-08-28 &euro;   What the "&euro;" symbol stands for
   1 2019-03-12 &mdash;  The "x" in Euler's Identity &mdash; e<sup>i&pi;</sup>&nbsp;+ 1 = x
   1 2019-05-05 </p>     <p style="border: 1px solid black; padding: 1px;">Collection</p>
   1 2019-05-31 &ndash;  &ndash;
   1 2019-06-16 &zwj;    🏳️&zwj;🌈, for one
   1 2019-08-24 <!--     It shares a key with @<!--EndFragment--><!--EndFragment-->
   1 2019-10-07 &radic;  &radic;This clue's number

(The first column is the number of times an HTML fragment was seen, the second column is the date it was first seen, and the third is the fragment, followed by an example clue.)
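
A tally along these lines could be produced with a sketch like the one below. This is not the script that generated the table above: the regex, the function names, and the (date, text) input shape are all assumptions, and it counts every occurrence of an opening tag, entity, or comment opener while ignoring closing tags.

```python
import re
from collections import Counter

# Matches a comment opener, an opening tag (with or without attributes),
# or a named/numeric character entity. Closing tags are not matched.
FRAGMENT_RE = re.compile(
    r"(?P<comment><!--)"
    r"|<(?P<tag>[A-Za-z][A-Za-z0-9]*)\b[^>]*>"
    r"|(?P<entity>&#?[A-Za-z0-9]+;)"
)


def fragments(text):
    """Return a normalized key for each HTML fragment found in text."""
    found = []
    for m in FRAGMENT_RE.finditer(text):
        if m.group("comment"):
            found.append("<!--")
        elif m.group("tag"):
            found.append("<" + m.group("tag").lower() + ">")
        else:
            found.append(m.group("entity"))
    return found


def tally(clues):
    """clues: iterable of (iso_date, clue_text) pairs."""
    counts = Counter()
    first_seen = {}
    for date, text in clues:
        for frag in fragments(text):
            counts[frag] += 1
            # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
            if frag not in first_seen or date < first_seen[frag]:
                first_seen[frag] = date
    return counts, first_seen


sample = [
    ("2019-10-21", "&#34;Monday Night Football&#34; airer"),
    ("2007-11-22", "<i>See diagram</i>"),
    ("1995-08-22", "KNO<sub>3</sub>"),
]
counts, first_seen = tally(sample)
for frag, n in counts.most_common():
    print(f"{n:4d} {first_seen[frag]} {frag}")
```

One known weakness of this pattern: an unescaped "<" followed by a letter and a later ">" would be miscounted as a tag.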

I might move to using formatted when it's available, but I need to dig into the older puzzles to make sure doing so introduces no unexpected artifacts, and that latin1ify can handle whatever HTML tags are unique to formatted (assuming there are any).

Also, I should note: the unescaped emoji in the table above are present in "plain" as well, which is part of the fun.
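
As for the check on tags unique to formatted, a rough sketch might collect tag names from both fields across all clue text entries and take the difference. The names here are hypothetical, and the regex only looks at tag names, not attributes or entities.

```python
import re

# Captures the tag name from opening or closing tags.
TAG_NAME_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def tag_names(text):
    """Return the set of lowercase tag names appearing in text."""
    return {name.lower() for name in TAG_NAME_RE.findall(text)}


def tags_unique_to_formatted(entries):
    """Tag names seen in some "formatted" field but in no "plain" field."""
    plain_tags = set()
    formatted_tags = set()
    for entry in entries:
        plain_tags |= tag_names(entry.get("plain", ""))
        formatted_tags |= tag_names(entry.get("formatted", ""))
    return formatted_tags - plain_tags


entries = [
    {"formatted": "<em>x</em>", "plain": "x"},
    {"formatted": "KNO<sub>3</sub>", "plain": "KNO<sub>3</sub>"},
]
print(sorted(tags_unique_to_formatted(entries)))
```

Anything this reports would be a tag latin1ify has never had to deal with via "plain", so it would be worth testing by hand.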