Compare cell outputs semantically

I recently had to update a test that uses pytest-notebook to validate a table produced by notebook code, because pandas 2.0.2 slightly changed the whitespace it emits for that table in a Jupyter notebook.

While looking into this issue, I noticed that the Jupyter notebook stores both a plain-text and an HTML representation of the cell output:

"outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Time</th>\n",
       "      <th>M1 Power Dispatch [W]</th>\n",
       "      <th>M2 Power Dispatch [W]</th>\n",
       "      <th>M3 Power Dispatch [W]</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2021-07-26 10:00:00+00:00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-2000.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2021-07-26 11:00:00+00:00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-3000.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2021-07-26 12:00:00+00:00</td>\n",
       "      <td>-4000.0</td>\n",
       "      <td>-3000.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2021-07-26 13:00:00+00:00</td>\n",
       "      <td>-4000.0</td>\n",
       "      <td>-3000.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       Time  M1 Power Dispatch [W]  M2 Power Dispatch [W]  \\\n",
       "0 2021-07-26 10:00:00+00:00                    0.0                -2000.0   \n",
       "1 2021-07-26 11:00:00+00:00                    0.0                -3000.0   \n",
       "2 2021-07-26 12:00:00+00:00                -4000.0                -3000.0   \n",
       "3 2021-07-26 13:00:00+00:00                -4000.0                -3000.0   \n",
       "\n",
       "   M3 Power Dispatch [W]  \n",
       "0                    0.0  \n",
       "1                    0.0  \n",
       "2                    0.0  \n",
       "3                    0.0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
]
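
For reference, here is a minimal sketch of how the two representations could be pulled out of one such output entry using only the standard library. The helper name `mime_payloads` is hypothetical and is not part of pytest-notebook or nbformat; the only assumption about the format is that a multi-line value may be stored either as a single string or as a list of lines, as in the excerpt above.

```python
import json


def mime_payloads(output):
    """Return {mime_type: text} for one notebook output entry (hypothetical helper)."""
    payloads = {}
    for mime, value in output.get("data", {}).items():
        # Multi-line values may be a single string or a list of lines.
        payloads[mime] = "".join(value) if isinstance(value, list) else value
    return payloads


# A trimmed-down output entry in the same shape as the excerpt above.
raw = json.loads(
    '{"data": {"text/html": ["<div>\\n", "</div>"], "text/plain": ["a  b"]},'
    ' "output_type": "execute_result"}'
)
print(sorted(mime_payloads(raw)))
```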

This made me curious: would it be possible to modify pytest-notebook to load the text/html contents with an HTML parser such as BeautifulSoup and compare them semantically? That might avoid spurious test failures when the 'text/plain' contents change in a way that is not significant, such as different column spacing or whitespace characters.
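
To make the idea concrete, here is a rough sketch of what such a comparison could look like. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The names `TableTextExtractor`, `html_table_cells`, and `outputs_match` are made up for illustration and are not part of pytest-notebook. It compares only the text of table cells, row by row, and ignores attributes, styles, and whitespace differences.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # list of rows, each a list of cell strings
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. the <style> block) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Collapse runs of whitespace so spacing changes do not matter.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def html_table_cells(html):
    """Parse an HTML string and return its table cells as a list of rows."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def outputs_match(expected_html, actual_html):
    """True when both HTML fragments contain the same cell text."""
    return html_table_cells(expected_html) == html_table_cells(actual_html)


old = '<table border="1" class="dataframe">\n  <tr>\n    <th>0</th>\n    <td>-2000.0</td>\n  </tr>\n</table>'
new = '<table class="dataframe"><tr><th>0</th><td> -2000.0 </td></tr></table>'
print(outputs_match(old, new))  # True: only markup and spacing differ
```

Since the notebook stores text/html as a list of lines, the list would need to be joined with `"".join(...)` before being passed in. A real implementation would also have to decide how to treat outputs that are not tables at all.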

chrisjsewell / pytest-notebook

Compare cell outputs semantically #39