alxnbl / onenote-md-exporter

ConsoleApp to export OneNote notebooks to Markdown formats
GNU General Public License v3.0

Images in table cells have incorrect syntax #109

Open urbite opened 3 months ago

urbite commented 3 months ago

Describe the bug

An MD image has the following format, where the long hex number is a unique hash/identifier for the image file:

![image5](:/a7c2fd619c1948b2a8a3f6995e72ad3c)

After exporting a OneNote notebook with several 1 row x 2 column tables, where each table cell contains a single image, the following markdown is what shows up in Joplin:

| !\[image5\](:/a7c2fd619c1948b2a8a3f6995e72ad3c)<br><br>&nbsp; | &nbsp;<br><br>!\[image6\](:/7347a62442a648cda1634e07ffb66b91)<br><br>&nbsp; |
| --- | --- |

The result is that the images are not shown; instead, Joplin renders 'placeholder' tables with the above incorrectly formatted text in the cells.

There are two issues with the above output:

  1. The square brackets surrounding the image description are escaped with backslashes, which, AFAIK, is not correct MD syntax for images. Removing these backslashes from the Joplin MD of the imported OneNote notebook causes the images to be displayed correctly in the desired table format.
  2. The line breaks are not needed. Once the correction described in (1) makes the images display, they add vertical padding at the bottom of the table.

There are a number of tables with images in the OneNote notebook. This is the only way I can get images to be side-by-side in OneNote. All of these table instances have the same issue when imported into Joplin.
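For notes that have already been imported into Joplin, the two corrections described above can be sketched in Python. This is only an illustration, not part of the exporter: the function name `fix_table_row` is hypothetical, and the sketch assumes that table cells contain no literal `|` characters.

```python
import re

# Matches an escaped Markdown image such as !\[image5\](:/a7c2...)
ESCAPED_IMAGE = re.compile(r'!\\\[(.*?)\\\]\((.*?)\)')
# Matches runs of whitespace, <br> and &nbsp; at the start or end of a cell
PADDING = re.compile(r'^(?:\s|<br>|&nbsp;)+|(?:\s|<br>|&nbsp;)+$')


def fix_table_row(row: str) -> str:
    """Unescape image brackets and trim padding in each cell of one table row.

    Hypothetical helper; assumes cells contain no literal '|' characters.
    """
    if not row.lstrip().startswith('|'):
        return row  # not a table row, leave it untouched
    cells = row.strip().strip('|').split('|')
    fixed = []
    for cell in cells:
        cell = ESCAPED_IMAGE.sub(r'![\1](\2)', cell)  # issue (1)
        cell = PADDING.sub('', cell)                  # issue (2)
        fixed.append(cell)
    return '| ' + ' | '.join(fixed) + ' |'
```

Applying `fix_table_row` to each line of a note leaves non-table lines and the `| --- | --- |` separator row unchanged.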

To Reproduce

Exporting a OneNote notebook is all that is needed to reproduce this behavior.

All of my OneNotes are in a single notebook, which has a lot of proprietary information. If I need to copy the notebook and remove everything but the offending note(s) and then export, please request and I'll do so.

Expected behavior

In the specific example given, it is expected that 2 images would be displayed side-by-side as they are defined in a 1 row x 2 column table. There should be no extra trailing vertical padding at the end of the table.

Logs

The "logs_redacted.txt" log file has been attached. Any personal data in the export details has been replaced by [REDACTED].

Desktop (please complete the following information):

Additional context

Adding a screen cap of the MD for tables with images. The first table is unmodified from the OneNote export. The second table has been modified/fixed per the prior description. One can see that the original table MD results in a placeholder table when the content is rendered, but the copied and modified table is rendered correctly.

That's all, folks!

Original and fixed tables logs_redacted.txt

urbite commented 3 months ago

After having a look at the exported md file that was used in the above example, I found something that didn't look right. It appears that the image inside an html table cell is written in markdown syntax. As I'm not an html or md person, I don't know if embedding markdown in html is legal, but my understanding is that html can be embedded in markdown.

Here's a snippet from a page of the OneNote export of a 1 row, 2 column, table with an image in each cell.

<table>
<colgroup>
<col style="width: 39%" />
<col style="width: 60%" />
</colgroup>
<thead>
<tr class="header">
<th><p>![image5](../../resources/7d9aa210e9a94bdb8da21da1f9c03bc3.jpeg)</p>
<p></p></th>
<th><p></p>
<p>![image6](../../resources/a00c9069fef1407a95e6c1b371e0103d.jpeg)</p>
<p></p></th>
</tr>
</thead>
<tbody>
</tbody>
</table>

And the same code after tweaking it to embed the images in the cells using html syntax. The table row formatting was removed, but I am not certain this was needed.

<table>
<colgroup>
<col style="width: 39%" />
<col style="width: 60%" />
</colgroup>
<tr>
<td><p><img src="../../resources/7d9aa210e9a94bdb8da21da1f9c03bc3.jpeg" alt="image5"></p>
<p></p></td>
<td><p></p>
<p><img src="../../resources/a00c9069fef1407a95e6c1b371e0103d.jpeg" alt="image6"></p>
<p></p></td>
</tr>
<tbody>
</tbody>
</table>

After making the modifications, the images display correctly, side-by-side. I used Typora for the initial check. Then I imported the single modified md page into Joplin, where it also displayed correctly. Note that when importing this single page, the tables (all of them) keep their html formatting instead of being converted to markdown, as happened when the full (and unmodified) OneNote notebook was imported. That conversion can be seen in the original issue report post.

image

I noticed another formatting nit after importing the modified md page. One of the headings, 'First attempt...', is not displayed correctly. This appears to be due to a missing blank line after the closing table tag, </table>.

image
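If the missing blank line is indeed the cause, a small post-processing pass over the exported file could add it. Under CommonMark rules, an HTML block that starts with `<table>` only ends at the next blank line, so a heading placed directly after `</table>` is not parsed as markdown. The following is a minimal Python sketch with a hypothetical function name, not something the exporter does today:

```python
def add_blank_line_after_tables(lines: list[str]) -> list[str]:
    """Insert an empty line after each closing </table> tag that is
    immediately followed by a non-blank line.

    Hypothetical helper; `lines` holds the file content without newlines.
    """
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        next_line = lines[i + 1] if i + 1 < len(lines) else ''
        if line.strip().lower().endswith('</table>') and next_line.strip():
            out.append('')  # terminate the HTML block
    return out
```

Tables that are already followed by a blank line, or that end the file, are left as they are.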

However, I noticed that this same heading is not displayed correctly when the original OneNote export was imported into Joplin, so I'm not sure if this is a Joplin import issue. When viewed in Joplin, it appears that a backslash has been inserted in front of the ### header formatting.

image

The original refrigerator repair md page from the export directory is attached, as well as the modified page: Whirlpool wrx986sihz01 refrigerator_html-mod.md, Whirlpool wrx986sihz01 refrigerator.md

urbite commented 3 months ago

I now see this has been flagged in issue #48 and is a pandoc issue.

alxnbl commented 3 months ago

Thank you for the detailed bug report @urbite !

Indeed, the issue is the same as #48.

In certain situations, Pandoc converts Docx tables into html because it doesn't manage to translate the content into markdown. But if the html table contains an image tag, the OneNote exporter translates that tag into markdown.

Html tags are supported inside markdown, but markdown is not supported inside html (rmarkdown), which prevents the image from displaying.

The code that replaces html img tags needs to evolve so that it does not replace img tags nested inside another html tag (or at least inside a table tag). Maybe a generous contributor will take care of this change, or I will, but not in the near future.

urbite commented 3 months ago

Thanks for confirming that this is the same as issue #48, @alxnbl.

If I make a successful sed or awk script or 1-liner to fix up all of the images embedded in tables, I'll post it here. I had a quick try with grep but couldn't get it to match across line boundaries. But it is doable :)

alxnbl commented 3 months ago

If you have mastered regular expressions, do not hesitate to share one. Basically, the idea is to update the existing one to check that there is no html tag before the img tag to replace.
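One way to sketch that idea is to split the text on whole `<table>...</table>` blocks and convert img tags only in the parts outside them. The Python below is purely illustrative and does not reproduce the exporter's C# code: the function name and both regexes are assumptions, and it only protects img tags inside tables, not inside other html tags.

```python
import re

# Assumed patterns, for illustration only
IMG_TAG = re.compile(r'<img\s+[^>]*?src="([^"]*)"[^>]*?>', re.IGNORECASE)
ALT_ATTR = re.compile(r'alt="([^"]*)"', re.IGNORECASE)
TABLE_BLOCK = re.compile(r'(<table\b.*?</table>)', re.IGNORECASE | re.DOTALL)


def _img_to_markdown(match: re.Match) -> str:
    """Build ![alt](src) from one matched <img> tag."""
    alt = ALT_ATTR.search(match.group(0))
    return f'![{alt.group(1) if alt else ""}]({match.group(1)})'


def img_tags_to_markdown(text: str) -> str:
    """Convert <img> tags to markdown, except inside html tables."""
    # With one capturing group, re.split returns the table blocks at odd
    # indices and the text between them at even indices.
    parts = TABLE_BLOCK.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = IMG_TAG.sub(_img_to_markdown, parts[i])
    return ''.join(parts)
```

Splitting first keeps the image regex simple, since it never has to look backwards for an enclosing tag.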

alxnbl commented 3 months ago

Location in the code: the ExtractImagesToResourceFolder function in the file https://github.com/alxnbl/onenote-md-exporter/blob/main/src/OneNoteMdExporter/Services/Export/ExportServiceBase.cs

urbite commented 3 months ago

Here's an interim hack/solution for the images in table cells issue. It's a bash 1-liner using sed to do an inline replacement of the images in cells. Run this command in the root directory of the exported OneNote notebook, before importing into Joplin.

This was done in git bash on Windows 10, but should work on any platform with GNU sed (BSD/macOS sed expects a backup suffix argument after -i, e.g. -i ''). The imported notebook rendered the images correctly.

find . -type f -name "*.md" -exec sed -i 's/>!\[\(.*\)\](\(.*\))</><img src="\2" alt="\1"></g' {} +

No warranties, expressed or implied. It's been minimally tested, on one notebook which had 2 pages with images in table cells. Make a backup of your exported OneNote notebook before trying this.

I'm not a .net programmer at all. I took a look at the code section that extracts images. If this issue is still open when I have a lot of time I might dabble with it. But for now, the above 1-liner solves the (my!) problem.

NOTE: If you want to hack the sed regex, be aware that in sed's default (basic) regex syntax the () (parentheses) have to be escaped to be used as metacharacters. In extended regexes, such as with grep -E or sed -E, and in most other regex flavors, parentheses have to be escaped to be used as literals. This tripped me up when setting the backreferences. This online sed regex evaluator, https://seddy.dev/ , was helpful in arriving at the final functional regex.
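For anyone without sed at hand, the same replacement can be sketched in Python. This is an untested-on-real-notebooks illustration with hypothetical function names, and it assumes the exported files are UTF-8. One deliberate difference from the sed expression: the `.*` groups there are greedy, so two images on the same line would be merged into a single match, whereas the character classes below stop at the first `]` and `)`.

```python
import re
from pathlib import Path

# A markdown image sitting directly between two html tags: >![alt](src)<
MD_IMAGE_IN_TAG = re.compile(r'>!\[([^\]]*)\]\(([^)]*)\)<')


def md_images_to_img_tags(text: str) -> str:
    """Rewrite >![alt](src)< as ><img src="src" alt="alt"><."""
    return MD_IMAGE_IN_TAG.sub(r'><img src="\2" alt="\1"><', text)


def fix_export(root: str) -> int:
    """Rewrite every .md file under `root` in place; return the number changed.

    Assumes UTF-8 files. Back up the export before running this.
    """
    changed = 0
    for path in Path(root).rglob('*.md'):
        # newline='' keeps the original line endings untouched
        with open(path, encoding='utf-8', newline='') as f:
            original = f.read()
        fixed = md_images_to_img_tags(original)
        if fixed != original:
            with open(path, 'w', encoding='utf-8', newline='') as f:
                f.write(fixed)
            changed += 1
    return changed
```

Calling `fix_export('.')` from the root directory of the exported notebook mirrors the `find ... -exec sed` invocation above. Markdown images that are not enclosed directly between html tags are left alone.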