HTML find and replace fix

mrchristian commented 5 months ago

We have a potential solution to the problem:

Thanks to @calnfynn from HsH, who is on a Praktikum with the Open Science Lab TIB, this code fix appears to solve the problem. Please add it to your Notebook code and let me know if it works.

https://gist.github.com/calnfynn/360c5f5bdcff96001336c946f6b13b59

mrchristian commented 5 months ago

Hi @calnfynn - I need you to do two write-ups on this fix: first, to explain what the changes are, and second, to show how to implement the changes in full.

A coder can figure these things out when they look at editing the Notebook, but they need to be made explicit to the user before they get started.

As an example: all I picked up was that the markdownify library converts the HTML to Markdown and that we no longer replace HTML characters. The user also needs to know how to add the new import and get the library loaded into Codespace.

The above is good issue fixing practice.

It only needs to be brief.

mrchristian commented 5 months ago

This is a fix for the HTML bug where some HTML characters, such as umlauts and punctuation marks, do not display properly and cause Quarto rendering to fail on PDF output.

This is the code with the fix: https://gist.github.com/calnfynn/360c5f5bdcff96001336c946f6b13b59

Below are the instructions about how to use the bug fix.

First you need to add a new Python library:
1. Edit requirements.txt and add a line: markdownify
2. In the terminal, run: pip install -r requirements.txt (this installs the library)
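To confirm the library is actually available in your Codespace before touching the Notebook, a quick check along these lines may help. This is only a sketch: the helper name is_installed is made up for illustration and is not part of the Notebook or the Gist.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be found in this environment."""
    # is_installed is a hypothetical helper, not part of the Notebook or the Gist.
    return importlib.util.find_spec(module_name) is not None


if is_installed("markdownify"):
    print("markdownify is available")
else:
    print("markdownify is missing: run pip install -r requirements.txt")
```

You can paste this into a spare Notebook cell or run it in a Python shell in the terminal.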
Next we edit your Jupyter Notebook:
1. At the top of cell 2, add the instruction to import the new library: paste in from markdownify import markdownify after import html.
2. In cell 2, replace the whole of the def get_text section with the code in the following Gist (copy everything from line 3 to line 19): https://gist.github.com/calnfynn/360c5f5bdcff96001336c946f6b13b59
You can now run your Notebook, and the text should render cleanly, now as a conversion from HTML to Markdown. Before you run the Notebook, use the Clear all outputs button.
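For anyone curious what the conversion does, here is a minimal sketch of the idea, not the Gist code itself. The function name html_to_markdown and the sample string are assumptions made up for illustration; the actual replacement for get_text is the code in the Gist.

```python
try:
    from markdownify import markdownify
except ImportError:  # the library has not been installed yet
    markdownify = None


def html_to_markdown(raw_html: str) -> str:
    """Convert an HTML fragment to Markdown (hypothetical helper, not the Gist's get_text)."""
    if markdownify is None:
        raise RuntimeError("markdownify is missing: run pip install -r requirements.txt")
    # markdownify parses the HTML, so entities such as &uuml; come out as
    # real characters and tags such as <b> become Markdown markup.
    return markdownify(raw_html).strip()


if markdownify is not None:
    # Made-up sample containing HTML entities and a bold tag.
    sample = "<p>Gr&uuml;&szlig;e aus <b>Hannover</b></p>"
    print(html_to_markdown(sample))
```

The point is that the whole fragment is converted in one pass, rather than individual HTML characters being found and replaced one by one.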

NFDI4Culture / CPS-Demo

HTML find and replace fix #12