bokuweb / docx-rs

:memo: A .docx file writer with Rust/WebAssembly.
https://bokuweb.github.io/docx-rs/
MIT License
342 stars, 61 forks

Special characters are being carried across using HTML entities vs unicode #554

Closed gjblajian closed 1 year ago

gjblajian commented 1 year ago

Describe the bug

HTML-encoded values are coming across in the text elements instead of Unicode characters. The ones we have seen include &apos;, &quot; and &amp;.

Reproduced step

Steps to reproduce the behavior:

import { readDocx } from 'docx-wasm'
const parsedDoc = readDocx(buf)
console.log('parsedDoc:', parsedDoc)

Expected behavior

Would prefer to see the values in the output as Unicode characters, e.g. ←, since many special characters do not actually have HTML entity translations (MS Word's opening and closing double quotes are different Unicode code points [U+201C, U+201D]).

Actual behavior

HTML-encoded values: &apos;, &quot; and &amp;
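
A possible consumer-side stopgap, sketched here under assumptions: decode the five predefined XML entities (and numeric character references) in every string of the parsed output. The helper names `decodeXmlEntities` and `decodeStrings` are hypothetical and not part of docx-wasm, and the sketch assumes the value returned by `readDocx` is a plain JSON-like object.

```javascript
// Hypothetical helpers, NOT part of docx-wasm.
// The five entities predefined by XML.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Decode named and numeric references in a single pass, so that text such as
// "&amp;quot;" becomes "&quot;" and is not decoded a second time.
function decodeXmlEntities(text) {
  return text.replace(
    /&(#x[0-9a-fA-F]+|#[0-9]+|amp|lt|gt|quot|apos);/g,
    (match, body) => {
      if (body[0] !== '#') return NAMED_ENTITIES[body];
      const codePoint =
        body[1] === 'x' ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
  );
}

// Recursively apply the decoder to every string value in a JSON-like structure.
// Assumption: the parsed document contains only objects, arrays and primitives.
function decodeStrings(value) {
  if (typeof value === 'string') return decodeXmlEntities(value);
  if (Array.isArray(value)) return value.map(decodeStrings);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, decodeStrings(inner)])
    );
  }
  return value;
}
```

With the reproduction above, this would be applied as `decodeStrings(parsedDoc)`. Decoding in one pass is a deliberate choice: chaining separate replacements for each entity risks double-decoding text that legitimately contains an escaped ampersand.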

Screenshots

Corporate Arbitration.docx

gjblajian commented 1 year ago

Note that this bug does not ALWAYS happen for &quot; or &apos;, but it seems to happen consistently for &amp;.

bokuweb commented 1 year ago

I'll check it later.

bokuweb commented 1 year ago

@gjblajian Please try 0.0.276-rc33

gjblajian commented 1 year ago

thank you, @bokuweb