albrechtf / mcf2pdf

"My CEWE Photobook" MCF to PDF converter

Parsing of HTML formatted texts #6

Closed haraldballuch closed 7 years ago

haraldballuch commented 8 years ago

Longer HTML texts need a lot of stack, so parsing them causes stack overflow errors with the standard Java configuration. Here are two examples where parsing leads to wrong results. In the first case the regex seems to get confused by the many full stops (".") in the text, and probably for other reasons too. In the second case there is a problem with Chinese characters, so it may be a Unicode problem. By the way, it seems to me that the Cewe editor itself has problems with Unicode.
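As a stopgap for the stack overflow, the parsing step could be run on a thread that requests a larger stack, instead of raising `-Xss` for the whole JVM. The following is only a minimal sketch: `BigStackRunner` and `callWithStack` are hypothetical names that are not part of mcf2pdf, and the JVM treats the requested stack size as a hint that some platforms may ignore.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, not part of mcf2pdf.
final class BigStackRunner {

    // Runs the task on a new thread that asks for stackBytes of stack
    // and returns the task's result to the caller.
    static <T> T callWithStack(Callable<T> task, long stackBytes) throws Exception {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        // The four-argument Thread constructor accepts a stack size hint.
        Thread worker = new Thread(null, () -> {
            try {
                result.set(task.call());
            } catch (Throwable t) {
                // Also catches StackOverflowError so the caller sees it.
                failure.set(t);
            }
        }, "html-parse", stackBytes);

        worker.start();
        worker.join();

        // Rethrow whatever the task threw on the calling thread.
        Throwable t = failure.get();
        if (t instanceof Exception) throw (Exception) t;
        if (t instanceof Error) throw (Error) t;
        return result.get();
    }
}
```

A call such as `BigStackRunner.callWithStack(() -> parse(html), 64L * 1024 * 1024)` would request roughly 64 MiB of stack, where `parse` stands in for whatever method does the actual text parsing.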

albrechtf commented 7 years ago

This will be fixed in the next release: I improved the regular expression that extracts these paragraphs. A much cleaner solution would be an HTML parser, but since the input may be plain HTML (not XHTML, not XML), that would require another external library...
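For illustration only, here is one way a paragraph-extracting expression can be written so that long texts do not drive the Java regex engine into deep recursion. This is not the expression used in mcf2pdf; `ParagraphExtractor` and the `<p>`-based markup are assumptions. The idea is to avoid alternation inside a repeated group (such as `(.|\n)*`) and to use a reluctant `.*?` with `DOTALL` instead, which `java.util.regex` evaluates as a loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the pattern mcf2pdf actually uses.
final class ParagraphExtractor {

    // DOTALL lets '.' match line breaks, so no (.|\n)* group is needed.
    // The reluctant .*? stops at the first closing </p> tag.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the inner markup of every <p>...</p> element, in order.
    static List<String> paragraphs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }
}
```

Full stops inside the text are ordinary characters for this pattern, so a paragraph like `<p>A. B. C.</p>` is returned whole. Nested or unclosed `<p>` tags are not handled, which is one reason a real HTML parser remains the cleaner option.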

As for the Chinese characters: yes, confirmed, they are not displayed. The MCF file marks them as "Arial" in your example, and the usual Arial fonts evidently do not include Chinese characters. When I copy them from the source of your MCF file into MS Word, I cannot select "Arial" as the font for them. This will be listed as a known issue for now.
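A possible first step toward handling this case would be to detect text that contains Han characters, so that a different font could be chosen instead of the one named in the MCF file. The sketch below is an assumption about how that might look, not existing mcf2pdf code; `HanDetector` is a hypothetical name. It only detects the script. Whether a given font can actually render the text could then be checked with the standard `java.awt.Font.canDisplayUpTo(String)` method.

```java
// Hypothetical helper, not part of mcf2pdf.
final class HanDetector {

    // Returns true if any code point in the text belongs to the Han script.
    // Iterating over code points (not chars) also covers characters
    // outside the Basic Multilingual Plane.
    static boolean containsHan(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }
}
```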