Enable non-ASCII characters in text objects

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Add a text object
2. Enter a text with non-ASCII characters (e.g. German Umlaut)

What is the expected output? What do you see instead?
Expected: Text object with "Umlaut"
Instead: Text object with black rectangle for the non-ASCII character

What version of the product are you using? On what operating system?
HeeksCAD svn:1195
HeeksCNC svn:1045
Ubuntu 10.04

Please provide any additional information below.
Entering non-ASCII characters in the Properties table of the text-object is
no problem. HeeksCAD just cannot render the non-ASCII characters.

Original issue reported on code.google.com by g.mue...@gmail.com on 6 Jun 2010 at 10:00

GoogleCodeExporter commented 8 years ago

Oops, just saw that the fonts are taken from QCad. So I guess they do not contain the necessary non-ASCII characters.

So I have to see if I can find some fonts with German umlauts and check whether this will work.

Original comment by g.mue...@gmail.com on 6 Jun 2010 at 10:05

GoogleCodeExporter commented 8 years ago

I would be curious to know if this works too.  For the CXF fonts, the character 'name' is normally expressed as a single character between the square brackets.  There are examples (the RomanCS font is one) where the character is specified as a '#' character followed by four hexadecimal digits.  I have assumed that these can be plugged straight into the 'wxString::ToLong()' call with a base-16 parameter (see line 538 of CxfFont.cpp).  I have no idea whether this is correct or not.
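
The two naming forms described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from CxfFont.cpp; the function name `ParseCxfName` is hypothetical, and the assumption that the four digits after '#' are hexadecimal follows the description in this comment.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Parse the text found between the square brackets of a CXF glyph header.
// Accepts a single-byte name ("a") or the "#hhhh" form ("#0061").
// On success stores the character code in 'code' and returns true.
// ParseCxfName is a hypothetical helper, not a HeeksCAD function.
bool ParseCxfName(const std::string& name, unsigned long& code)
{
    if (name.size() == 1) {
        // Cast through unsigned char so bytes above 0x7F stay positive.
        code = static_cast<unsigned char>(name[0]);
        return true;
    }
    if (name.size() == 5 && name[0] == '#') {
        // Require exactly four hexadecimal digits after the '#'.
        for (std::size_t i = 1; i < name.size(); ++i) {
            if (!std::isxdigit(static_cast<unsigned char>(name[i]))) {
                return false;
            }
        }
        code = std::strtoul(name.c_str() + 1, nullptr, 16);
        return true;
    }
    return false;
}
```

With this sketch, "a" and "#0061" both yield the code 0x61, so either spelling would address the same glyph.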

If you can give me a specific example where a particular character is in the font file but it is not presented correctly I would be happy to repair the problem.

Original comment by David.Ni...@gmail.com on 7 Jun 2010 at 2:06

GoogleCodeExporter commented 8 years ago

Hi David
I think he means "special" characters like üöäÜÖÄ,
e.g. the "Ö" character at line 758 of courier.cxf.

I have attached a screenshot of this.

This problem also exists in the free version of QCad (2.0).
They have fixed it in the commercial version, QCad 2.2.

Hans

Original comment by aachen.h...@googlemail.com on 7 Jun 2010 at 11:41

Attachments:

Bildschirmfoto.png

GoogleCodeExporter commented 8 years ago

Hans,
 could you please add the Heeks file to this fault?  I don't know how to add text
with these special characters.
  Thanks
  David Nicholls

Original comment by David.Ni...@gmail.com on 7 Jun 2010 at 1:32

GoogleCodeExporter commented 8 years ago

David,
What I see first is in the screenshot above:
the correct characters on the left in Properties, and the wrong ones on the right on the canvas.

When I save the file and load it again, it is exactly reversed:
the wrong characters on the left in Properties, and the correct ones on the right on the canvas.

Hans

Original comment by aachen.h...@googlemail.com on 7 Jun 2010 at 7:36

Attachments:

GoogleCodeExporter commented 8 years ago

Hans,

you are right. I was actually trying to make a door bell plate with my last 
name engraved (containing an 'ü'-Umlaut).

When saving to XML, this character is saved in ISO-Latin1 encoding (0xFC in 
this case).

Could there be a problem with Ctt() and Ttc() from strconv.cpp? It seems that 
std::wstring::push_back() does the magic of converting from char to wchar_t.
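
One common pitfall with this kind of char-to-wchar_t widening can be sketched as follows. This is not the code from strconv.cpp; it assumes the conversion pushes each `char` into a `std::wstring`, and the name `WidenLatin1` is hypothetical. On platforms where plain `char` is signed, a byte such as 0xFC has a negative value and would not widen to U+00FC unless it is first cast to `unsigned char`.

```cpp
#include <cassert>
#include <string>

// Widen ISO-8859-1 bytes to wide characters. Latin-1 byte values are
// identical to the first 256 Unicode code points, so no table is needed.
// WidenLatin1 is a hypothetical name used only for this illustration.
std::wstring WidenLatin1(const std::string& bytes)
{
    std::wstring wide;
    wide.reserve(bytes.size());
    for (char c : bytes) {
        // Cast through unsigned char first: where plain char is signed,
        // 0xFC would otherwise be converted from a negative value.
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return wide;
}
```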

Sorry for not digging deeper yet, but I am not into this Unicode and character set stuff (yet).

Guido

Original comment by g.mue...@gmail.com on 7 Jun 2010 at 9:46

GoogleCodeExporter commented 8 years ago

David, Hans,

I tried the following:
In HText::ReadFromXMLElement(TiXmlElement* pElem) I changed:
  text.assign(Ctt(a->Value()));
to
  text = wxString::From8BitData(a->Value());

After loading a *.heeks with a text object containing ISO-Latin1, this actually 
restores the original state: Properties displaying the correct string (with 
non-ASCII chars), but the empty rectangle is again displayed in the drawing 
window.

So the font lookup just needs the ISO-Latin1 encoding, but wxWidgets uses its own internal encoding. So there is probably just a conversion back to ISO-Latin1 needed before the font lookup.

It also seems to be better to use wxString::From8BitData() and To8BitData() instead of Ctt() and Ttc(). But this "only" converts from/to ISO-8859-1. What about Eastern European character sets?

Would it be feasible to store the text object strings in Unicode already? Is the font lookup capable of addressing more than ISO-8859-1?

Guido

Original comment by g.mue...@gmail.com on 8 Jun 2010 at 9:48

GoogleCodeExporter commented 8 years ago

Guido,

  well done.  From your lead, I looked at the way we were reading the character 'names' in from the CXF font files and I was making the same mistake there.  I had written a 'ss_to_wstring' routine in the strconv.cpp file and it was also only using 7 bits rather than the full 8 bits.  When I changed that to use the wxString::From8BitData() routine as well, the whole thing seemed to fall into place.

  I had one confusion and I still don't understand it.  The TiXML classes have comments through them indicating that they use UTF8 by default and so I would have expected the Attribute() methods to return UTF8 character strings.  I don't believe they do as, when I tried the wxString::FromUTF8() call instead of wxString::From8BitData() call, it produced an empty string.
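
One plausible explanation, stated here as an assumption rather than something confirmed in this thread: the file on disk held the ISO-Latin1 byte 0xFC, and a lone 0xFC byte is never valid UTF-8, so a strict UTF-8 decoder rejects the whole string. The sketch below is a simplified structural check written with the standard library only; `IsValidUtf8` is a hypothetical name, and the check does not reject every overlong or surrogate encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified structural UTF-8 check: verifies lead bytes and the number of
// continuation bytes. It does not catch every overlong or surrogate form.
// IsValidUtf8 is a hypothetical helper used only for this illustration.
bool IsValidUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80) {
            extra = 0;                       // plain ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;                       // two-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;                       // three-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;                       // four-byte sequence
        } else {
            return false;                    // e.g. a bare Latin-1 0xFC
        }
        if (i + extra >= s.size()) {
            return false;                    // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(s[i + k]);
            if ((next & 0xC0) != 0x80) {
                return false;                // not a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Under this check the Latin-1 spelling of 'ü' ("\xFC") is rejected, while the UTF-8 spelling ("\xC3\xBC") is accepted.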

  I can live with this confusion.  The only item left on this issue is that of supporting character encoding other than ISO-8859-1.  This is certainly enough for English.  I don't know whether you want to continue to investigate other options or whether this fix is enough for your needs.

  I started looking at classes such as wxCSConv, wxFontMapper and wxLocale but I quickly became more confused.

  I hope you don't mind but I have used your find (wxString::From8BitData()) and implemented the fix in both the HText.cpp and CxfFont.cpp files in Subversion.

  Thanks
  David Nicholls

Original comment by David.Ni...@gmail.com on 9 Jun 2010 at 3:57

GoogleCodeExporter commented 8 years ago

David, Guido,
Thanks for your work on this.
Working with non-ASCII characters in HeeksCAD looks good now.

The last problem is loading the saved .heeks file.
The error message "Error reading Attributes." is displayed.

When I open the .heeks file in an editor and delete the special characters,
I can load the file (without the special characters).

Hans

Original comment by aachen.h...@googlemail.com on 9 Jun 2010 at 8:38

GoogleCodeExporter commented 8 years ago

Hans,

  do you mean you can't load the 'unknown.heeks' file attached to this fault?  If so I will need to keep looking at it.  I can load that file fine on my machine.  That is what I have been using for testing.

  If it is a different file, please attach it to this issue.

  Thanks
  David Nicholls

Original comment by David.Ni...@gmail.com on 9 Jun 2010 at 9:20

GoogleCodeExporter commented 8 years ago

David,
Sorry, I forgot the attachment.

The first file, unknown.heeks, loads without problems.
The new one, unknown2.heeks, fails.

Hans

Original comment by aachen.h...@googlemail.com on 9 Jun 2010 at 1:32

Attachments:

unknown2.heeks

GoogleCodeExporter commented 8 years ago

David,

thanks for the quick fix! I might have found the location of the problem, but you have a better overview of the places where the fix needs to be applied.

Hans, your report about the behavior after saving and loading actually prompted me to look at the right places in the code.

David, yes, for me this fix is sufficient. But we could also think of saving 
and loading the XML data as UTF8 (using ToUTF8() and FromUTF8() in HText) and 
just use To8BitData() in CxfFont.cpp.
I think it would be a good idea in general to do every conversion between wxString and XML with the UTF8 conversion methods.

BTW, I tried to convert some ttf to cxf (using ttf2cxf), but the resulting Times_Roman.cxf could not be properly addressed. HeeksCAD just displays rectangles for all characters. Instead of having just the character in brackets (like "[a]"), the cxf generated by ttf2cxf contains something like "[#0061]". Well, that should probably go into a separate issue.

Guido

Original comment by g.mue...@gmail.com on 9 Jun 2010 at 7:55

GoogleCodeExporter commented 8 years ago

Guido,

  I didn't know there was any such thing as 'ttf2cxf'.  I will have a look at it.  The CxfFont.cpp file does try to interpret the [#0061] format but it evidently does not do so correctly.

  I will have another look at reading the XML data as UTF8 but my first try didn't work.  There may be some conversion in TiXML that I'm not expecting.

  Leave this fault open for these changes.

  Thanks
  David

Original comment by David.Ni...@gmail.com on 9 Jun 2010 at 9:33

GoogleCodeExporter commented 8 years ago

David,

it worked for me when changing both: HText::WriteXML() and 
HText::ReadFromXMLElement() to the UTF8 methods. OK, you need to save to UTF8 
first or create a new *.heeks file.
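
Files saved before the switch hold ISO-Latin1 bytes, which is why they need to be re-saved. A one-off conversion of such bytes could be sketched as below, using only the standard library; `Latin1ToUtf8` is a hypothetical helper and not part of HeeksCAD. It relies on the fact that every Latin-1 byte at or above 0x80 maps to a two-byte UTF-8 sequence.

```cpp
#include <cassert>
#include <string>

// Convert ISO-8859-1 bytes to UTF-8. Bytes below 0x80 are copied as-is;
// bytes 0x80..0xFF become a two-byte sequence 110000xx 10xxxxxx.
// Latin1ToUtf8 is a hypothetical name used only for this illustration.
std::string Latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (char c : latin1) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            utf8.push_back(c);
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return utf8;
}
```

For the 'ü' discussed earlier in this issue, the single byte 0xFC becomes the two bytes 0xC3 0xBC.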

I was wondering if it is worth going through the whole code to see if we could replace the whole Ctt() and Ttc() conversion with the UTF8 methods. I have seen that Ctt() and Ttc() were also used for converting between filenames and wxStrings, etc.

But I am not sure if filenames and paths should be handled in UTF8. It also might be quite some work...

Guido

Original comment by g.mue...@gmail.com on 9 Jun 2010 at 10:11

GoogleCodeExporter commented 8 years ago

Guido,
  I agree that it's worth changing both HeeksCAD and HeeksCNC to use UTF-8 in all circumstances.  If we don't do this then we're just going to bump into the same problem in another place.

  I am playing with the idea of changing the definitions of Ttc() and Ctt() so that they're as shown below.  It will be quite a bit of work but I'm happy to do it.  It may just take me a day or two to get it done.

  The definitions would be:

#define Ttc(s) (const char *) wxString(s).mb_str(wxConvUTF8)

inline wxString Ctt(const char String[] = "")
{
    return wxString(String, wxConvUTF8);
}

  These definitions seem to work for the Linux build but I need to make sure they also work for the Windows build.  I'm not sure whether Dan uses the Unicode builds or the Release (non-Unicode) builds when he releases the Windows version of Heeks.

  I also had a look at the ttf2cxf conversion utility.  It does an excellent job.  I can see that the character names are stored in the [#0228] format but I'm still working through how to use this value.  It's obviously a 16-bit number.  I was using a wxChar as the key into each font map but I think I need to change that to an 'unsigned long'.  That's what the character names are stored as within the ttf2cxf conversion code.  I think that means that the characters are all being stored as Unicode characters.  With our other changes we're reading and writing UTF-8 characters into and out of the wxString variables.  From my reading the wxString always stores the characters in Unicode internally.  I am expecting to be able to read the Unicode character back when I need to compare it with the font map we read in during the CXF font parsing process.
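
The idea of keying the font map by an 'unsigned long' instead of a character type can be sketched like this. The `Glyph` struct, the `GlyphMap` typedef and `FindGlyph` are hypothetical stand-ins, not the types used in CxfFont.cpp, and the stroke strings are placeholders rather than a statement about the CXF stroke syntax.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed CXF glyph: just its stroke commands.
struct Glyph {
    std::vector<std::string> strokes;
};

// Keyed by Unicode code point, so values above 0xFF can be used as keys.
typedef std::map<unsigned long, Glyph> GlyphMap;

// Return the glyph for a code point, or nullptr when the font has none,
// so that the caller can decide how to render a missing character.
const Glyph* FindGlyph(const GlyphMap& font, unsigned long codePoint)
{
    const GlyphMap::const_iterator it = font.find(codePoint);
    return it == font.end() ? nullptr : &it->second;
}
```

A lookup then takes the numeric code directly, for example `FindGlyph(font, 0x00FC)` for 'ü', regardless of how wide the platform's character type is.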

  I may have missed something but I think it will all work out alright.  It will just take me a little while.

  Thanks
  David

Original comment by David.Ni...@gmail.com on 10 Jun 2010 at 11:40

GoogleCodeExporter commented 8 years ago

David,

be careful where you use UTF8, though. For the ASCII character set it is one byte per character, but it can also be multiple bytes per character (up to 4).
So if there are places where you expect a one-byte value in order to do some lookup (like the fonts), this might fail badly.
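
The difference between bytes and characters can be made concrete with a small decoder. This is a sketch using only the standard library and assuming well-formed input; `DecodeUtf8` is a hypothetical name. Any per-character lookup would need to work on the decoded code points rather than on the raw bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 into code points. Assumes well-formed input: a malformed
// lead byte is passed through as a single value rather than rejected.
// DecodeUtf8 is a hypothetical helper used only for this illustration.
std::vector<unsigned long> DecodeUtf8(const std::string& utf8)
{
    std::vector<unsigned long> codePoints;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t extra = 0;       // number of continuation bytes
        unsigned long cp = lead;     // payload bits collected so far
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        codePoints.push_back(cp);
        i += extra + 1;
    }
    return codePoints;
}
```

For the UTF-8 bytes of "Müller", the string holds 7 bytes but decodes to 6 code points, the second of which is 0xFC.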

Guido

Original comment by g.mue...@gmail.com on 10 Jun 2010 at 7:26

GoogleCodeExporter commented 8 years ago

Guido and Hans,

 I have gone ahead with this change as I believe it to be correct for both Linux and Windows.  People will either be quite happy with the change or they will be out buying pitchforks tonight.

 The error that was seen when the [#0021] form of the character 'name' was used was not due to a char versus wchar_t difference as I had expected.  It was due to the way I had constructed my CXF file parsing loop.  It was simply skipping the characters that used this naming convention.

  I can now embed characters with the umlaut and render them using a converted TrueType font (via ttf2cxf).  I'm so pleased you directed me to this utility.  I will definitely use it from now on.  I had been using the 'fonttracer' utility (I think that's what it was called).  The ttf2cxf conversion is much more convenient.

  Unless I've forgotten something, I think this solves all the problems included in this issue.  If I have missed something please let me know.

  Thanks
  David Nicholls

Original comment by David.Ni...@gmail.com on 11 Jun 2010 at 4:39

GoogleCodeExporter commented 8 years ago

Guido,
  I have added a note to the Fonts wiki so that people will know that the ttf2cxf conversion utility exists.

  Thanks
  David

Original comment by David.Ni...@gmail.com on 11 Jun 2010 at 4:55

GoogleCodeExporter commented 8 years ago

Hi David/Guido
Let's put it like this: great work!
From my side it looks good.
When you use ttf2cxf, have a look at this:
http://www.ribbonsoft.com/rsforum/viewtopic.php?t=415

THX
Hans

Original comment by aachen.h...@googlemail.com on 11 Jun 2010 at 9:27

GoogleCodeExporter commented 8 years ago

David,
With the Windows build, when I try to open a STEP file or a heeks file which 
contains solids, I get "STEP import not done!". It seems to be because of the 
change to the function "Ttc". On this line:
        Standard_CString aFileName = (Standard_CString) (Ttc(filepath));
( line 709 Shape.cpp )
filepath looks correct, but aFileName seems to be corrupted.
Maybe your changes should be done just for Linux?
Dan.

Original comment by danhe...@gmail.com on 14 Jun 2010 at 4:49

GoogleCodeExporter commented 8 years ago

Dan,
  can you attach a STEP file to this issue for me to reproduce the problem with?  I would be eager to leave the change in for both Windows and Linux as it makes the language handling consistent.

  Ta
  David

Original comment by David.Ni...@gmail.com on 14 Jun 2010 at 10:20

GoogleCodeExporter commented 8 years ago

David,
I have tried it with many STEP files, including "cube.step".
You can easily make your own by creating a cube and then saving it as "cube.step".
Dan.

Original comment by danhe...@gmail.com on 14 Jun 2010 at 10:31

GoogleCodeExporter commented 8 years ago

Dan,

  I think this last change fixes the code for all situations (famous last words).  I believe the problem was due to the Shape code holding a const char * pointer to what is a very temporary buffer.  It now uses the static std::string concept as per the last version except that it still supports the UTF-8 format if needed.  This should work for all languages.  It's still a little dangerous but I think that, since we're not using Ttc() twice between getting the pointer and using it, we're getting away with it.
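
The "static std::string" idea and its caveat can be sketched as follows. This is not the HeeksCAD implementation: `TtcStatic` is a hypothetical name, and the input is an already converted `std::string` standing in for the wxString conversion. The returned pointer refers to one shared buffer, so it should be treated as valid only until the next call.

```cpp
#include <cassert>
#include <string>

// Return a C string backed by a single static buffer. Unlike a pointer
// into a temporary, it outlives the calling expression, but the next call
// to TtcStatic overwrites the buffer and may invalidate earlier pointers.
// TtcStatic is a hypothetical stand-in, not the real Ttc().
const char* TtcStatic(const std::string& converted)
{
    static std::string buffer;
    buffer = converted;
    return buffer.c_str();
}
```

A caller that needs the text beyond the next conversion can copy it immediately, for example `const std::string fileName = TtcStatic(path);`, and pass `fileName.c_str()` on from there.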

  If you still have trouble, please let me know.

  Ta
  David

Original comment by David.Ni...@gmail.com on 15 Jun 2010 at 11:05

GoogleCodeExporter commented 8 years ago

David,

Opening a STEP file works OK now for Windows and Linux, for me, since your 
recent changes.

Thanks.

Dan.

Original comment by danhe...@gmail.com on 15 Jun 2010 at 1:31

GoogleCodeExporter commented 8 years ago

Original comment by David.Ni...@gmail.com on 15 Jun 2010 at 1:40

Changed state: Fixed

Powerino73 / heekscad

Enable non-ASCII characters in text objects #280