Closed GoogleCodeExporter closed 9 years ago
In HtmlBuffer.pas you can find a long list of supported character sets. Lots of
ISO char sets are supported simply by their numbers.
There are names for some Russian sets as well, but it looks like they're all
synonyms for the one Russian char set that Windows supports.
If you'd like to add some KOI translations, please join us at
https://github.com/BerndGabriel/HtmlViewer
We should think about a charset-"plugin" for THtmlBuffer.
OrphanCat
Original comment by OrphanCat
on 8 Feb 2012 at 8:46
OK, I will try and see whether something good comes of it...
Original comment by SchwarzK...@yandex.ru
on 8 Feb 2012 at 9:00
Original comment by OrphanCat
on 24 Feb 2012 at 4:57
So, after some hesitation, and as promised, I have added support for some encodings.
Some notes and comments on the original code:
1. In the original code DOES NOT WORK conversion from ISO-2022-JP.
2. I tried to get rid of the Windows function MultiByteToWideChar.
3. I was not able to embed the conversion code for: 932, 943, iso-2022-jp,
iso-2022-jp-1, iso-2022-cn, iso-2022-kr.
4. For some Asian languages that use two bytes per symbol, if a symbol missing
from the character set is found, the characters that follow it are not decoded.
5. Used a common encoding.
6. Maybe I added too much somewhere, failed to take something into account, or
chose the wrong way to embed the conversion.
If you are satisfied with this method, you can include it in HtmlViewer. You
can make changes and corrections.
P.S. I hope the translator translated correctly. :)
Original comment by SchwarzK...@yandex.ru
on 17 Apr 2012 at 5:56
Forgot to mention: I was unable to embed the conversion code for UTF-7.
Original comment by SchwarzK...@yandex.ru
on 17 Apr 2012 at 5:59
Thanks for all the hours of work on this large conversion package.
If I understand the file headers right, you converted the libiconv code to
Delphi?
For testing where can I get example files for the various codepages/charsets?
Could you please post (links to) example htmls?
Did you notice that I fixed HtmlBuffer.pas (issue 139) in the meantime? Code
pages 932, 936, 949, 950 and ISO-2022-JP are translated correctly since
revisions r257/r258 (March 15/21).
If "1. In the original code DOES NOT WORK conversion from ISO-2022-JP." means
that revision r258 of HtmlBuffer.pas still fails, please send an example html
for further testing.
Thank you again
OrphanCat
Original comment by OrphanCat
on 17 Apr 2012 at 8:36
OK, I will prepare test pages.
I ported only some Asian languages from libiconv to Delphi (to the best of
my knowledge of C++).
I used the latest revision of HtmlBuffer.pas, but it is not able to decode the
ISO-2022-JP page.
If you have any suggestions about other encodings, I will try to help.
Original comment by SchwarzK...@yandex.ru
on 17 Apr 2012 at 8:45
I will take a week for meditations...
Original comment by SchwarzK...@yandex.ru
on 19 Apr 2012 at 11:05
Please advise.
I cannot understand.
The function "function TBuffer.GetNext: Word;" advances the read position.
How can I find out the next character without moving the read position???
Original comment by SchwarzK...@yandex.ru
on 24 Apr 2012 at 10:47
Why do you think you need it? All multibyte charsets I've seen in the past
months were built in such a way that the current byte tells you whether you need
another byte to complete the character or not.
Original comment by OrphanCat
on 24 Apr 2012 at 10:54
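A minimal sketch of the point made in the comment above, using EUC-CN as the example: the lead byte alone decides how many bytes the character occupies, so no look-ahead is needed. The helper name EucCnCharLength is hypothetical and is not part of HtmlBuffer.pas.

```pascal
// Hypothetical helper, not taken from HtmlBuffer.pas.
// Returns how many bytes the character starting with Lead occupies in EUC-CN.
function EucCnCharLength(Lead: Byte): Integer;
begin
  case Lead of
    $A1..$FE: Result := 2;  // lead byte of a two-byte GB2312 character
  else
    Result := 1;            // ASCII, or an invalid byte consumed on its own
  end;
end;
```

For example, EucCnCharLength($41) is 1 and EucCnCharLength($B0) is 2; the decoder reads the second byte only in the latter case.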
Probably it is not necessary; I will think about it.
And now I will check.
Original comment by SchwarzK...@yandex.ru
on 24 Apr 2012 at 11:01
While working on the UTF-7 conversion, I ran into the following:
In UTF-7 the string "<>" looks like this: "+ADwAPg-".
After the first character is converted, the code goes to "procedure
THtmlParser.GetCh;" and "function TBuffer.PeekChar: TBuffChar;", and the
parser decides that a comment follows. As a result, the converted text is
not displayed correctly (image). But if you copy it, everything looks correct:
<!--StartFragment--><>,.[{]} <br />
>,.[{]} <br />
ABCDEFGHIJKLMOPQRSTUVXYZ<!--EndFragment-->
How can I specify that this is the symbol "<" and not a comment???
Do I have to pass something extra???
Original comment by SchwarzK...@yandex.ru
on 26 Apr 2012 at 11:51
Attachments:
If I understand the differences right then "+ADwAPg-," is translated by your
UTF-7 single character translator to the first 3 chars in the above image
11.png, while MultiByteToWideChar() used in CopyToClipboard translates
correctly.
I cannot see a chance for THtmlParser.GetCh to misunderstand a character unless
your UTF-7 extension in TBuffer.NextChar does not swallow the trailing '-'.
Notice that you might have to remember the UTF-7 state
'in-base64-encoded-block'. (Oops, the same is valid for the FJis state. I
committed the fix in r284). And actually the forgotten state could be a reason,
why TBuffer returns the '-' with the next NextChar.
However IMO image 11.png does not show the result of a detected comment, but
the result of a defective UTF-7 conversion.
Original comment by OrphanCat
on 28 Apr 2012 at 12:33
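A minimal sketch of the state handling discussed in the comment above: a UTF-7 decoder that returns one character per call must remember that it is inside a base64 block, and must swallow the terminating '-'. All names here (TUtf7State, Utf7Next, Utf7ToWide) are hypothetical; this is not the code from HtmlBuffer.pas or from the attached modules, and it treats #0 as the end of data.

```pascal
type
  // State that must survive between two calls (hypothetical record).
  TUtf7State = record
    InBase64: Boolean;  // True while inside a '+...-' block
    Bits: Cardinal;     // base64 bits collected so far
    BitCount: Integer;  // number of valid bits in Bits
  end;

function Base64Value(C: AnsiChar): Integer;
begin
  case C of
    'A'..'Z': Result := Ord(C) - Ord('A');
    'a'..'z': Result := Ord(C) - Ord('a') + 26;
    '0'..'9': Result := Ord(C) - Ord('0') + 52;
    '+': Result := 62;
    '/': Result := 63;
  else
    Result := -1;  // not a base64 character
  end;
end;

// Returns the next UTF-16 code unit, or #0 when Src is exhausted.
function Utf7Next(const Src: AnsiString; var Index: Integer;
  var State: TUtf7State): WideChar;
var
  V: Integer;
begin
  Result := #0;
  while Index <= Length(Src) do
  begin
    if State.InBase64 then
    begin
      V := Base64Value(Src[Index]);
      if V >= 0 then
      begin
        Inc(Index);
        State.Bits := (State.Bits shl 6) or Cardinal(V);
        Inc(State.BitCount, 6);
        if State.BitCount >= 16 then
        begin
          // A full code unit is available; keep the leftover bits.
          Dec(State.BitCount, 16);
          Result := WideChar((State.Bits shr State.BitCount) and $FFFF);
          State.Bits := State.Bits and ((Cardinal(1) shl State.BitCount) - 1);
          Exit;
        end;
      end
      else
      begin
        // The block ends here; a terminating '-' is swallowed.
        State.InBase64 := False;
        State.Bits := 0;
        State.BitCount := 0;
        if Src[Index] = '-' then
          Inc(Index);
      end;
    end
    else if Src[Index] = '+' then
    begin
      Inc(Index);
      if (Index <= Length(Src)) and (Src[Index] = '-') then
      begin
        Inc(Index);
        Result := '+';  // "+-" encodes a literal plus sign
        Exit;
      end;
      State.InBase64 := True;
      State.Bits := 0;
      State.BitCount := 0;
    end
    else
    begin
      Result := WideChar(Ord(Src[Index]));  // directly encoded character
      Inc(Index);
      Exit;
    end;
  end;
end;

// Convenience wrapper that decodes a whole string.
function Utf7ToWide(const Src: AnsiString): WideString;
var
  Index: Integer;
  State: TUtf7State;
  C: WideChar;
begin
  Result := '';
  Index := 1;
  State.InBase64 := False;
  State.Bits := 0;
  State.BitCount := 0;
  repeat
    C := Utf7Next(Src, Index, State);
    if C <> #0 then
      Result := Result + C;
  until C = #0;
end;
```

With this sketch, Utf7ToWide('+ADwAPg-,') yields '<>,': the state carries the base64 block across the two calls that return '<' and '>', and the '-' never reaches the caller.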
Today I will post my version of the conversion, so you can see what's wrong.
Original comment by SchwarzK...@yandex.ru
on 28 Apr 2012 at 12:43
So.
In my opinion this is the final version. You can use it as you see fit. I am
already using it myself. :)
Fixed:
1. Because the modules were originally used to convert strings, I did not realize
that the end-of-line marker in HtmlViewer is the character "$0". Fixed a
problem with reading past the end when a character missing from the encoding is found.
2. The encoding "KOI8-T" has no numeric code page equivalent. It is set to "-5".
3. Minor fixes.
Added:
1. Added aliases for known encodings.
2. Added support for some encodings.
3. Added forced recoding for encodings 1250...1258.
Since the changes were introduced in the source file "HtmlBuffer.pas", recoding
of "ISO-2022-JP" and "EUC-JP" has stopped working. This can be seen in the attached
examples.
As for examples to validate the conversion: it is difficult to find a real
page in, for example, "CP866"; "UTF-8" is what is mostly used for these purposes.
To create a page in a national character set, I used the well-known library
"iconv". The characters are taken from the sample "Sample.txt" and re-encoded, for
example, from Win-1251 to EUC-KR. The result is a test page. When it is
opened in HTMLViewer, you can judge the correctness of the conversion.
I think that's enough.
The folder "Add" contains the real pages I found.
That's all for now... :)
If there is any need for another encoding, I will try to help.
Original comment by SchwarzK...@yandex.ru
on 28 Apr 2012 at 8:00
A slight modification.
1. Added the numeric equivalent of the codepage.
2. Restored auto-detection of the encoding "iso-2022-jp", and added auto-detection
of "iso-2022-jp-1", "iso-2022-cn" and "iso-2022-kr".
3. If you decide to use my code in HtmlViewer, you can later add lines to the
recoding function "function TBuffer.AsString: TBuffString;".
Original comment by SchwarzK...@yandex.ru
on 30 Apr 2012 at 12:48
Attachments:
While working I noticed two interesting things:
1. The method "TBuffer.Convert" WILL NOT WORK in its current form.
The function "function TBuffer.Convert" calls "Buffer.AsString", which contains
the comparison "if FCodePage <> FInitalCodePage then". But with this method
FInitalCodePage cannot be specified explicitly. Therefore the recoding is
incorrect in any case. See the button "Buffer.Convert".
2. Button "Buffer.AsString"
If you use this code, for some strange reason, WITH DIFFERENT ENCODINGS and
WITH DIFFERENT TEXT a different extra character appears at the end of the text.
It does not always happen; it occurs at varying intervals. How to reproduce:
if the character does not appear the first time you press the button, close
the application and repeat. In my opinion this error does not depend
on my added code, because it uses the original code.
Original comment by SchwarzK...@yandex.ru
on 5 May 2012 at 2:56
Attachments:
The problem described in Comment 17 appears starting with r277. The problem is NOT in
the file "HtmlBuffer". I think the file "StyleUn" is to blame, though other
files in that release may also be involved.
Original comment by SchwarzK...@yandex.ru
on 7 May 2012 at 7:08
Have you made any decision about my solutions to the encoding problems?
Should I post the fix?
Can the topic then be closed?
Original comment by SchwarzK...@yandex.ru
on 1 Jul 2012 at 5:24
Hi,
although I appreciate your contributions, I didn't have the time to look into
them deeply enough to adapt them to the HtmlViewer.
I must admit, I could not understand every sentence (or better: series of words
terminated by a colon) your translator emitted :(
Image 11.png attracted my attention, as the '?' instead of Korean or Chinese
symbols is the oldest open issue (issue 10) and I couldn't reproduce it. But
these are independently shown "extra chars", aren't they?
The "extra chars" got their own issue 162 recently.
OrphanCat
Original comment by OrphanCat
on 1 Jul 2012 at 6:11
I think I found a solution to the extra character at the end. Tomorrow I will
post what has changed since then.
P.S. It is difficult to communicate through a translator. I try to express
my thoughts simply.
Original comment by SchwarzK...@yandex.ru
on 1 Jul 2012 at 6:23
Over the past couple of months I have not noticed any problems with the new
version of the module across different encodings.
1. Introduced several constants for use with the method "TBuffer.AsString".
2. The method "TBuffer.AsString" now works with all available encodings.
3. Fixed minor bugs.
4. Fixed the problem with an extra character at the bottom of the page, which
appears only when using "TBuffer.AsString".
The problem does not always occur, and only with this use. I fixed it in the
way I thought was correct. Maybe I'm wrong.
procedure CharByChar;
var
  I: Integer;
begin
  I := 1;
  repeat
    Result[I] := NextChar;
    if Result[I] = #0 then
      break;
    Inc(I);
  until false;
  SetLength(Result, I - 1);  // <=== this line
end;
What is described in Issue 162 did not help in removing the extra characters.
5. The method "TBuffer.Convert" does not work. Corrections to the underlying
code are needed.
For now I have run out of ideas on how to improve the conversion. :)
Original comment by SchwarzK...@yandex.ru
on 2 Jul 2012 at 9:06
Attachments:
Thanks for this immense work!
I will consider adding it to HtmlViewer 11.4.
I'd like to transform the huge "case FCodePage" in TBuffer.NextChar into a
bundle of classes derived from a TBuffAbstractDecoder implementing a virtual
method GetNext(Buffer: TBuffer): TBuffChar;
A member TBuffer.FDecoder: TBuffAbstractDecoder; can be initialized once in
SetCodePage and NextChar() becomes clear and short.
Please let me know, if you want to do this change.
BTW: the official HtmlBuffer.pas has changed in the meanwhile.
OrphanCat
Original comment by OrphanCat
on 2 Jul 2012 at 10:01
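A rough sketch of the decoder-class idea proposed in the comment above. It is not the code that later went into HtmlViewer: TByteSource stands in for TBuffer, and TAbstractDecoder, TLatin1Decoder, CreateDecoder and DecodeWith are hypothetical names. Only ISO-8859-1 (Windows code page 28591) is shown, because it maps each byte to the code point of the same value and needs no table. Written as Delphi; under Free Pascal it assumes {$mode delphi}.

```pascal
type
  // Hypothetical stand-in for the byte source; the real TBuffer is far richer.
  TByteSource = class
  private
    FData: AnsiString;
    FIndex: Integer;
  public
    constructor Create(const AData: AnsiString);
    function NextByte: Byte;  // returns 0 once the data is exhausted
  end;

  // Base class with one virtual method, as proposed above.
  TAbstractDecoder = class
  public
    function GetNext(Source: TByteSource): WideChar; virtual; abstract;
  end;

  // ISO-8859-1: every byte maps to the code point of the same value.
  TLatin1Decoder = class(TAbstractDecoder)
  public
    function GetNext(Source: TByteSource): WideChar; override;
  end;

constructor TByteSource.Create(const AData: AnsiString);
begin
  inherited Create;
  FData := AData;
  FIndex := 1;
end;

function TByteSource.NextByte: Byte;
begin
  if FIndex <= Length(FData) then
  begin
    Result := Ord(FData[FIndex]);
    Inc(FIndex);
  end
  else
    Result := 0;
end;

function TLatin1Decoder.GetNext(Source: TByteSource): WideChar;
begin
  Result := WideChar(Source.NextByte);
end;

// The decoder is chosen once per code page, not once per character.
function CreateDecoder(CodePage: Integer): TAbstractDecoder;
begin
  case CodePage of
    28591: Result := TLatin1Decoder.Create;  // ISO-8859-1
  else
    Result := nil;  // other code pages are left out of this sketch
  end;
end;

// Decodes Data completely; returns '' for code pages the sketch does not know.
function DecodeWith(CodePage: Integer; const Data: AnsiString): WideString;
var
  Source: TByteSource;
  Decoder: TAbstractDecoder;
  C: WideChar;
begin
  Result := '';
  Decoder := CreateDecoder(CodePage);
  if Decoder = nil then
    Exit;
  Source := TByteSource.Create(Data);
  try
    repeat
      C := Decoder.GetNext(Source);
      if C <> #0 then
        Result := Result + C;
    until C = #0;
  finally
    Source.Free;
    Decoder.Free;
  end;
end;
```

The reading loop contains no "case" on the code page; adding an encoding means adding one class and one branch in CreateDecoder.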
You can certainly try; that might work. To tell the truth, I am a poor coder,
and this needs a good understanding of the original code. I will help where I can.
My version of "HtmlBuffer.pas" includes all the changes from the original
code, with my own corrections.
Original comment by SchwarzK...@yandex.ru
on 3 Jul 2012 at 6:20
Hi,
when I try to compile your latest HtmlBuffer.pas (file date: July, 1st 2012)
the compiler (and I) cannot find methods Win1250DecodeChar ..
Win1258DecodeChar. More methods are missing or not exported by
CodeChangerDecode.pas.
Could you please post a complete set of source files?
Thank you
OrphanCat
Original comment by OrphanCat
on 26 Sep 2012 at 10:50
OK, I will update my changes to the latest state of THtmlViewer
and post a complete set.
Original comment by SchwarzK...@yandex.ru
on 26 Sep 2012 at 1:34
Thanks a lot.
It would be most helpful now, if you just add the missing methods to
CodeChangerDecode.pas.
Currently I'm changing HtmlBuffer.pas once again. I'm adding the above
mentioned TBuffer.FDecoder.
Later you can add the decoder class implementations. "Later" means "about a
week from now". Then you will find some examples in the new unit
BufferSubs.pas.
Thanks again
OrphanCat
Original comment by OrphanCat
on 26 Sep 2012 at 1:46
As promised.
Changes:
1. Updated as of r317.
2. I do not use Delphi versions newer than 2007. After a trial compile on XE3 I changed:
StrAlloc ==> AnsiStrAlloc.
Original comment by SchwarzK...@yandex.ru
on 27 Sep 2012 at 2:27
Attachments:
Please help me solve a problem related to HTMLViewer, since you know Unicode better than I do.
I cannot recode a string containing Cyrillic. I wanted to get Asian characters
displayed correctly.
If you use
function EUC_CNDecodeString(const S: String): WideString;
(button "String"), the Cyrillic (on the right) is recoded correctly, but the
Chinese string is recoded incorrectly.
If you use
function EUC_CNDecodeString2(const S: WideString): WideString;
(button "WideString"), the opposite is true.
I cannot determine where to use "AnsiString" and where "WideString". :(
Can you tell me how to...
I hope I have described the problem clearly.
Original comment by SchwarzK...@yandex.ru
on 27 Sep 2012 at 2:45
Attachments:
Hi,
unfortunately I cannot see different results. Both buttons convert the left
text to a Chinese text and the right text to 'a...z' because '§?' is no legal
EUC_CN character.
BTW: the methods in CodeChangerDecode.pas are too cumbersome.
- they do not return the number of consumed characters. The caller must use
additional code to determine that number.
- they allocate from heap although a local array variable with a fixed length
would be simpler.
- they use if-else-if chains instead of case constructs.
- they use PAnsiChar and a lot of Ord()s. Using PByte and removing Ord() makes
the code easier to read.
- the methods they call often repeat the same checks the caller already has
performed to find out, which method to call.
Instead of: function xxxDecodeChar(const P: PAnsiChar): WideChar;
they should be: function xxxDecodeChar(var P: PAnsiChar): WideChar;
or even better: function xxxDecodeChar(var P: PByte): WideChar;
The second version can advance P to the next character of the source, if
it successfully consumed a character (but it should not advance beyond the
trailing #0).
The third version is better, because it does not imply any character code like
PAnsiChar does.
FYI: Currently I'm writing some above mentioned decoder classes. I'm picking
the algorithms from CodeChangerDecode.pas, copy them to my new unit
BufferSubs.pas and optimize them.
OrphanCat
Original comment by OrphanCat
on 1 Oct 2012 at 5:52
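A small sketch of the "var P: PByte" signature suggested in the comment above. UTF-8 is used only because it needs no lookup table; the function is deliberately restricted to one- and two-byte sequences and returns #$FFFD for everything else. Utf8DecodeChar and DecodeUtf8 are hypothetical names, not functions from CodeChangerDecode.pas, and the sketch assumes a PByte type is in scope (the System unit in current Delphi and Free Pascal).

```pascal
// Decodes one character at P and advances P past the consumed bytes.
// P is never moved beyond the trailing #0.
function Utf8DecodeChar(var P: PByte): WideChar;
var
  Q: PByte;
  B1, B2: Byte;
begin
  B1 := P^;
  if B1 = 0 then
  begin
    Result := #0;  // end of data: do not advance
    Exit;
  end;
  Q := P;
  Inc(Q);          // the lead byte is always consumed
  case B1 of
    $01..$7F:
      Result := WideChar(B1);
    $C2..$DF:
      begin
        B2 := Q^;
        if (B2 and $C0) = $80 then
        begin
          Inc(Q);  // valid continuation byte: consume it as well
          Result := WideChar(((Integer(B1) and $1F) shl 6) or
            (Integer(B2) and $3F));
        end
        else
          Result := #$FFFD;  // a terminating #0 lands here and is not consumed
      end;
  else
    Result := #$FFFD;  // longer sequences are outside this sketch
  end;
  P := Q;
end;

// Convenience wrapper: decodes a #0-terminated AnsiString completely.
function DecodeUtf8(const S: AnsiString): WideString;
var
  P: PByte;
  C: WideChar;
begin
  Result := '';
  P := PByte(PAnsiChar(S));
  repeat
    C := Utf8DecodeChar(P);
    if C <> #0 then
      Result := Result + C;
  until C = #0;
end;
```

Because the function itself advances P, the caller needs no extra code to work out how many bytes were consumed, which is the first objection in the list above.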
I'm trying to build a string converter based on your HtmlBuffer. Ironically,
when individual characters are converted, everything is correct.
Where can I get BufferSubs? I do not see it either here or on the second
site.
As I said before, feel free to use my modules and optimize them however you
want. My version may not be the best. :)
Original comment by SchwarzK...@yandex.ru
on 1 Oct 2012 at 6:10
In fact, the difference appears when the text is typed from the keyboard.
I will investigate further.
Original comment by SchwarzK...@yandex.ru
on 1 Oct 2012 at 6:26
Attachments:
Hi,
this is not what I see when I am running the program.
In file unit1.dfm I see that there are Unicode characters (> #255) in control
Edit11.
If you want to convert to Unicode/WideString you must use AnsiString to supply
the multibyte character string.
So, you should convert Edit1 and Edit11 to simple VCL TEdit controls.
As to the cyrillic text in Edit11: What do you expect your decoder to do?
Convert from which code to WideString/Unicode? EUC_CN does not contain cyrillic
letters, thus any conversion of your current Edit22.Text to cyrillic unicode
letters is not done by EUC_CNDecodeChar(). Obviously it happens when
Edit11.Text is assigned to parameter S of method EUC_CNDecodeString(). This
conversion uses the character set of the operating system. This way your and my
results can differ as yours default is russian and mine is ansi.
EUC_CNDecodeString() "converts" the widechars in S via AnsiChar(s[i+1]), which
converts the first cyrillic character §С (= #167#1057) to §! (= #167#33). It
simply removes the high byte (#1057 = #1024 + #33 = #$0400 + #$0021). As #33 is
invalid as second byte EUC_CNDecodeChar() returns #$FFFD and the illegal letter
is skipped.
OrphanCat
Original comment by OrphanCat
on 1 Oct 2012 at 11:25
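The arithmetic in the comment above can be checked with a few lines. The function below does not call the AnsiChar() cast itself; it only reproduces the "remove the high byte" step described there, and the name LowByteOf is hypothetical.

```pascal
// Keeps only the low byte of a UTF-16 code unit.
function LowByteOf(C: WideChar): Byte;
begin
  Result := Byte(Ord(C) and $FF);
end;
```

Since #1057 is $0421, LowByteOf(WideChar(1057)) is $21 = 33, the code of '!', while #167 ($00A7) stays 167.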
EUC-CN includes GB2312, and GB2312 definitely contains Cyrillic. The
issue was something else. I did not realize then that the string to be converted
had originally been created as Unicode. For the Chinese to be re-encoded
correctly, the string had to be passed as Unicode, whereas for the conversion
to Cyrillic I used an ANSI string.
So the conversion algorithm itself works properly, but I am afraid I will
have to make major changes in my modules. :) To convert Unicode strings,
the input parameter still needs to be a WideString, i.e.
function EUC_CNDecodeString(const S: WideString): WideString;
There remains only the problem of how to determine whether a given part of a
wide string contains ANSI or Unicode when there is no clear indication...
Thank you.
Waiting for a new HTMLViewer. :)
Original comment by SchwarzK...@yandex.ru
on 2 Oct 2012 at 8:33
Hi,
I've committed the latest changes including new units BuffConv and
BuffConvArrays to GitHub.
OrphanCat
Original comment by OrphanCat
on 2 Oct 2012 at 4:38
I do not see anything on github.com. The files I download are from 09/25/2012.
Original comment by SchwarzK...@yandex.ru
on 2 Oct 2012 at 4:54
Did you look into branch HtmlViewer11?
Original comment by OrphanCat
on 3 Oct 2012 at 12:00
Now I found it. Cleverly organized site. :)
Original comment by SchwarzK...@yandex.ru
on 3 Oct 2012 at 8:25
As they say in Russia: there are two pieces of news, one good, the other bad.
Where to begin? :)
I'll start with the good.
You managed to significantly reduce the amount of conversion code. I mostly
used conversion tables, converting character by character, to avoid using the
function MultiByteToWideChar. But since you used it, then so be it.
Now the bad.
Pages (I posted them for testing earlier, and attach them again now) decoded
with errors on Win7:
922, 936, 949, big5, euc-jp, gb18030, gbk, iso-2022-cn, iso-2022-jp-1,
iso-2022-kr, iso-8859-10, iso-8859-14, iso-8859-16, koi8-t, utf-7; from the folder ADD:
gb2312, gb2312, ks_c_5601-1987
Additionally not decoded on WinXP without Asian fonts in the system: 858,
iso-8859-3, iso-8859-6, iso-8859-8
and I think a few more; I will test on a different machine.
Other encodings need not be added. Some encodings are not
used in HTML, and for others I cannot verify that the transcoding is correct.
I write here, not on GitHub, as I am not used to writing there.
So as I ask errors described in Issues150, Issues186 as keep 4-3 copy of the
code and do testing somehow difficult.
Original comment by SchwarzK...@yandex.ru
on 3 Oct 2012 at 3:16
Attachments:
Thanks for testing.
The erroneous pages are not yet implemented. TBuffConvSingleByte is just a
default for not (yet) explicitly implemented pages.
I hoped you could add the missing converters. I hoped you would register at
github and fork your own repository from mine and push your additions to your
github repository.
I'm sorry, but I cannot understand the last sentence about Issues 150 and 186.
Which copies of code do you keep?
OrphanCat
Original comment by OrphanCat
on 3 Oct 2012 at 7:17
I meant that I have kept a copy of the code that does not have the
errors described in Issues 150 and 186. I think it's version 11.2 or 11.3. Please
just try to correct the errors described in Issues 150 and 186.
The failing pages may simply not be implemented yet, but strangely the results
differ. For some reason some pages open on Win7 and do not open on WinXP. In my
opinion it is the MultiByteToWideChar function, which I got rid of, that is not
working correctly...
To be honest, at the moment I have very little time for programming, and it is
hard to say when I will have more... I will just try to keep up with the things
that interest me and do what I can.
Original comment by SchwarzK...@yandex.ru
on 3 Oct 2012 at 7:37
Ok, I will do it.
Original comment by OrphanCat
on 3 Oct 2012 at 7:42
Hi,
units BuffConv and BuffConvArrays are complete now, I think.
All code pages seem to be okay now, incl. UTF-7.
I would be glad if you could test it once again.
Thanks in advance.
OrphanCat
Original comment by OrphanCat
on 6 Oct 2012 at 4:45
OK, I will test. Last time I thought that all the encodings had already been
added, so I reported errors. :)
Original comment by SchwarzK...@yandex.ru
on 6 Oct 2012 at 5:29
Tested.
No issues were noticed when opening the test pages.
But somewhere in the file HtmlBuffer there is a mistake, or something was left out.
I used direct recoding via TBuffer. With the previous version of HtmlBuffer the
Buffer.Convert method did not work, while the Buffer.AsString method worked
fine (file Project1-Old.exe). With the new version of HtmlBuffer no recoding
occurs. Of the buttons "Buffer.Convert" and "Buffer.AsString", only the second
loads the document, and even then no recoding happens (file
Project1-New-HTML.exe). In the file "Project1-New.exe" no
conversion occurs even if the encoding is specified.
I have not yet been able to reproduce the problem described in Comment 17,
because recoding does not work.
Original comment by SchwarzK...@yandex.ru
on 9 Oct 2012 at 4:37
Attachments:
Additionally.
The encoding "KOI8-T" is based on code page 20866, i.e. "KOI8-R", but some of
its characters differ from "KOI8-R". I added it as number -5. I do not know
whether it is necessary to add a new module for it...
Original comment by SchwarzK...@yandex.ru
on 11 Oct 2012 at 1:56
If KOI8-T (CodePage "-5") differs from CodePage 20866 (KOI8-R) in only a few
characters, we can add a decoder in TBuffBaseConverter for CodePage -5 that uses
the same decoder as CodePage 20866, except for the differing characters.
Currently CodePage 20866 (KOI8-R) is converted by MultiByteToWideChar().
Can you post a "case" statement for the differing characters?
Thanks
OrphanCat
BTW: I committed a fix for the Convert()/AsString error. As you might have
noticed, I removed the TBuffer.Create(Text: AnsiString, ...) constructor. I did
it to avoid misunderstandings, because passing an "Ansi"-String implies an ANSI
code page (which one depends on charset) like a UnicodeString implies CodePage
1200.
And I added a constructor TBuffer.Create(Text: PByte, ByteCount: Integer, ...)
which is a more flexible one without implications. Please change the code in
your test application accordingly:
TBuffer.Convert(@RichEdit1.Text[1], Length(RichEdit1.Text), ...
TBuffer.Create(@RichEdit1.Text[1], Length(RichEdit1.Text), ...
Original comment by OrphanCat
on 11 Oct 2012 at 2:30
The differences can be seen visually at
http://ru.wikipedia.org/wiki/%CA%CE%C8-8. An obvious difference is the symbol
2116 (B9) in KOI8-T, plus a few others...
Original comment by SchwarzK...@yandex.ru
on 11 Oct 2012 at 2:50
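A sketch of the kind of "case" statement asked for above, containing only the one difference named in this thread and reading "2116 (B9)" as U+2116 at byte $B9, which is an interpretation. TByteDecoder, Koi8TDecode and PlaceholderKoi8R are hypothetical names; the placeholder is NOT a KOI8-R table and exists only so that the sketch compiles. Under Free Pascal this assumes {$mode delphi}.

```pascal
type
  // Hypothetical signature of a single-byte decoder.
  TByteDecoder = function(B: Byte): WideChar;

// Placeholder only: echoes the byte instead of doing a real KOI8-R lookup.
function PlaceholderKoi8R(B: Byte): WideChar;
begin
  Result := WideChar(B);
end;

// Handles the bytes that differ, and passes everything else to the base decoder.
function Koi8TDecode(B: Byte; BaseDecode: TByteDecoder): WideChar;
begin
  case B of
    $B9: Result := WideChar($2116);  // the difference named in this thread
    // further differing bytes would get their own branches here
  else
    Result := BaseDecode(B);
  end;
end;
```

In a real converter the base decoder would be whatever already handles code page 20866; the remaining differing bytes would have to be taken from a KOI8-T mapping table.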
After a while I will adapt my code to the new changes in the modules and test
re-encoding with "AsString" from one encoding to another.
Original comment by SchwarzK...@yandex.ru
on 11 Oct 2012 at 4:13
1. Please bring back the function
function CharSetToCodePage(ACharSet: String): Integer;
It is not used in the main code, but it is extremely useful for conversion.
2. The function "Convert(Text: TBuffString; CodePage: TBuffCodePage)" still
does not work correctly if the input string is a WideString, and neither does
the method "Create(Text: TBuffString; Name: TBuffString ='')". It turns out that
for conversion the input string has to be an AnsiString, passed via a pointer.
I have attached a usage example; if something is not clear, ask.
Or maybe there is something I do not understand. :)
Original comment by SchwarzK...@yandex.ru
on 12 Oct 2012 at 8:43
Attachments:
Original issue reported on code.google.com by
SchwarzK...@yandex.ru
on 8 Feb 2012 at 5:09