Closed by thareh 1 year ago
What answer are you expecting?
Your Print rownr will always output an "extra row" because it prints before your row test.
Move it to after the If statement, and it will report the correct rows.
The rownr thingy was just something I used to find the row in the non-trimmed data.csv, so ignore that.
The program never gets to Print "Done!"
For me, it crashes on row.GetColumn(i).GetValue() without any indication of what went wrong.
Seems it was because the file was ANSI-encoded; after converting it to UTF-8, it works fine.
Right... It expects the input data to be UTF8 encoded.
I see, thanks for looking into it.
Hmm, should the loader have a param which defaults to some EFileEncoding.utf8?
Passing a different encoding would let the loader convert to utf8 first.
Might make it clearer what input is expected.
On 29 January 2023 at 13:22:36 CET, Brucey @.***> wrote:
> Right... It expects the input data to be UTF8 encoded.
BlitzMax's support for different encodings isn't great, especially concerning different code pages - of which there are many.
Here's an example of the variety of available code pages : https://en.wikipedia.org/wiki/Code_page
Depending on the code page, the characters > 127 can represent different unicode code points.
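As an illustration (in Python, since it makes the comparison easy), here is how one and the same byte value above 127 decodes to completely different Unicode code points depending on which code page is assumed:

```python
# One raw byte value above 127, decoded under three different code pages.
raw = bytes([0x80])

print(raw.decode("cp1252"))   # Windows-1252: U+20AC (euro sign)
print(raw.decode("latin-1"))  # ISO-8859-1: U+0080 (a C1 control character)
print(raw.decode("cp437"))    # DOS code page 437: U+00C7 (Latin capital C with cedilla)
```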
Hmm OK. Thought there would be a way to ease the pain a bit.
Some file encoding 'guesser' lib?
If the source was Latin-1 / ISO-8859-1, which for Swedish on Windows is quite likely, you could load the file with the format set to ETextStreamFormat.LATIN1. That would probably also work.
Converting the file through Notepad++ to UTF8 works, but loading the file using ETextStreamFormat.LATIN1 does not seem to work. Any ideas on how this can be done?
So, basically what you are asking for is a way to magically read a stream in one encoding (eg. LATIN1), and have it convert to a different encoding (eg. UTF8), before passing the data to the buffer used in the stream read request? :)
Isn't that what TTextStream is supposed to do? I tried creating a TTextStream using ETextStreamFormat.LATIN1 (which is the same as ANSI & ISO-8859-1 yes?) which should convert the file to UTF8 "on the fly" if I'm not mistaken?
I'm guessing it's not quite as easy as that :)
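As an aside on why ETextStreamFormat.LATIN1 may not have helped here: Windows "ANSI" on Western-European systems is usually code page 1252, which is not quite the same as ISO-8859-1. The two only disagree in the $80-$9F range, where CP1252 places printable characters (the euro sign, curly quotes, dashes) and Latin-1 has control codes. A quick Python check of exactly where they differ:

```python
# Find every byte value that CP1252 and ISO-8859-1 (Latin-1) decode differently.
# errors="replace" turns the five undefined CP1252 bytes into U+FFFD.
diff = [b for b in range(256)
        if bytes([b]).decode("cp1252", errors="replace")
        != bytes([b]).decode("latin-1")]

print([hex(b) for b in diff])  # every differing byte lies in $80-$9F
```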
Indeed, good sir! Do you think iconv would do the trick? I tried to get the BaH.libiconv module running with no success.
Apologies for the delay. The issue is that BlitzMax doesn't have much in the way of built-in stream processing. We have, presumably, a file in code page 1252, and a process that expects a UTF8 stream.
I've been working on a more general solution, but it's taking a while, so here's something I knocked together today, which appears to work on your small example :
SuperStrict
Import BRL.Stream
Import BRL.StandardIO
Type TCP1252ToUTF8Stream Extends TStreamWrapper
Private
	' Maps each CP1252 byte to its unicode code point ($003F = '?' for the undefined bytes).
	Global conversionTable:Short[] = [ ..
$0000,$0001,$0002,$0003,$0004,$0005,$0006,$0007,$0008,$0009,$000A,$000B,$000C,$000D,$000E,$000F, ..
$0010,$0011,$0012,$0013,$0014,$0015,$0016,$0017,$0018,$0019,$001A,$001B,$001C,$001D,$001E,$001F, ..
$0020,$0021,$0022,$0023,$0024,$0025,$0026,$0027,$0028,$0029,$002A,$002B,$002C,$002D,$002E,$002F, ..
$0030,$0031,$0032,$0033,$0034,$0035,$0036,$0037,$0038,$0039,$003A,$003B,$003C,$003D,$003E,$003F, ..
$0040,$0041,$0042,$0043,$0044,$0045,$0046,$0047,$0048,$0049,$004A,$004B,$004C,$004D,$004E,$004F, ..
$0050,$0051,$0052,$0053,$0054,$0055,$0056,$0057,$0058,$0059,$005A,$005B,$005C,$005D,$005E,$005F, ..
$0060,$0061,$0062,$0063,$0064,$0065,$0066,$0067,$0068,$0069,$006A,$006B,$006C,$006D,$006E,$006F, ..
$0070,$0071,$0072,$0073,$0074,$0075,$0076,$0077,$0078,$0079,$007A,$007B,$007C,$007D,$007E,$007F, ..
$20AC,$003F,$201A,$0192,$201E,$2026,$2020,$2021,$02C6,$2030,$0160,$2039,$0152,$003F,$017D,$003F, ..
$003F,$2018,$2019,$201C,$201D,$2022,$2013,$2014,$02DC,$2122,$0161,$203A,$0153,$003F,$017E,$0178, ..
$00A0,$00A1,$00A2,$00A3,$00A4,$00A5,$00A6,$00A7,$00A8,$00A9,$00AA,$00AB,$00AC,$00AD,$00AE,$00AF, ..
$00B0,$00B1,$00B2,$00B3,$00B4,$00B5,$00B6,$00B7,$00B8,$00B9,$00BA,$00BB,$00BC,$00BD,$00BE,$00BF, ..
$00C0,$00C1,$00C2,$00C3,$00C4,$00C5,$00C6,$00C7,$00C8,$00C9,$00CA,$00CB,$00CC,$00CD,$00CE,$00CF, ..
$00D0,$00D1,$00D2,$00D3,$00D4,$00D5,$00D6,$00D7,$00D8,$00D9,$00DA,$00DB,$00DC,$00DD,$00DE,$00DF, ..
$00E0,$00E1,$00E2,$00E3,$00E4,$00E5,$00E6,$00E7,$00E8,$00E9,$00EA,$00EB,$00EC,$00ED,$00EE,$00EF, ..
$00F0,$00F1,$00F2,$00F3,$00F4,$00F5,$00F6,$00F7,$00F8,$00F9,$00FA,$00FB,$00FC,$00FD,$00FE,$00FF]
	Field StaticArray inBuffer:Byte[131072]
	Field offset:Int
	Field remaining:Int
	' UTF8 bytes that didn't fit into the caller's buffer, carried over to the next Read.
	Field StaticArray carryBytes:Byte[2]
	Field carryOffset:Int
	Field carryCount:Int

Public
	Method New(stream:TStream)
		SetStream(stream)
	End Method

	Method Read:Long( buf:Byte Ptr, count:Long ) Override
		Local destOffset:Int

		If count <= 0 Then
			Return 0
		End If

		' Flush any carried-over UTF8 bytes from the previous Read first.
		While carryCount And count
			buf[destOffset] = carryBytes[carryOffset]
			carryOffset :+ 1
			carryCount :- 1
			destOffset :+ 1
			count :- 1
		Wend

		' Compact unprocessed input to the front of the buffer.
		If remaining And offset Then
			MemMove(inBuffer, Byte Ptr(inBuffer) + offset, Size_T(remaining))
		End If
		offset = 0

		Local toRead:Int = count
		Local available:Int = inBuffer.Length - remaining
		If toRead > available Then
			toRead = available
		End If

		Local size:Int = _stream.Read(Byte Ptr(inBuffer) + remaining, toRead)
		remaining :+ size

		While count And remaining
			Local char:Int = conversionTable[inBuffer[offset]]
			offset :+ 1
			remaining :- 1

			If char < $80 Then
				' 1-byte UTF8 sequence.
				buf[destOffset] = char
				destOffset :+ 1
				count :- 1
			Else
				' 2 or 3 byte UTF8 sequence - code points >= $800 (e.g. CP1252's
				' euro sign, U+20AC) need three bytes. Trailing bytes that don't
				' fit into the caller's buffer are carried over to the next Read.
				carryOffset = 0
				If char < $800 Then
					buf[destOffset] = $C0 | char Shr 6
					carryBytes[0] = $80 | char & $3F
					carryCount = 1
				Else
					buf[destOffset] = $E0 | char Shr 12
					carryBytes[0] = $80 | (char Shr 6 & $3F)
					carryBytes[1] = $80 | char & $3F
					carryCount = 2
				End If
				destOffset :+ 1
				count :- 1

				While carryCount And count
					buf[destOffset] = carryBytes[carryOffset]
					carryOffset :+ 1
					carryCount :- 1
					destOffset :+ 1
					count :- 1
				Wend
			End If
		Wend

		Return destOffset
	End Method
End Type
You can use it like this :
Local stream:TStream = New TCP1252ToUTF8Stream(ReadStream("data.csv"))
What it does : it expects the input stream to be CP1252. It fills the buffer with a stream of UTF8 bytes.
It should work for large files (spanning multiple reads), with carry-over of UTF8 bytes between reads, but I haven't tested it much, so YMMV.
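The per-byte CP1252-to-UTF8 encoding that the wrapper performs (one byte below $80, a multi-byte sequence above) can be sanity-checked outside BlitzMax. Here is an independent Python sketch - not part of the module - that applies the same shift/mask arithmetic and cross-checks it against Python's built-in codecs for every possible input byte. Note that code points at or above $800, such as CP1252's euro sign (U+20AC), require three UTF8 bytes:

```python
def cp1252_byte_to_utf8(b: int) -> bytes:
    """Convert one CP1252 byte to UTF-8 with explicit shift/mask arithmetic."""
    # The five undefined CP1252 bytes map to '?', like the conversion table above.
    try:
        cp = ord(bytes([b]).decode("cp1252"))
    except UnicodeDecodeError:
        cp = ord("?")

    if cp < 0x80:
        return bytes([cp])                                   # 1-byte sequence
    elif cp < 0x800:
        return bytes([0xC0 | cp >> 6,                        # 2-byte sequence
                      0x80 | cp & 0x3F])
    else:
        return bytes([0xE0 | cp >> 12,                       # 3-byte sequence
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])

# Cross-check against the standard codec for all 256 input byte values.
for b in range(256):
    try:
        expected = bytes([b]).decode("cp1252").encode("utf-8")
    except UnicodeDecodeError:
        expected = b"?"
    assert cp1252_byte_to_utf8(b) == expected

print("all 256 byte values convert correctly")
```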
Thank you so much for looking into the issue, I really appreciate the work you do!
I ended up using the iconv command-line program for converting the files, but I will certainly try out the solution you posted - a built-in solution is always sleeker.
Thanks again, and sorry if I'm being a nuisance with all of my bug reports and feature requests.
I've added BRL.UTF8Stream. It allows you to wrap a stream of a given encoding with a TEncodingToUTF8Stream, producing a stream of UTF8 bytes.
So you can do something like this for a file with LATIN1 encoding :
Local stream:TStream = New TEncodingToUTF8Stream(ReadStream("data.csv"), EStreamEncoding.LATIN1)
It should be light on GC usage.
There will be a subsequent Text.Encoding module with some more supported encodings.
Spectacular! Thank you so much! :)
I think "Brl.UTF8Stream" could be well placed in text.mod too - or brl.UTF8Stream could then be a simple "wrapper" or alias (similar to brl.pngloader).
Hey,
Seems like I found another bug:
data.csv
From what I can tell it seems to be ZSV having issues again.
Thanks!