bmx-ng / text.mod

Text Utilities

Another CSV issue #9

Closed thareh closed 1 year ago

thareh commented 1 year ago

Hey,

Seems like I found another bug:

Framework BRL.Blitz
Import BRL.StandardIO
Import Text.CSV

Local stream:TStream = ReadStream("data.csv")

Local opts:TCsvOptions = New TCsvOptions()
opts.delimiter = ";"
opts.noQuotes = True

Local csv:TCsvParser = TCsvParser.Parse(stream, opts)

Local rownr:Int = 1

Repeat
    Print rownr

    Local status:ECsvStatus = csv.NextRow()

    If status <> ECsvStatus.row
        Exit
    EndIf

    Local row:TCSVRow = csv.GetRow()
    Local count:Int = row.ColumnCount()

    For Local i:Int = 0 Until count
        row.GetColumn(i).GetValue()
    Next

    rownr:+1
Forever

Print "Done!"

data.csv

col1;col2;col3;col4;col5;col6;col7;col8;col9;col10;col11;col12;col13;col14;col15;col16
245949;Ankarsrum Assistent Original 6230 Röd;Köksmaskin;Ankarsrum;Köksmaskin > 4 l;7350061086134;Assistent Original 6230R;4083,2;Pieces;10,000000;85167970;0;0;0;0;

From what I can tell it seems to be ZSV having issues again.

Thanks!

woollybah commented 1 year ago

What answer are you expecting?

Your Print rownr will always output an "extra row" because it prints before your row test. Move it to after the If statement and it reports the correct number of rows.

thareh commented 1 year ago

The rownr thingy was just something I used to find the row in the non-trimmed data.csv, so ignore that.

The program never gets to Print "Done!" for me, it crashes on row.GetColumn(i).GetValue() without any indication of what went wrong.

thareh commented 1 year ago

Seems it was because of the file being in ANSI-encoding, converted it to UTF-8 and it works fine.

woollybah commented 1 year ago

Right... It expects the input data to be UTF8 encoded.
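To make the failure concrete: in an ANSI (Windows-1252) file the ö in "Röd" is stored as the single byte $F6, which never occurs in valid UTF-8, whereas UTF-8 stores the same character as the two bytes C3 B6. The following is only a minimal sketch of how a code point below $10000 maps to UTF-8 bytes; the helper names Utf8Hex and HexByte are made up for this illustration and are not part of any module.

```blitzmax
SuperStrict

' Hypothetical helper: the last two hex digits of a value (one byte).
Function HexByte:String(value:Int)
    Return Right(Hex(value), 2)
End Function

' Hypothetical helper: UTF-8 encoding of a code point below $10000,
' returned as space-separated hex bytes.
Function Utf8Hex:String(codePoint:Int)
    If codePoint < $80 Then
        ' 1 byte: 0xxxxxxx
        Return HexByte(codePoint)
    Else If codePoint < $800 Then
        ' 2 bytes: 110xxxxx 10xxxxxx
        Return HexByte($C0 | (codePoint Shr 6)) + " " + HexByte($80 | (codePoint & $3F))
    End If
    ' 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Return HexByte($E0 | (codePoint Shr 12)) + " " + HexByte($80 | ((codePoint Shr 6) & $3F)) + " " + HexByte($80 | (codePoint & $3F))
End Function

' "ö" is code point U+00F6; CP1252 and Latin-1 store it as the single byte $F6.
Print "U+00F6 as UTF-8: " + Utf8Hex($F6)
' The euro sign is U+20AC; CP1252 stores it as the single byte $80.
Print "U+20AC as UTF-8: " + Utf8Hex($20AC)
```

A parser that assumes UTF-8 and meets a lone $F6 byte has no valid way to interpret it, which is consistent with the crash described above.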

thareh commented 1 year ago

I see, thanks for looking into it.

GWRon commented 1 year ago

Hmm, should the loader have a param which defaults to something like EFileEncoding.utf8?

Passing a different encoding lets the loader convert to utf8 first.

Might make it more clear what input is expected.

On 29 January 2023 at 13:22:36 CET, Brucey wrote:

Right... It expects the input data to be UTF8 encoded.


woollybah commented 1 year ago

BlitzMax's support for different encodings isn't great, especially concerning different code pages - of which there are many.

Here's an example of the variety of available code pages : https://en.wikipedia.org/wiki/Code_page

Depending on the code page, byte values above 127 can represent different Unicode code points.

GWRon commented 1 year ago

Hmm OK. Thought there would be a way to ease the pain a bit.

Some file encoding 'guesser' lib?

woollybah commented 1 year ago

If the source was Latin-1 / ISO-8859-1, which is likely for Swedish on Windows, you could load the file with the format set to ETextStreamFormat.LATIN1. That would probably also work.

thareh commented 1 year ago

Converting the file through Notepad++ to UTF8 works, but loading the file using ETextStreamFormat.LATIN1 does not seem to work. Any ideas on how this can be done?

woollybah commented 1 year ago

So, basically what you are asking for is a way to magically read a stream in one encoding (eg. LATIN1), and have it convert to a different encoding (eg. UTF8), before passing the data to the buffer used in the stream read request? :)

thareh commented 1 year ago

Isn't that what TTextStream is supposed to do? I tried creating a TTextStream using ETextStreamFormat.LATIN1 (which is the same as ANSI & ISO-8859-1 yes?) which should convert the file to UTF8 "on the fly" if I'm not mistaken?

woollybah commented 1 year ago

I'm guessing it's not quite as easy as that :)

thareh commented 1 year ago

Indeed, good sir! Do you think iconv would do the trick? I tried to get the BaH.libiconv module running with no success.

woollybah commented 1 year ago

Apologies for the delay. The issue is that BlitzMax doesn't have much in the way of built-in stream processing. We have, presumably, a file in code page 1252, and a process that expects a UTF8 stream.

I've been working on a more general solution, but it's taking a while, so here's something I knocked together today, which appears to work on your small example :

SuperStrict

Import BRL.Stream
Import BRL.StandardIO

Type TCP1252ToUTF8Stream Extends TStreamWrapper

    Private
    Global conversionTable:Short[] = [ ..
        $0000,$0001,$0002,$0003,$0004,$0005,$0006,$0007,$0008,$0009,$000A,$000B,$000C,$000D,$000E,$000F, ..
        $0010,$0011,$0012,$0013,$0014,$0015,$0016,$0017,$0018,$0019,$001A,$001B,$001C,$001D,$001E,$001F, ..
        $0020,$0021,$0022,$0023,$0024,$0025,$0026,$0027,$0028,$0029,$002A,$002B,$002C,$002D,$002E,$002F, ..
        $0030,$0031,$0032,$0033,$0034,$0035,$0036,$0037,$0038,$0039,$003A,$003B,$003C,$003D,$003E,$003F, ..
        $0040,$0041,$0042,$0043,$0044,$0045,$0046,$0047,$0048,$0049,$004A,$004B,$004C,$004D,$004E,$004F, ..
        $0050,$0051,$0052,$0053,$0054,$0055,$0056,$0057,$0058,$0059,$005A,$005B,$005C,$005D,$005E,$005F, ..
        $0060,$0061,$0062,$0063,$0064,$0065,$0066,$0067,$0068,$0069,$006A,$006B,$006C,$006D,$006E,$006F, ..
        $0070,$0071,$0072,$0073,$0074,$0075,$0076,$0077,$0078,$0079,$007A,$007B,$007C,$007D,$007E,$007F, ..
        $20AC,$003F,$201A,$0192,$201E,$2026,$2020,$2021,$02C6,$2030,$0160,$2039,$0152,$003F,$017D,$003F, ..
        $003F,$2018,$2019,$201C,$201D,$2022,$2013,$2014,$02DC,$2122,$0161,$203A,$0153,$003F,$017E,$0178, ..
        $00A0,$00A1,$00A2,$00A3,$00A4,$00A5,$00A6,$00A7,$00A8,$00A9,$00AA,$00AB,$00AC,$00AD,$00AE,$00AF, ..
        $00B0,$00B1,$00B2,$00B3,$00B4,$00B5,$00B6,$00B7,$00B8,$00B9,$00BA,$00BB,$00BC,$00BD,$00BE,$00BF, ..
        $00C0,$00C1,$00C2,$00C3,$00C4,$00C5,$00C6,$00C7,$00C8,$00C9,$00CA,$00CB,$00CC,$00CD,$00CE,$00CF, ..
        $00D0,$00D1,$00D2,$00D3,$00D4,$00D5,$00D6,$00D7,$00D8,$00D9,$00DA,$00DB,$00DC,$00DD,$00DE,$00DF, ..
        $00E0,$00E1,$00E2,$00E3,$00E4,$00E5,$00E6,$00E7,$00E8,$00E9,$00EA,$00EB,$00EC,$00ED,$00EE,$00EF, ..
        $00F0,$00F1,$00F2,$00F3,$00F4,$00F5,$00F6,$00F7,$00F8,$00F9,$00FA,$00FB,$00FC,$00FD,$00FE,$00FF]

    Field StaticArray inBuffer:Byte[131072]
    Field offset:Int
    Field remaining:Int

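    ' UTF8 bytes that did not fit into the caller's buffer on the previous Read.
    ' One CP1252 byte expands to at most 3 UTF8 bytes, so at most 2 can be left over.
    Field StaticArray carry:Byte[2]
    Field carryCount:Int

    ' Scratch space for the UTF8 encoding of a single character.
    Field StaticArray encoded:Byte[3]

    Public

    Method New(stream:TStream)
        SetStream(stream)
    End Method

    Method Read:Long( buf:Byte Ptr,count:Long ) Override
        Local destOffset:Int

        If Not count Then
            Return 0
        End If

        ' Emit bytes carried over from the previous call first.
        While carryCount And count
            buf[destOffset] = carry[0]
            carry[0] = carry[1]
            carryCount :- 1
            destOffset :+ 1
            count :- 1
        Wend

        ' Move any unconverted input to the front of the buffer.
        If remaining Then
            MemMove(inBuffer, Byte Ptr(inBuffer) + offset, Size_T(remaining))
        End If
        ' Always restart at the front, including when the buffer was fully consumed.
        offset = 0

        ' Top up the input buffer. Each input byte yields at least one output byte,
        ' so there is no point in reading more than count bytes.
        Local toRead:Int = count
        Local available:Int = inBuffer.Length - remaining
        If toRead > available Then
            toRead = available
        End If

        Local size:Int = _stream.Read(Byte Ptr(inBuffer) + remaining, toRead)
        remaining :+ size

        While count And remaining
            Local char:Int = conversionTable[inBuffer[offset]]
            offset :+ 1
            remaining :- 1

            ' Encode the code point as 1, 2 or 3 UTF8 bytes.
            ' Table entries such as $20AC are above $7FF and need 3 bytes.
            Local n:Int
            If char < $80 Then
                encoded[0] = char
                n = 1
            Else If char < $800 Then
                encoded[0] = $C0 | (char Shr 6)
                encoded[1] = $80 | (char & $3F)
                n = 2
            Else
                encoded[0] = $E0 | (char Shr 12)
                encoded[1] = $80 | ((char Shr 6) & $3F)
                encoded[2] = $80 | (char & $3F)
                n = 3
            End If

            ' Write what fits; keep the rest for the next Read.
            For Local i:Int = 0 Until n
                If count Then
                    buf[destOffset] = encoded[i]
                    destOffset :+ 1
                    count :- 1
                Else
                    carry[carryCount] = encoded[i]
                    carryCount :+ 1
                End If
            Next
        Wend

        Return destOffset
    End Method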

End Type

You can use it like this :

Local stream:TStream = New TCP1252ToUTF8Stream(ReadStream("data.csv"))

What it does : it expects the input stream to be CP1252. It fills the buffer with a stream of UTF8 bytes.

It should work for large files (of multiple reads), and carry-over of UTF8 bytes between reads, but I haven't tested it too much, so YMMV.

thareh commented 1 year ago

Thank you so much for looking into the issue, I really appreciate the work you do!

I ended up using iconv command-line program for converting the files, but I will certainly try out the solution you posted - a built in solution is always sleeker.

Thanks again, and sorry if I'm being a nuisance with all of my bug reports and feature requests.

woollybah commented 1 year ago

I've added BRL.UTF8Stream. It allows you to wrap a stream of a given encoding with a TEncodingToUTF8Stream, producing a stream of UTF8 bytes.

So you can do something like this for a file with LATIN1 encoding :

Local stream:TStream = New TEncodingToUTF8Stream(ReadStream("data.csv"), EStreamEncoding.LATIN1)

It should be light on GC usage.
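Combined with the parser code from the original report, the call above might be used roughly as follows. This is a sketch rather than a tested program: it assumes the BRL.UTF8Stream module mentioned above is installed and that data.csv really is Latin-1 encoded. All other calls are taken from the snippets earlier in this thread.

```blitzmax
SuperStrict

Framework BRL.Blitz
Import BRL.StandardIO
Import BRL.UTF8Stream ' assumption: the module described in the comment above
Import Text.CSV

Local source:TStream = ReadStream("data.csv")
If Not source Then
    Print "Could not open data.csv"
    End
End If

' Wrap the raw stream so the CSV parser only ever sees UTF8 bytes.
' Assumption: the file is Latin-1; pick the encoding that matches your data.
Local stream:TStream = New TEncodingToUTF8Stream(source, EStreamEncoding.LATIN1)

Local opts:TCsvOptions = New TCsvOptions()
opts.delimiter = ";"
opts.noQuotes = True

Local csv:TCsvParser = TCsvParser.Parse(stream, opts)

' Print every column of every row, as in the original report.
While csv.NextRow() = ECsvStatus.row
    Local row:TCSVRow = csv.GetRow()
    For Local i:Int = 0 Until row.ColumnCount()
        Print row.GetColumn(i).GetValue()
    Next
Wend

Print "Done!"
```

The conversion happens inside the wrapper, so the parsing loop itself does not change when the source encoding does.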

There will be a subsequent Text.Encoding module with some more supported encodings.

thareh commented 1 year ago

Spectacular! Thank you so much! :)

GWRon commented 1 year ago

Think "Brl.UTF8Stream" could be well located in text.mod then too - or brl.UTF8Stream could then be a simple "wrapper" or alias (similar to brl.pngloader)