attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0
7.44k stars 266 forks source link

csv importer fails with FAA data #1049

Open aboodman opened 8 years ago

aboodman commented 8 years ago
importer!? ./importer -ds="test" -ldb="/tmp/testxx" -header="A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,AB,AC,AD,AE,AF,AG,AH,AI,AJ,AK,AL,AM,AN,AO,AP,AQ,AR,AS,AT,AU,AV,AW,AX,AY,AZ,BA,BB,BC,BD,BE,BF,BG,BH,BI,BJ,BK,BL,BM,BN,BO,BP,BQ,BR,BS,BT,BU,BV,BW,BX,BY,BZ,CA,CB,CC,CD,CE,CF,CG,CH,CI,CJ,CK,CL,CM,CN,CO,CP,CQ,CR,CS,CT,CU,CV,CW,CX,CY,CZ,DA,DB,DC,DD,DE,DF,DG,DH,DI,DJ,DK,DL,DM,DN,DO,DP,DQ,DR,DS,DT,DU,DV,DW,DX,DY,DZ,EA,EB,EC,ED,EE,EF,EG,EH,EI,EJ,EK,EL,EM,EN,EO,EP,EQ,ER,ES,ET,EU,EV,EW,EX,EY,EZ,FA,FB,FC,FD,FE,FF,FG,FH,FI,FJ,FK,FL,FM,FN,FO,FP,FQ,FR,FS,FT,FU,FV,FW,FX" "/Users/aa/Downloads/Accidents 2005-2009.csv"
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/attic-labs/noms/clients/csv.Read(0xc820015040, 0x601ea8, 0x3, 0xc8200c0000, 0xb4, 0xb4, 0xc820119c7c, 0x0, 0x0, 0x18801a0, ...)
    /Users/aa/src/github.com/attic-labs/noms/clients/csv/read.go:125 +0xa9d
main.main()
    /Users/aa/src/github.com/attic-labs/noms/clients/csv/importer/importer.go:93 +0xdee

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 24 [chan receive]:
github.com/attic-labs/noms/types.NewStreamingTypedList.func1(0x0, 0x0, 0x0, 0x0, 0xeb4980, 0xc8200ccc00, 0xc8200ccc20, 0x18801a0, 0xc82110a140, 0xc8200c81e0, ...)
    /Users/aa/src/github.com/attic-labs/noms/types/list.go:50 +0x26f
created by github.com/attic-labs/noms/types.NewStreamingTypedList
    /Users/aa/src/github.com/attic-labs/noms/types/list.go:55 +0xd6
aboodman commented 8 years ago

@kyleder's going to look at this!

kyleder commented 8 years ago

One of the issues here appears to be that the CSV (which was generated by Mac Excel) uses carriage returns instead of line feeds for the line terminator. From what I've read, Go really dislikes CR's so once I have some time tonight I'll look for an efficient way to convert the CR's to LF's. Another user complained about the same issue here which seems like a good place to start.

I believe that there is an additional issue with unbalanced quotations in some of the fields that is breaking the parser, but I want to make sure that the CR issue is resolved before tackling that.

aboodman commented 8 years ago

Oh yeah, we saw that on a different file, here: https://github.com/attic-labs/attic/issues/446.

@arv and @cmasone-attic looked at that package, but they did not like it because it blindly converts all CR to CRLF. Which means that files that have CRLF (which is the standard) will end up as LFLF, which is also going to break things.

We need a converter that changes CR into LF only when it appears on its own, not part of CRLF.

aboodman commented 8 years ago

I think that right thing to do here is to implement an io.Reader that wraps bufio.Reader and does the conversion. Then the CSV parser can be changed to take the CR-killing reader as input rather than a standard reader.

kyleder commented 8 years ago

Regarding macreader blindly replacing CR's, I think that that's OK since the CSV reader will ignore blank lines. But I think that it's irrelevant anyways because I extended the examples that ctessum provided to include one with CRLF's and the result was identical:

// testFile2 is a CSV file with CRLF line endings.
testFile2 := bytes.NewBufferString("a,b,c\r\n1,2,3\r\n").Bytes()

r3 := csv.NewReader(New(bytes.NewReader(testFile2)))
lines3, err := r3.ReadAll()
fmt.Printf("With macreader (CRLF): %#v\n", lines3)

if !reflect.DeepEqual(lines2, lines3) {
    t.Error("Expected CR result to match CRLF result")
}

I don't know enough about how this works to say with any certainty that this is acceptable, but on the surface it seems that way.

I'm going to try integrating macreader with noms to see if it fixes the CR issue and I'll report back.

arv commented 8 years ago

One issue with using macreader as is, is that it will replace \r\n inside quoted fields with \n\n which seems bad to me.

kyleder commented 8 years ago

@arv That's a good point.

I submitted a pull request to macreader for a change that ignores CRLF's (which seem to behave in Go). I verified that the change doesn't break quoted CRLF's.

Fixing the CR issue fixes the FAA data import which is working locally now with my fork of macreader.

aboodman commented 8 years ago

That's cool that the FAA data works with the changed macreader, and cool to fix the bug there. I think there is an issue with your patch to macreader, but I left it over there.