Open Dridus opened 9 years ago
The code is merged, but perhaps this issue can stay open until we get the tests sorted out.
@acowley have you thought about piggybacking Frames on top of an existing CSV library, like Cassava?
I thought about it when I started, and opted not to for a couple of weak reasons: 1) I've used a few such libraries in the past, Cassava most recently, and didn't enjoy it; 2) getting the basics working isn't much code.
I'm willing to believe that my first reason there was largely due to unfamiliarity, but the trouble I had involved coping with imperfectly formatted files. Since the functionality here is intended to compete with more commonly-used alternatives in languages that are much looser with correctness, I thought it would be useful to give myself the room to be flexible regarding inputs.
If it turns out that Cassava's performance kills the simple code here, or we encounter recurring problems of parser corner cases, then I'd be happy to delegate that functionality to a dedicated library. On the flip side, I'm also totally willing to add parser wrinkles to deal with data files people encounter even if those files are in some sense formatted incorrectly. My guiding principle here is that the data is right.
Just as a data point, I use Cassava at work and haven't had any performance issues with it. In one case the file was about 2 GB. In another case, the data files had different numbers of columns (everything is a Double). How would Frames handle the latter case?
I'm sorry, that was poor word choice. I meant that if Cassava's performance is far better than the simplistic approach taken here, it could be worth relying on it for parsing. I've not tried a 2GB file, but between streaming approaches with pipes, and unboxed vectors for columns of Doubles if you want to keep things in memory, such a file shouldn't be a problem. Let me know if you give it a shot.
I don't think I'm properly understanding the question, though. Why would different numbers of columns in different files affect anything? You'd have distinct row types for different files, and could work with functions that operate over the common columns, or functions specific to a particular row type, or class-constrained functions that don't care about particular columns.
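To make the "class constrained functions" idea concrete, here is a minimal sketch in plain Haskell. It deliberately does not use the Frames API: RowA, RowB, HasValue and meanValue are hypothetical names, and hand-written records stand in for the row types Frames would derive from each file. The point is only that one function can serve rows from files with different column counts, as long as each row type provides the column the function needs.

```haskell
-- Hypothetical row types standing in for two files with different
-- numbers of columns (every column is a Double). Not Frames types.
data RowA = RowA { aTime :: Double, aValue :: Double }
data RowB = RowB { bTime :: Double, bValue :: Double, bError :: Double }

-- A class naming the one column the shared function cares about.
class HasValue r where
  value :: r -> Double

instance HasValue RowA where
  value = aValue

instance HasValue RowB where
  value = bValue

-- Works for any row type that has a "value" column and ignores the rest.
-- Note: the result is NaN for an empty list (0 / 0).
meanValue :: HasValue r => [r] -> Double
meanValue rows = sum (map value rows) / fromIntegral (length rows)
```

With these definitions, meanValue accepts both [RowA] and [RowB] without knowing about bError. A function specific to RowB can still use that extra column directly.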
E.g. the row:
will be parsed as
["foo", "\"bar", "baz\"", "qux"]
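One way a row could end up tokenized like this is if fields are split on every comma without regard to quoting. The sketch below shows that behaviour using only the Prelude. splitOnComma is a hypothetical helper, not the Frames or Cassava parser, and exampleRow is an assumed input, since the original row is not shown above.

```haskell
-- Naive field splitter: breaks on every comma and ignores quoting.
-- Hypothetical helper for illustration; not the Frames or Cassava API.
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOnComma rest

-- Assumed example row: a quoted field that itself contains a comma.
exampleRow :: String
exampleRow = "foo,\"bar,baz\",qux"
```

Evaluating splitOnComma exampleRow gives four fields, with the quote characters left attached to the second and third, because the comma inside the quotes is treated as a separator.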