dashbitco / nimble_csv

A simple and fast CSV parsing and dumping library for Elixir
https://hexdocs.pm/nimble_csv

newlines failing only with Streams #75

Closed jeregrine closed 1 year ago

jeregrine commented 1 year ago

I found a strange CSV in the wild that uses \r as the newline separator. It fails to parse when using parse_stream, but seems to work with parse_string.

I attempted a fix but I had some trouble following the logic/macros.

(aside) I struggled while trying to narrow this down to a smaller CSV test case. I don't know if I'm dumb at using sed, but every time I ran sed -i 's/\r/\n/g' /tmp/mprop.csv, then head /tmp/mprop.csv > /tmp/mprop-smol.csv, then sed -i 's/\n/\r/g' /tmp/mprop-smol.csv, it would produce different file line endings than the original. Same with Elixir and String.replace. I am probably very tired and doing something wrong and dumb, or maybe it's a special secret \r that I cannot reproduce. (/aside)

Anyway, here is the test case. The data is public and the CSV is ~90 MB, so this should give you a case to work with.

Mix.install([
  {:req, "~> 0.3.6"},
  {:nimble_csv, "~> 1.2"},
])

mprop_file = "/tmp/mprop.csv"
unless File.exists?(mprop_file) do
  # public data feel free to run this
  Req.get!("https://data.milwaukee.gov/dataset/562ab824-48a5-42cd-b714-87e205e489ba/resource/0a2c7f31-cd15-4151-8222-09dd57d5f16d/download/mprop.csv", output: mprop_file)
end

NimbleCSV.define(CSV, newlines: ["\r"])

File.read!("/tmp/mprop.csv")
|> CSV.parse_string()
|> Enum.take(1) 

# Slow but succeeds.

File.stream!("/tmp/mprop.csv", read_ahead: 100_000)
|> CSV.parse_stream()
|> Enum.take(1) 

# Errors ** (NimbleCSV.ParseError) unexpected escape character " in "MAP_EXT\"\r\"0000005005\",\"\",\"2022\",\"....

josevalim commented 1 year ago

Yes, it makes sense. File.stream! is line based (and afaik not configurable).

jeregrine commented 1 year ago

@josevalim Just an FYI: if I am reading the File.stream! and IO.binread docs correctly, I can iterate by number of bytes too, and I get the same error:

File.stream!("/tmp/mprop.csv", [], 100_000)
|> CSV.parse_stream()
|> Enum.take(1) 
** (NimbleCSV.ParseError) unexpected escape character " in "MAP_EXT\"\r\"0000005005\",\"\",\"2022\",\"\",\"40\",\"\",\"2263\",\"2263\",\"\",\"N\",\"LAKE\",\"DR\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"RM4\",\"0\",\"0\",\"\",\"187000\",\"1000\",\"53202\",\"3\",\"1\",\"1\",\"\",\"C2\",\"\",\"1\",\"\",\"\",\"\",\"\",\"\",\"82932\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005010\",\"\",\"2022\",\"\",\"40\",\"\",\"9164\",\"9164\",\"\",\"N\",\"70TH\",\"ST\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"PD\",\"0\",\"0\",\"\",\"101\",\"3000\",\"53223\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"121602\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005020\",\"\",\"2022\",\"\",\"40\",\"\",\"8919\",\"8919\",\"\",\"N\",\"70TH\",\"ST\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"PD\",\"0\",\"0\",\"\",\"101\",\"3001\",\"53223\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"91341\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005030\",\"\",\"2022\",\"\",\"40\",\"\",\"9036\",\"9036\",\"\",\"N\",\"70TH\",\"ST\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",
\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"PD\",\"0\",\"0\",\"\",\"101\",\"3000\",\"53223\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"77221\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005048\",\"\",\"2022\",\"\",\"40\",\"\",\"9083\",\"9083\",\"\",\"N\",\"85TH\",\"ST\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"RM1\",\"0\",\"0\",\"\",\"101\",\"2006\",\"53224\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"108931\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005049\",\"\",\"2022\",\"\",\"40\",\"\",\"8425\",\"8425\",\"\",\"W\",\"ALLYN\",\"CT\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"RM1\",\"0\",\"0\",\"\",\"101\",\"2003\",\"53224\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"45569\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005050\",\"\",\"2022\",\"\",\"40\",\"\",\"9060\",\"9060\",\"\",\"N\",\"85TH\",\"ST\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"RM1\",\"0\",\"0\",\"\",\"101\",\"2003\",\"
53224\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"115504\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005056\",\"\",\"2022\",\"\",\"40\",\"\",\"8643\",\"8643\",\"\",\"W\",\"GREENBROOK\",\"DR\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"RT1\",\"0\",\"0\",\"\",\"101\",\"2004\",\"53224\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"16581\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005060\",\"\",\"2022\",\"\",\"40\",\"\",\"9076\",\"9076\",\"\",\"N\",\"95TH\",\"ST\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"PD\",\"0\",\"0\",\"\",\"201\",\"2000\",\"53224\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"15449\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005065\",\"\",\"2022\",\"\",\"40\",\"\",\"10880\",\"10880\",\"\",\"W\",\"DONNA\",\"DR\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"0.00\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"000000000\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"N\",\"N\",\"0\",\"0.00000\",\"0\",\"\",\"0\",\"0\",\"0\",\"0\",\"0\",\"0\",\"\",\"\",\"\",\"\",\"PD\",\"0\",\"0\",\"\",\"201\",\"1010\",\"53224\",\"9\",\"4\",\"6\",\"\",\"N1\",\"\",\"0\",\"\",\"\",\"\",\"\",\"\",\"40657\",\"99999\",\"XXXX\",\"9\",\"\",\"\"\r\"0000005068\",\"\",\"2022\",\"\",\"40\",\"\",\"8674\",\"8674\",\"\",\"N\",\"SERVITE\",\"DR\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"
0\",\"0\",\"\",\"CCA\",\"0\",\"0\",\"0\",\"\",\"0\",\"0\",\"0\",\"\",\"" <> ...
    /Users/jasonstiebs/Library/Caches/mix/installs/elixir-1.14.3-erts-13.2/9c525472ea870d261cb65ad0319c4a4d/deps/nimble_csv/lib/nimble_csv.ex:583: CSV.escape/6
    /Users/jasonstiebs/Library/Caches/mix/installs/elixir-1.14.3-erts-13.2/9c525472ea870d261cb65ad0319c4a4d/deps/nimble_csv/lib/nimble_csv.ex:453: anonymous fn/4 in CSV.parse_stream/2
    (elixir 1.14.3) lib/stream.ex:989: Stream.do_transform_user/6
    (elixir 1.14.3) lib/enum.ex:3448: Enum.take/2

josevalim commented 1 year ago

Yes, but the stream in NimbleCSV expects to receive lines separated by newlines. I don't think we can stream by "\r".

jeregrine commented 1 year ago

Sounds good, thanks for your patience with me. <3
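
One possible workaround, given that parse_stream expects each stream element to be a single line: re-chunk the byte stream into "\r"-terminated lines before handing it to the parser. The sketch below uses only the Elixir standard library. The module name CRLines and the function rechunk/2 are hypothetical names chosen for illustration and are not part of NimbleCSV. Whether parse_stream then accepts these lines with a parser defined via newlines: ["\r"] is an assumption that was not verified in this thread.

```elixir
defmodule CRLines do
  # Hypothetical helper, not part of NimbleCSV.
  # Turns a stream of arbitrary binary chunks into a stream of binaries
  # that each end with `newline` (the final one may lack it if the input
  # does not end with a newline).
  def rechunk(chunks, newline \\ "\r") do
    chunks
    # Append a sentinel so the reducer can flush the leftover buffer.
    |> Stream.concat([:eof])
    |> Stream.transform("", fn
      :eof, "" ->
        {[], ""}

      :eof, buffer ->
        # Input ended without a trailing newline: emit what is left.
        {[buffer], ""}

      chunk, buffer ->
        parts = String.split(buffer <> chunk, newline)
        # Everything but the last part is a complete line; the last part
        # is an incomplete line carried over to the next chunk.
        {complete, [rest]} = Enum.split(parts, -1)
        {Enum.map(complete, &(&1 <> newline)), rest}
    end)
  end
end

# Possible usage with the test case above (untested assumption):
#
#   File.stream!("/tmp/mprop.csv", [], 100_000)
#   |> CRLines.rechunk("\r")
#   |> CSV.parse_stream()
#   |> Enum.take(1)
```

The helper splits on every "\r", including any that sit inside quoted fields. That mirrors what a line-based File.stream! does with "\n", but it is worth checking against real data before relying on it.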