daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Reading frame from CSV with string column in the middle #880

Closed: pdamme closed this issue 3 weeks ago

pdamme commented 3 weeks ago

The following little DaphneDSL script reads a frame from a CSV file:

example.daphne:

print(readFrame("data.csv"));

data.csv:

111,22.2,hello,333,44.4,555

data.csv.meta:

{
    "numRows": 1,
    "numCols": 6,
    "schema": [
        {"label": "a", "valueType": "si64"},
        {"label": "b", "valueType": "f64"},
        {"label": "c", "valueType": "str"},
        {"label": "d", "valueType": "si64"},
        {"label": "e", "valueType": "f64"},
        {"label": "f", "valueType": "si64"}
    ]
}

The script can be run with bin/daphne example.daphne.

Expected output:

Frame(1x6, [a:int64_t, b:double, c:std::string, d:int64_t, e:double, f:int64_t])
111 22.2 hello 333 44.4 555

Actual output:

Frame(1x6, [a:int64_t, b:double, c:std::string, d:int64_t, e:double, f:int64_t])
111 22.2 hello 0 333 44

Possible reason:

The problem seems to be related to the return value pos of the function setCString() in src/runtime/local/io/utils.h. ReadCsvFile<Frame>::apply() uses this return value as the position of the next column, but it is actually a position relative to the current column. The existing test cases don't seem to trigger this case, because they have the string column either as the first column (where relative and absolute positions are the same) or as the last column (where there is no next column anyway). The case is complicated by the fact that the position must start from 0 again if the string cell spans multiple lines.
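To make the relative-versus-absolute distinction concrete, here is a minimal, self-contained C++ sketch. It is not DAPHNE's code: relativeNextColumn(), absoluteNextColumn(), and cellAt() are hypothetical helpers invented for illustration, quoting and multi-line cells are ignored, and the sketch does not try to reproduce the exact wrong values shown above. It only shows why an offset measured from the start of the current cell has to be added to that cell's start before it can be used as the position of the next column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (NOT DAPHNE's setCString): returns the number of
// characters consumed by the cell starting at `start`, including the trailing
// delimiter if there is one. The result is RELATIVE to `start`.
std::size_t relativeNextColumn(const std::string &line, std::size_t start, char delim) {
    assert(start <= line.size());
    const std::size_t end = line.find(delim, start);
    if (end == std::string::npos)
        return line.size() - start; // last cell: consume the rest of the line
    return end - start + 1;         // cell length plus one delimiter
}

// Hypothetical helper: converts the relative offset into an absolute position
// within the line by adding the start of the current cell.
std::size_t absoluteNextColumn(const std::string &line, std::size_t start, char delim) {
    return start + relativeNextColumn(line, start, delim);
}

// Hypothetical helper: returns the text from `pos` up to the next delimiter
// (or the end of the line), i.e. what a reader would parse at that position.
std::string cellAt(const std::string &line, std::size_t pos, char delim) {
    const std::size_t end = line.find(delim, pos);
    return line.substr(pos, end == std::string::npos ? std::string::npos : end - pos);
}
```

With the sample line from data.csv, the string cell "hello" starts at offset 9. The relative offset of the next column is 6, which, if misused as an absolute position, points into the middle of "22.2". Adding the cell start gives 15, where "333" begins. When the string column is the first column, the cell start is 0 and both values coincide, which is consistent with the existing tests not catching the problem.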


This bug is currently preventing us from reading the lineorder table of the Star Schema Benchmark.

saminbassiri commented 3 weeks ago

Hi, I will work on this issue.

pdamme commented 3 weeks ago

Thanks! Please go ahead.