daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0
67 stars 62 forks source link

834 support string value type in frame #860

Closed saminbassiri closed 1 month ago

saminbassiri commented 1 month ago

This PR closes issue #834 by adding support for reading CSV files containing String values into Frames.

Changes:

Testing:

auge commented 1 month ago

Hi @saminbassiri ,

thanks for providing this PR, reading frames with strings now works 😃 !

However, I was following up on a small example use-case using SQL on dataframes, and this crashes. I'm not sure, but I guess it's still related to string processing. Problems which are related to the SQL function can IMHO be addressed in a separate issue/PR.

data.csv and the corresponding .meta file can be taken from #834

df = readFrame("data.csv");

// works (even with slicing 😄 )
print(df[0:5,:]);

registerView("data", df);

// crashes
print(sql("SELECT data.temp FROM data WHERE data.temp>30.0; "));

// does not compile
#print(sql("SELECT data.city, data.temp FROM data WHERE data.temp>30; "));
#print(sql("SELECT data.city, MIN(data.temp), AVG(data.temp), MAX(data.temp) FROM data GROUP BY data.city ;"));

attached is the complete output of above script (I compiled your PR on my own using the daphne build container) - it starts with printing the first 5 sliced rows…:

Frame(5x2, [city:std::string, temp:double])
Algiers 3.4
St. John's 4.3
Dodoma 26.3
Toliara 17
Yellowknife 4
free(): invalid pointer
bin/daphne(+0x148b862)[0x561ff7b98862]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7f513a503320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7f513a55cb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7f513a50326e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7f513a4e68ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7f513a4e77b6]
/lib/x86_64-linux-gnu/libc.so.6(+0xa8fe5)[0x7f513a566fe5]
/lib/x86_64-linux-gnu/libc.so.6(+0xab37c)[0x7f513a56937c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_free+0x7e)[0x7f513a56bd9e]
/home/ubuntu/daphne/bin/../lib/libAllKernels.so(+0x7892e5)[0x7f51355212e5]
/home/ubuntu/daphne/bin/../lib/libAllKernels.so(+0x92439b)[0x7f51356bc39b]
/home/ubuntu/daphne/bin/../lib/libAllKernels.so(_decRef__Structure+0x9e)[0x7f513568664e]
[0x7f513ac312ef]
[0x7f513ac313fd]
[0x7f513ac3141d]
bin/daphne(+0x1e05215)[0x561ff8512215]
bin/daphne(+0x14ad133)[0x561ff7bba133]
bin/daphne(+0x14b25ad)[0x561ff7bbf5ad]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7f513a4e81ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7f513a4e828b]
bin/daphne(+0x148a325)[0x561ff7b97325]
[error]: Got an abort signal from the execution engine. Most likely an exception in a shared library. Check logs!
Execution error: Returning from signal 6

Also note that the last 2 lines do not compile, as daphne is not able to properly infer the string type:

[error]: Lowering pipeline error.{}
PassManager failed module lowering, responsible IR written to module_fail.log.
RewriteToCallKernelOpPass failed with the following message [ no kernel for operation `cast` available for the required input types `(!daphne.Frame<?x[?: !daphne.Unknown], ?>)` and output types `(!daphne.Matrix<?x?x!daphne.String>)` for backend `CPP`, registered kernels for this op:

when the dataframe had only numbers (instead of strings in the 1st column), the SQL function works.

KR, Benjamin

pdamme commented 1 month ago

@auge: I can reproduce the error you mention. I think it's not directly related to representing frames with string columns and reading CSV files with string columns into frames. Thus, I would merge this PR as it is. However, rest assured that we will fix that bug, because we definitely want to enable SQL queries on such frames.