RamiAwar / dataline

Chat with your data - AI data analysis and visualization on CSV, Postgres, MySQL, Snowflake, SQLite...
https://dataline.app
GNU General Public License v3.0
502 stars, 49 forks

Add support for sas7bdat files #241

Closed: valentinplanes closed this 1 month ago

valentinplanes commented 1 month ago

Hello team! Great project and initiative! I'd like to propose adding support for sas7bdat files, which store datasets used mainly in SAS (and sometimes R). Best

RamiAwar commented 1 month ago

Woah that was fast! @valentinplanes

Nicely done, let me quickly test it! 😍

anthony2261 commented 1 month ago

Hey @valentinplanes! Thanks for submitting this. I'm looking into it, first time seeing this file type :)

I tested the changes, and my first comment is this: when we read the sas7bdat file, the "column name" is different from the "label name". To show what I mean, I downloaded a sample file called airline.sas7bdat from here and read it via pyreadstat:

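```python
import pyreadstat

file_path = "airline.sas7bdat"
# read_sas7bdat returns the data as a DataFrame plus a metadata object
data_df, meta = pyreadstat.read_sas7bdat(file_path)
# Descriptive labels, in the same order as data_df.columns
meta.column_labels
```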

image

I get the same problem when uploading it in Dataline and asking for the table's columns:

image

Do you think we should use the column labels instead? The AI might have trouble querying the data if the column names are ambiguous (in this case, single letters)

RamiAwar commented 1 month ago

Good catch, I was also just testing this. The code LGTM but this is indeed a potential UX issue. I think we'd have to understand the file format better before making this decision though. Do we always have a more descriptive label?

valentinplanes commented 1 month ago

Hello, thank you for your feedback and tests!

The labels are not always set, but when they are, they are definitely more meaningful. What we could do by default, as suggested in the pyreadstat docs:

For each column, if a corresponding label exists, we replace the column name with the label; otherwise we keep the column name.
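The rule above can be sketched roughly as follows. This is only an illustration of the idea, not the code that was pushed: `build_rename_map` is a hypothetical helper, the sample names and labels are made up, and skipping labels that would collide with another column name is an extra assumption on top of the rule as stated.

```python
def build_rename_map(column_names, column_labels):
    """Map each column name to its label when a usable label exists.

    Hypothetical helper: a label is skipped when it is missing, blank,
    or would clash with another column's name or an already-used label.
    """
    reserved = set(column_names)  # original names are never handed out as labels
    rename_map = {}
    for name, label in zip(column_names, column_labels):
        label = (label or "").strip()
        if label and (label == name or label not in reserved):
            rename_map[name] = label
            reserved.add(label)
        else:
            # No usable label: keep the original column name
            rename_map[name] = name
    return rename_map


# Made-up sample values, for illustration only
names = ["Y", "W", "R"]
labels = ["level of output", None, ""]
print(build_rename_map(names, labels))
```

With pyreadstat, the mapping could then be applied with something like `data_df.rename(columns=build_rename_map(meta.column_names, meta.column_labels))`, assuming `meta` comes from `pyreadstat.read_sas7bdat`.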

What do you think?

(File encoding also needs to be taken into consideration, but that is true of CSV files as well.)

valentinplanes commented 1 month ago

I've pushed the change in strategy, let me know what you think :)

image