enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0

Implement `Data.read_many` #11311

Open GregoryTravis opened 1 month ago

GregoryTravis commented 1 month ago

Reads a set of files into a table. The result is either one row per table, or the combined contents of all the tables.

Data.read_many
    files:(Vector File)
    format:File_Format=Auto_Detect
    include:Vector File_Attribs=..FileName|..Full_Path|Nothing
    return:Return_As=Merged_Table
    on_problems:Problem_Behavior
    -> Table
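
The two result shapes described above can be illustrated with a minimal sketch. It is written in Python, not Enso, and does not use the real API: the `reader` callback, the `"Path"` and `"Value"` column names, and the lowercase `return_as` strings are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]  # a table is modelled as a list of row dictionaries


def read_many(paths: List[str], reader: Callable[[str], Table],
              return_as: str = "merged_table") -> Table:
    """Read every path with `reader` and combine the results."""
    loaded = [(path, reader(path)) for path in paths]

    if return_as == "table_of_tables":
        # One row per file; the file's table sits in the "Value" cell.
        return [{"Path": path, "Value": table} for path, table in loaded]

    if return_as == "merged_table":
        # Union of column names, in first-seen order.
        columns: List[str] = []
        for _, table in loaded:
            for row in table:
                for name in row:
                    if name not in columns:
                        columns.append(name)
        # Every row from every file; cells a file lacks become None.
        return [
            {"Path": path, **{name: row.get(name) for name in columns}}
            for path, table in loaded
            for row in table
        ]

    raise ValueError(f"unknown return_as: {return_as!r}")


# Hypothetical in-memory "files" standing in for real ones.
fake_files = {
    "a.csv": [{"X": 1, "Y": 2}],
    "b.csv": [{"X": 3, "Z": 4}, {"X": 5, "Z": 6}],
}
nested = read_many(list(fake_files), fake_files.__getitem__, "table_of_tables")
merged = read_many(list(fake_files), fake_files.__getitem__)
```

Here `nested` has one row per file, while `merged` has one row per source row, with columns taken from the union of all files. How the real implementation matches columns is not specified at this point in the thread.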

Implement File_Format.read_many: like File_Format.read but reads a set of files.

File_Format.read_many
    files:(Vector File)
    on_problems:Problem_Behavior
    -> (Vector Any) ! File_Error
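
One way a per-format `read_many` could treat `on_problems` is sketched below in Python. This is an assumption, not the Enso implementation: the function name `format_read_many`, the `reader` callback, the lowercase mode strings, and the choice to leave a `None` placeholder for a file that failed are all hypothetical.

```python
import warnings
from typing import Any, Callable, List

MODES = ("ignore", "report_warning", "report_error")


def format_read_many(paths: List[str], reader: Callable[[str], Any],
                     on_problems: str = "report_warning") -> List[Any]:
    """Read each path; handle read failures according to `on_problems`."""
    if on_problems not in MODES:
        raise ValueError(f"unknown on_problems: {on_problems!r}")
    results: List[Any] = []
    for path in paths:
        try:
            results.append(reader(path))
        except OSError as error:
            if on_problems == "report_error":
                raise  # stop at the first failing file
            if on_problems == "report_warning":
                warnings.warn(f"could not read {path}: {error}")
            # Assumption: keep a placeholder so results align with paths.
            results.append(None)
    return results


def fake_reader(path: str) -> str:
    """Hypothetical reader: one path is treated as missing."""
    if path == "missing.csv":
        raise FileNotFoundError(path)
    return f"contents of {path}"


ignored = format_read_many(["a.csv", "missing.csv"], fake_reader, "ignore")
```

Keeping results aligned with the input paths makes it possible to pair each result with its file name later, but dropping failed files would be an equally plausible design.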

jdunkerley commented 2 weeks ago

my_workbook = Data.read blah.xlsx

-> read => single sheet name or address, returns a table.
-> read_many => Vector of names, returns either a merged table or a table of tables.

read_many : Vector Text -> Headers -> Return_As -> Problem_Behavior -> Table
read_many self sheet_names:Vector=self.sheet_names (headers:Headers=..Detect_Headers) (return:Return_As=..Merged_Table) (on_problems:Problem_Behavior=..Report_Warning) =

Sheet1 A1:E5
Sheet2 A1:B3

my_workbook.expand_to_rows

SheetName  Value
Sheet1     Table
Sheet2     Table

my_workbook.expand_to_rows

SheetName  Value
Sheet1     Row
...
Sheet1     Row
Sheet2     Row
...
Sheet2     Row

expand_to_columns

SheetName  A  B  C  D  E
...

FileName   SheetName  A  B  C  D  E
June.xlsx  Week1      1  3  5  3  1
June.xlsx  Week2      2  4  6  4  2
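
The progression above, from a table of tables to one row per sheet row to one column per cell, can be sketched in Python. The two functions borrow the names `expand_to_rows` and `expand_to_columns` from the notes, but their behaviour here is an assumption; tables are modelled as lists of dictionaries and the nested column is assumed to be called `Value`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def expand_to_rows(table: List[Row]) -> List[Row]:
    """Turn each nested table in "Value" into one outer row per inner row."""
    return [
        {**outer, "Value": inner}
        for outer in table
        for inner in outer["Value"]
    ]


def expand_to_columns(table: List[Row]) -> List[Row]:
    """Spread each row held in "Value" into columns of the outer table."""
    return [
        {**{key: cell for key, cell in row.items() if key != "Value"},
         **row["Value"]}
        for row in table
    ]


# The June.xlsx example from the notes, as a table of tables.
june = [
    {"FileName": "June.xlsx", "SheetName": "Week1",
     "Value": [{"A": 1, "B": 3, "C": 5, "D": 3, "E": 1}]},
    {"FileName": "June.xlsx", "SheetName": "Week2",
     "Value": [{"A": 2, "B": 4, "C": 6, "D": 4, "E": 2}]},
]
flat = expand_to_columns(expand_to_rows(june))
```

After both steps, `flat` holds the two rows of the merged table shown above, with `FileName` and `SheetName` carried through from the outer table.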

my_files = ["blah.xlsx", "blah2.xlsx", ...]

my_files.map .read => Vector
Data.read_many my_files ..Table_of_Tables => Table

Data.read

read : Text | URI | File -> File_Format -> Problem_Behavior -> Any ! File_Error
read path=(Missing_Argument.throw "path") format=Auto_Detect (on_problems : Problem_Behavior = ..Report_Warning) = case path of

Data.read_many paths:Vector Text format=Auto_Detect (return:Return_As=..Merged_Table) (on_problems:Problem_Behavior=..Report_Warning)

Step 0: Decode the paths argument. Do a type class of FileManyList
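
Step 0 might look like the following Python sketch, which normalises several input shapes into a flat list of paths. Everything here is hypothetical: the thread does not define `FileManyList`, so the accepted shapes (a single path, a sequence of paths, or a table with a path column) and the default column name `"path"` are assumptions.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Union

PathLike = Union[str, Path]
TableLike = Dict[str, List[PathLike]]  # column name -> column values


def decode_paths(source: Union[PathLike, Iterable[PathLike], TableLike],
                 path_column: str = "path") -> List[Path]:
    """Normalise the `paths` argument into a list of Path objects."""
    if isinstance(source, (str, Path)):
        # A single path becomes a one-element list.
        return [Path(source)]
    if isinstance(source, dict):
        # A table: take the paths from the named column.
        if path_column not in source:
            raise KeyError(f"table has no {path_column!r} column")
        return [Path(value) for value in source[path_column]]
    # Anything else is treated as a vector or column of paths.
    return [Path(value) for value in source]


from_vector = decode_paths(["blah.xlsx", "blah2.xlsx"])
from_table = decode_paths({"path": ["blah.xlsx"], "size": [10]})
```

Decoding up front keeps the rest of the reading logic independent of how the caller supplied the files.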

Challenges:

Tasks:

enso-bot[bot] commented 2 weeks ago

Radosław Waśko reports a new STANDUP for today (2024-10-31):

Progress: Implemented a basic read many from a vector, column or table into a vector. It is not tested yet, but the general API shape and implementation are drafted. It should be finished by 2024-11-07.

Next Day: Next day I will be working on the same task. Add tests and make sure they pass. Continue work: add new return types (simple table and merged table). Merge with Excel read many. Tests.

enso-bot[bot] commented 1 week ago

Radosław Waśko reports a new STANDUP for yesterday (2024-11-04):

Progress: Added tests and fixed edge cases. Created the first PR. It should be finished by 2024-11-07.

Next Day: Next day I will be working on the same task. Continue work: add new return types (simple table and merged table). Merge with Excel read many. More tests.

enso-bot[bot] commented 1 week ago

Radosław Waśko reports a new STANDUP for yesterday (2024-11-05):

Progress: Adding return as table, defaults dependent on input type, basic logic for merging tables. It should be finished by 2024-11-07.

Next Day: Next day I will be working on the same task. Add tests, merge with Excel read many.

enso-bot[bot] commented 1 week ago

Radosław Waśko reports a new STANDUP for yesterday (2024-11-06):

Progress: Adding more edge case tests, discussing edge case behaviour. Fixes to make the logic actually work. It should be finished by 2024-11-07.

Next Day: Next day I will be working on the same task. Continue fixes, think how to merge with Excel. Align expand_ methods.

enso-bot[bot] commented 1 week ago

Radosław Waśko reports a new STANDUP for yesterday (2024-11-07):

Progress: Fixing tests, edge cases for JS_Object. It should be finished by 2024-11-08.

Next Day: Next day I will be working on the same task. Test for XLS. Align expand_ methods.

enso-bot[bot] commented 1 week ago

Radosław Waśko reports a new STANDUP for today (2024-11-08):

Progress: Reviewing PRs (looking at Number Parser and trying to understand it, so took some time). Trying out Excel read many in practice. Fixing widgets for Data.read_many. Starting work on merging the logic with Excel read many. It should be finished by 2024-11-13.

Next Day: Next day I will be working on the same task. Merge with Excel read many. For now without an XLS test, just keep the logic unified. Add a test for Data.read_many for Match_Columns and Columns_To_Keep. Prepare PR. Create tickets for expanding sheets in Data.read_many, and add a ticket for aligning types in expand_* methods.

enso-bot[bot] commented 4 days ago

Radosław Waśko reports a new STANDUP for yesterday (2024-11-12):

Progress: Added test cases for reading many Excel files with many sheets, cases for merging types and some other edge cases (0-element arrays). Working on union logic. It should be finished by 2024-11-13.

Next Day: Next day I will be working on the same task. Finish the new logic to get the tests to pass.

enso-bot[bot] commented 3 days ago

Radosław Waśko reports a new STANDUP for yesterday (2024-11-13):

Progress: Adding more edge cases. Found a problem with the union logic in the case of all-null columns; discussed prioritizing the Value_Type.Null ticket. Fixed the empty table edge case. It should be finished by 2024-11-14.

Next Day: Next day I will be working on the same task. Fix empty array edge case and prepare a PR.

enso-bot[bot] commented 2 days ago

Radosław Waśko reports a new STANDUP for yesterday (2024-11-14):

Progress: Discussed the empty table/array edge case, updated tests. Added edge case tests for unusually shaped files. Put up the PR. It should be finished by 2024-11-15.

Next Day: Next day I will be working on the #6281 task. Fix last failing test, start new task.