Implement `Data.read_many`

GregoryTravis commented 1 month ago

Reads a set of files into a table. The result is either one row per table, or the combined contents of all the tables.

Data.read_many
    files:(Vector File)
    format:File_Format=Auto_Detect
    include:Vector File_Attribs=..FileName|..Full_Path|Nothing
    return:Return_As=Merged_Table
    on_problems:Problem_Behavior
    -> Table

Implement File_Format.read_many: like File_Format.read but reads a set of files.

File_Format.read_many
    files:(Vector File)
    on_problems:Problem_Behavior
    -> (Vector Any) ! File_Error

jdunkerley commented 2 weeks ago

my_workbook = Data.read blah.xlsx
-> read => single sheet name or address returns a table.
-> read_many => Vector of names and returns either a merged table or a table of tables.

read_many : Vector Text -> Headers -> Return_As -> Problem_Behavior -> Table
read_many self sheet_names:Vector=self.sheet_names (headers:Headers=..Detect_Headers) (return:Return_As=..Merged_Table) (on_problems:Problem_Behavior=..Report_Warning) =

Sheet1 A1:E5
Sheet2 A1:B3

my_workbook.expand_to_rows
SheetName  Value
Sheet1     Table
Sheet2     Table

my_workbook.expand_to_rows
SheetName  Value
Sheet1     Row
...
Sheet1     Row
Sheet2     Row
...
Sheet2     Row

expand_to_columns
SheetName  A  B  C  D  E  ...

FileName   SheetName  A  B  C  D  E
June.xlsx  Week1      1  3  5  3  1
June.xlsx  Week2      2  4  6  4  2

my_files = ["blah.xlsx", "blah2.xlsx", ...]

my_files.map .read => Vector
Data.read_many my_files ..Table_of_Tables => Table

Data.read
read : Text | URI | File -> File_Format -> Problem_Behavior -> Any ! File_Error
read path=(Missing_Argument.throw "path") format=Auto_Detect (on_problems : Problem_Behavior = ..Report_Warning) = case path of

Data.read_many paths:Vector Text format=Auto_Detect (return:Return_As=..Merged_Table) (on_problems:Problem_Behavior=..Report_Warning)

Step 0: Decode the paths argument.
Do a type class of FileManyList.

Can be a Vector of File or Text (in which case let's add a FileName column).
Could be a Column object of Texts (in which case take the file name from the column value and keep the column in the output).
Could be a Table (in which case, if it has a single text column, treat it as above; otherwise it must have a path column, matched case-insensitively). All columns are kept in the output.
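
A minimal sketch of Step 0, written in Python only as a language-neutral model (the real implementation would be Enso). Everything here is an assumption for illustration: `decode_paths` is a hypothetical helper, a Column or Table is modelled as a plain dict of column name to values, and `FileManyList` is reduced to a list of paths plus the columns to carry into the output.

```python
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class FileManyList:
    """Hypothetical normalised form of the `paths` argument."""
    paths: list
    # Columns carried through to the output, keyed by column name.
    columns: dict = field(default_factory=dict)


def decode_paths(arg):
    """Normalise a vector, column, or table of paths (illustrative only)."""
    # A Column or Table is modelled as a dict: column name -> list of values.
    if isinstance(arg, dict):
        if len(arg) == 1:
            # Single column: its values are the paths.
            name = next(iter(arg))
        else:
            # Several columns: require one named "path", ignoring case.
            matches = [k for k in arg if k.lower() == "path"]
            if len(matches) != 1:
                raise ValueError("table must have exactly one 'path' column")
            name = matches[0]
        # Every input column is kept for the output.
        return FileManyList(
            paths=list(arg[name]),
            columns={k: list(v) for k, v in arg.items()},
        )
    # A plain vector of paths: add a FileName column.
    paths = [str(p) for p in arg]
    return FileManyList(
        paths=paths,
        columns={"FileName": [PurePath(p).name for p in paths]},
    )
```

For example, `decode_paths({"PATH": ["x.csv"], "Region": ["EU"]})` keeps both columns and reads the paths from `PATH`.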

Step 1: Read the Data in.
Naive implementation is a simple each over the paths calling Data.read.
Read each file into an object.
If a file fails to read, then depending on on_problems:
- Ignore: Just add Nothing
- Report_Warning: Add a warning to final result and add Nothing
- Report_Error: Throw an error and stop processing.
Warning should have the index of the file associated as per Vector.map.
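
The naive Step 1 loop could look roughly like the following. This is a Python model, not Enso: `ProblemBehavior`, `read_many`, and the `read_one` callback are hypothetical stand-ins for Problem_Behavior, the new function, and Data.read, and `None` stands in for Nothing.

```python
import warnings
from enum import Enum


class ProblemBehavior(Enum):
    """Stand-in for Problem_Behavior (names assumed for this sketch)."""
    IGNORE = "Ignore"
    REPORT_WARNING = "Report_Warning"
    REPORT_ERROR = "Report_Error"


def read_many(paths, read_one, on_problems=ProblemBehavior.REPORT_WARNING):
    """Read each path with `read_one`, handling failures per `on_problems`."""
    results = []
    for index, path in enumerate(paths):
        try:
            results.append(read_one(path))
        except Exception as err:
            if on_problems is ProblemBehavior.REPORT_ERROR:
                # Stop processing at the first failure.
                raise RuntimeError(
                    f"failed to read file at index {index}: {path}"
                ) from err
            if on_problems is ProblemBehavior.REPORT_WARNING:
                # The warning carries the index of the failing file.
                warnings.warn(
                    f"file at index {index} ({path}) could not be read: {err}"
                )
            # Ignore and Report_Warning both record a placeholder.
            results.append(None)
    return results
```

Keeping a placeholder for failed files means the result stays aligned with the input paths, so the index in a warning can be mapped back to a file.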
Step 2: Merge the data.
Could have a Return parser SPI which takes the FileManyList and object and merges to a result.
Merged_Table or Table_of_Tables
Table_of_Tables: Table of Somethings
- First column is the file name.
- Second column is the object.
Merged_Table: Merge all the tables into one.
- For every object in the value column, we call expand to rows and expand to columns.
- 5 JSON files & 1 text file: if all are vectors then this works; if one is an object, we don't want it to expand to rows.
  Inputs: [{a:1,b:2},{a:3,b:4}]; 4; null; {a:1,b:2}; {a:3,b:4}; This is my text
  Result is a table of a, b, Value:
  - a values are 1, 3, null, null, 1, 3, null
  - b values are 2, 4, null, null, 2, 4, null
  - Value values are null, null, 4, null, null, null, This is my text
- Another special case... Excel_Workbook should support expand to rows
  - SheetName Column
  - Value Column is a set of Row objects from the expanded Sheet
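
The Merged_Table rule for mixed objects can be modelled as below. This is a Python sketch under assumptions, not the Enso implementation: `merge_objects` is a hypothetical name, JSON vectors and objects are modelled as Python lists and dicts, and the merged table is a dict of column name to values. Excel_Workbook handling is not covered.

```python
def merge_objects(objects):
    """Merge already-read objects into one column-oriented table (sketch)."""
    rows = []
    for obj in objects:
        # Only vectors expand to several rows; a lone object stays one row.
        items = obj if isinstance(obj, list) else [obj]
        for item in items:
            if isinstance(item, dict):
                # Objects expand to columns, one per key.
                rows.append(dict(item))
            else:
                # Scalars, text, and null go into a single "Value" column.
                rows.append({"Value": item})

    # Column order follows first appearance across all rows.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    # Missing cells are filled with None (standing in for Nothing/null).
    return {name: [row.get(name) for row in rows] for name in columns}
```

Called with the six inputs from the JSON and text example in the list above, this yields seven rows with columns a, b, and Value, matching the values listed there.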

Challenges:

Data is a module so having Table in it will be interesting! Method will be inside Standard.Table. How to get it in the correct place?
Integration with Data.fetch_many:
- Step 2 would be used by fetch_many (and Excel_Workbook.read_many).
- Ideally we would separate the URI requests and send them in parallel using fetch_many.
Could have an additional return option as a native Vector Any.

Tasks:

Look at expanding Excel Workbooks / Sheets inside a Table. (JD)
Initial implementation would be read_many to Vector Any (i.e. steps 0 & 1) with no support for Column/Table as paths.
Type class conversion of Column and Table to FileManyList (or better name!).
Function should be PRIVATE.
Second step to add the Table of Object and Merged Table

enso-bot[bot] commented 2 weeks ago

Radosław Waśko reports a new STANDUP for today (2024-10-31):

Progress: Implemented basic read many from vector, column or table to a vector, not tested yet but general API shape and implementation is drafted. It should be finished by 2024-11-07.

Next Day: Next day I will be working on the same task. Add tests and make sure they pass. Continue work - add new return types - simple table and merged table. Merge with excel read many. Tests.