In #78 we started discussing some incompatibilities between the current data flow and a delayed interface. I'm opening this issue to discuss how to move forward (and a proposal of what I think might make things easiest).
Current flow
Currently we infer how to read a file from its extension, pass the file to a function, and generate an in-memory representation of the data. Sometimes we then dispatch this in-memory representation to further function(s) for input checking, extraction of the data we need, and block generation.
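To make the current flow concrete, here is a minimal sketch; all function names (`read_star`, `read_dynamo`, `make_blocks`) and the `READERS` mapping are made up for illustration, not the project's actual API:

```python
from pathlib import Path

# Hypothetical sketch of the current flow: dispatch on file extension,
# build an in-memory representation, then generate blocks from it.
def read_star(path):
    return {"source": str(path), "format": "star"}  # stand-in for a parsed table

def read_dynamo(path):
    return {"source": str(path), "format": "dynamo"}

READERS = {".star": read_star, ".tbl": read_dynamo}

def make_blocks(data):
    # stand-in for input checking, data extraction and block generation
    return [data]

def read_file(path):
    path = Path(path)
    reader = READERS[path.suffix]  # infer how to read from the extension
    data = reader(path)            # in-memory representation
    return make_blocks(data)       # checking + extraction + block creation
```

Note how everything downstream of the extension lookup happens in one shot; that coupling is the problem described below.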
Issues for creating a delayed interface
The current flow lacks separation between inferring how to extract data and DataBlock creation; these two things usually happen in one function (e.g. star file reading).
In practice, this makes it hard to implement a delayed/lazy interface for functions like this one, because the parsing logic is directly coupled to the data extraction and block generation.
Proposed solution
Having a defined process for how data gets from its source into DataBlock form would simplify adding extra features (like a lazy interface, though there may be others we haven't thought of).
My feeling is that this is best implemented as an abstract BlockReader class from which any 'data source' -> DataBlock implementation would inherit.
This class would have a defined, layered structure. What I imagine is:
1. data source -> in-memory representation
2. parsing logic, defining the output block type
3. in-memory representation -> raw data required for generation of the final DataBlock
4. any necessary transformations of the raw data into the data required for DataBlock creation
5. creation of the DataBlock
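The layered structure above could look something like this as an abstract class; this is a sketch only, and all method names (`load`, `infer_block_type`, `extract`, `transform`, `make_block`, `read`) are assumptions, not an agreed interface:

```python
from abc import ABC, abstractmethod

class BlockReader(ABC):
    """Hypothetical sketch: each layer of the proposed flow is a separate
    method, so extra logic (e.g. laziness) can be inserted at any seam."""

    @abstractmethod
    def load(self, source):
        """1. data source -> in-memory representation."""

    @abstractmethod
    def infer_block_type(self, data):
        """2. parsing logic, defining the output block type."""

    @abstractmethod
    def extract(self, data):
        """3. in-memory representation -> raw data for the final DataBlock."""

    def transform(self, raw):
        """4. optional transformations of the raw data (default: no-op)."""
        return raw

    @abstractmethod
    def make_block(self, block_type, data):
        """5. creation of the DataBlock."""

    def read(self, source):
        # Template method tying the layers together in order.
        data = self.load(source)
        block_type = self.infer_block_type(data)
        raw = self.extract(data)
        return self.make_block(block_type, self.transform(raw))
```

A concrete reader would then only fill in the abstract methods, keeping the file-type logic in plain functions that the subclass references.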
This clear definition of the separate parsing 'layers' creates many 'seams' where we can implement any extra logic we might want. A good example is lazy loading, which would be implemented at step 1. One day we might also want the ability to halt after step 3, if the transformations from raw data to the final representation involve expensive calculations? Just an idea.
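As an illustration of the step-1 seam, a lazy reader could hand back a thin proxy instead of the loaded data, deferring the actual I/O until something downstream touches it. This is a self-contained sketch with made-up names (`LazyLoader`), not a proposal for a specific library:

```python
class LazyLoader:
    """Hypothetical step-1 seam: wrap an expensive loader so the data
    source is only read the first time the data is actually needed."""

    def __init__(self, loader, source):
        self._loader = loader
        self._source = source
        self._data = None
        self.loaded = False

    def get(self):
        # Load on first access, then cache.
        if not self.loaded:
            self._data = self._loader(self._source)
            self.loaded = True
        return self._data
```

In practice this role might be played by something like `dask.delayed` rather than a hand-rolled class; the point is only that the seam exists.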
When implemented, these classes would only provide references to the simple functions required for each layer; I definitely don't think each BlockReader implementation should contain all of the logic for a given file type. Let me know what you think?
Would you then have subclasses for each block type that implement common functionality?
To implement something like this, it's important that we redesign functions like the star file reader so that they check the "header" of the file in a computationally cheap way, so we can retain this level of pre-checking and avoid wasted computation. Unfortunately, loading star files/dynamo tables is not as trivial as loading an image with a fixed, well-defined header format.
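One cheap way to pre-check a STAR-like file might be to scan only the first few lines for markers such as `data_` or `loop_` rather than parsing the whole file. The function name and line limit here are made up for illustration:

```python
def looks_like_star(path, max_lines=100):
    # Hypothetical cheap pre-check: scan only the start of the file
    # for STAR block/loop markers instead of parsing it fully.
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            stripped = line.strip()
            if stripped.startswith(("data_", "loop_")):
                return True
    return False
```

Because only the head of the file is touched, this keeps the pre-check O(1)-ish in file size, which is exactly the "computationally cheap" property wanted above.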
I'm still not sold on the object-oriented approach and its benefits... In my view, a reader never needs to hold state, but I guess this is not that important.