brisvag / blik

Python tool for visualising and interacting with cryo-ET and subtomogram averaging data.
https://brisvag.github.io/blik/
GNU General Public License v3.0

Data parsing flow/architecture #79

Closed alisterburt closed 3 years ago

alisterburt commented 3 years ago

In #78 we started discussing some incompatibilities between the current data flow and a delayed interface. I'm opening this issue to discuss how to move forward, along with a proposal for what I think might make things easiest.

Current flow

Currently we infer how to read a file from its extension, pass the file to a reader function, and generate an in-memory representation of the data. Sometimes we then dispatch this in-memory representation to further function(s) that check the input, extract the data we need, and generate blocks.

Issues for creating a delayed interface

The current flow lacks separation between inferring how to extract data and DataBlock creation. Usually these two things happen in one function (e.g. star file reading).

In practice, this makes it hard to implement a delayed/lazy interface for functions like this one, because the parsing logic is directly coupled to the data extraction and block generation.

Proposed solution

Having a defined process for how data gets from its source into DataBlock form would simplify the addition of extra features (like a lazy interface, although there may be others we haven't thought of).

My feeling is that this is best implemented as an abstract BlockReader class from which any 'data source' -> DataBlock implementation would inherit.

This class would have a defined, layered structure. What I imagine is:

  1. data source -> in-memory representation
  2. parsing logic, defining output block type
  3. in-memory representation -> raw data required for generation of final DataBlock
  4. any necessary transformations of raw data to data required for DataBlock creation
  5. creation of DataBlock
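
To make the five layers concrete, here is a minimal sketch of what such a base class could look like. Everything in it is an assumption for illustration: the method names (`load`, `infer_block_type`, `extract`, `transform`, `create_block`), the stand-in `DataBlock` dataclass and the toy `CsvPointReader` are hypothetical and are not blik's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class DataBlock:
    """Hypothetical stand-in for a real DataBlock."""
    kind: str
    data: Any


class BlockReader(ABC):
    """Hypothetical base class: one method per layer, chained by read()."""

    def __init__(self, source):
        self.source = source

    @abstractmethod
    def load(self):
        """1. data source -> in-memory representation."""

    @abstractmethod
    def infer_block_type(self, loaded):
        """2. parsing logic, defining the output block type."""

    @abstractmethod
    def extract(self, loaded):
        """3. in-memory representation -> raw data."""

    def transform(self, raw):
        """4. transformations of raw data (identity by default)."""
        return raw

    def create_block(self, kind, data):
        """5. creation of the DataBlock."""
        return DataBlock(kind=kind, data=data)

    def read(self):
        # Each call below is a 'seam' where extra logic could be inserted.
        loaded = self.load()
        kind = self.infer_block_type(loaded)
        raw = self.extract(loaded)
        return self.create_block(kind, self.transform(raw))


class CsvPointReader(BlockReader):
    """Toy reader (illustration only): parses 'x,y,z' lines from a string."""

    def load(self):
        return self.source.strip().splitlines()

    def infer_block_type(self, loaded):
        return "points"

    def extract(self, loaded):
        return [line.split(",") for line in loaded]

    def transform(self, raw):
        return [tuple(float(v) for v in row) for row in raw]


block = CsvPointReader("1,2,3\n4,5,6\n").read()
print(block)
```

Because `read()` only chains the layers, a subclass overrides just the steps that differ for its file type, and the base class is the single place to add cross-cutting behaviour.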

This clear definition of the separate parsing 'layers' creates many 'seams' where we can implement any extra logic we might want. A good example is lazy loading, which would be implemented at step 1. One day we might also want the ability to halt after step 3, if the transformations needed to go from raw data to the final representation involve expensive calculations. Just an idea.
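
As a rough illustration of lazy loading at step 1, the sketch below defers the 'data source -> in-memory representation' call until the data is first accessed. The `LazySource` class, the `fake_loader` function and the file name are hypothetical, made up for this example; only `functools.cached_property` (Python 3.8+) is a real standard-library feature.

```python
from functools import cached_property


class LazySource:
    """Hypothetical wrapper that defers step 1 until the data is needed."""

    def __init__(self, path, loader):
        self.path = path
        self._loader = loader  # plain function: path -> in-memory representation

    @cached_property
    def loaded(self):
        # Runs the loader on first access only; later accesses reuse the result.
        return self._loader(self.path)


calls = []  # records every time the loader actually runs


def fake_loader(path):
    """Stand-in for a real file reader (illustration only)."""
    calls.append(path)
    return {"path": path, "values": [1, 2, 3]}


source = LazySource("particles.star", fake_loader)
calls_before_access = len(calls)  # nothing has been read yet

first = source.loaded   # triggers the loader
second = source.loaded  # served from the cache, loader not called again
```

The rest of the pipeline only ever touches `source.loaded`, so it does not need to know whether the read has already happened.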

When implemented, these classes will only hold references to the simple functions required for each layer. I definitely don't think each BlockReader implementation should contain all of the logic for a given file type.

Let me know what you think!

brisvag commented 3 years ago

Some points/questions:

As a general idea, I agree with you!

brisvag commented 3 years ago

Closed with #84.