exa-analytics / exa

The exa framework for data management, processing, and visualization
https://exa-analytics.github.io/exa
Apache License 2.0

Design: What is required to convert the Editor to be more memory friendly? #179

Closed tjduigna closed 4 years ago

tjduigna commented 4 years ago

Related to #166

Consider the following Editor use cases:

Passing the equivalent of a str to the Editor means the underlying data is already entirely in memory and can be processed as it is now.

If the Editor is given a file on disk, it should assume the file is too large to read into memory in one shot. It therefore makes sense to "front-load" search parameters and try to limit processing to a single scan through the file (driven by a generator), while still accommodating multiple passes through the file for convenience in exploration.
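A minimal sketch of what a front-loaded, single-pass scan might look like. The names `scan_lines` and `scan_file` and the pattern dictionary are hypothetical and not part of the current Editor API; the only assumption is that the search parameters can be expressed as regular expressions.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def scan_lines(
    lines: Iterable[str], patterns: Dict[str, str]
) -> Iterator[Tuple[str, int, str]]:
    """Yield (pattern_name, line_number, line) for every match, in one pass.

    Hypothetical helper: all search patterns are supplied up front so the
    input only has to be iterated once.
    """
    # Compile every pattern once, before touching the data.
    compiled = {name: re.compile(expr) for name, expr in patterns.items()}
    for lineno, line in enumerate(lines):
        for name, regex in compiled.items():
            if regex.search(line):
                yield name, lineno, line.rstrip("\n")


def scan_file(
    path: str, patterns: Dict[str, str], encoding: str = "utf-8"
) -> Iterator[Tuple[str, int, str]]:
    """Lazily scan a file on disk; only one line is held in memory at a time."""
    with open(path, encoding=encoding) as handle:
        yield from scan_lines(handle, patterns)
```

The in-memory str case can reuse the same code path via `scan_lines(text.splitlines(), patterns)`, and calling `scan_file` again simply reopens the file for another pass.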

avmarchenko commented 4 years ago

Yes, everything about that makes sense to me. Other things to consider here, depending on scope, might be parallel chunk reading or caching the head of the file?
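For illustration only, both ideas could be sketched roughly as follows. `read_head` and `iter_chunks` are hypothetical names, not anything that exists in exa, and the chunks are produced serially here; handing them to parallel workers is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_head(path: str, nlines: int = 100, encoding: str = "utf-8") -> List[str]:
    """Return at most the first `nlines` lines of a file (hypothetical helper).

    The result is small enough to keep cached for quick inspection
    without reading the rest of the file.
    """
    with open(path, encoding=encoding) as handle:
        return list(islice(handle, nlines))


def iter_chunks(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of up to `size` lines (hypothetical helper).

    Each chunk is an independent unit of work that could be passed to a
    worker; only one chunk is materialized at a time.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```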

tjduigna commented 4 years ago

I don't know if we want to write the parallelization code in-house, but if we can provide a decent abstraction layer over another library (dask or any other), we could keep that in scope.

tjduigna commented 4 years ago

Here are my new thoughts on the Editor. I think it was never really intended to be a "scalable data IO solution", and trying to fit that into an otherwise useful API may end up limiting its usefulness for either purpose. I think we probably still need to update it slightly to fit with Data, but I am starting to believe that a more scalable text file IO API belongs elsewhere.

tjduigna commented 4 years ago

Closing this issue in favor of moving in a direction where the Editor API stays more or less intact. This will need to be revisited when converting Editors to work with the new Data class.