Stream from stdin rather than doing `stdin.readlines()`

pjvandehaar commented 7 years ago

Currently, tabview doesn't work well when used with large or unending files. For example, `cat /dev/urandom | tr -cd "fish,\n" | tabview -` doesn't work.

I'd like for stdin to only be read as needed. Maybe this will also let tabview display iterators when used from Python.

(Do you know of any other command-line CSV viewer that streams its input to handle large files? I haven't found one.)

Changes that will be needed:

- `process_data` needs to be a generator. Then `view()` will do `data_processor = process_data(...)`. `Viewer` will do `csv_data.append(next(data_processor))` when it reaches the end of `csv_data`.
- `detect_encoding()` will be run on the first 1000 lines to determine `enc`. After those lines are exhausted, `detect_encoding()` will be run on each new line, updating `enc` if needed.
- `pad_data()` can't happen in `process_data`. `Viewer` will run `csv_data = pad_data(csv_data)` if a new line from `data_processor` is longer than `self.num_data_columns`.
- `Viewer` needs to have a few minor changes.
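A minimal sketch of the first and third changes, to make the idea concrete. It is not tabview's actual code: the one-argument `process_data`, the `LazyRows` class and its `ensure_row` method are hypothetical stand-ins for the real `process_data` and `Viewer`, and the padding is done inline rather than through `pad_data()`.

```python
import csv
import io


def process_data(lines):
    """Yield one parsed row at a time instead of returning a full list.

    Simplified, assumed signature; the real function takes more arguments.
    """
    for row in csv.reader(lines):
        yield row


class LazyRows:
    """Hypothetical stand-in for the part of Viewer that owns csv_data."""

    def __init__(self, data_processor):
        self.data_processor = data_processor
        self.csv_data = []
        self.num_data_columns = 0

    def ensure_row(self, index):
        """Read just enough rows for csv_data[index] to exist, if possible."""
        while len(self.csv_data) <= index:
            try:
                row = next(self.data_processor)
            except StopIteration:
                return False  # input ended before reaching this row
            if len(row) > self.num_data_columns:
                # A wider row arrived: re-pad everything read so far.
                self.num_data_columns = len(row)
                for old in self.csv_data:
                    old.extend([""] * (self.num_data_columns - len(old)))
            # Pad the new row up to the current width (no-op if already wide enough).
            row.extend([""] * (self.num_data_columns - len(row)))
            self.csv_data.append(row)
        return True


rows = LazyRows(process_data(io.StringIO("a,b\nc,d,e\nf\n")))
rows.ensure_row(1)  # reads only the first two lines
print(rows.csv_data)
```

The viewer would call something like `ensure_row` whenever the cursor or the visible window moves past the last row it has read.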

Forward searching will still work, and will just rapidly consume lines from data_processor.

When the user tries to sort, Viewer will do csv_data.extend(data_processor), which might take too long or possibly forever. User's problem.
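Roughly, forward search and sort could interact with the generator as in the sketch below. The function names `search_forward` and `sort_by_column` are made up for illustration, and the search is simplified to start at the first row rather than at the cursor.

```python
def search_forward(csv_data, data_processor, needle):
    """Return the index of the first row containing needle, reading lazily."""
    # First look through the rows that have already been read.
    for index, row in enumerate(csv_data):
        if any(needle in cell for cell in row):
            return index
    # Then keep pulling rows from the stream until a match or end of input.
    for row in data_processor:
        csv_data.append(row)
        if any(needle in cell for cell in row):
            return len(csv_data) - 1
    return None


def sort_by_column(csv_data, data_processor, column):
    """Sort in place; this needs every row, so the stream is drained first."""
    csv_data.extend(data_processor)  # never returns on an unending input
    csv_data.sort(key=lambda row: row[column])


csv_data = []
data_processor = iter([["b", "2"], ["a", "1"], ["c", "3"]])
found = search_forward(csv_data, data_processor, "a")
rows_read_by_search = len(csv_data)
sort_by_column(csv_data, data_processor, 0)
```

Here the search stops after two rows, while the sort has to read the third one before it can do anything.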

Later on, it'd be fun to make mode and max column widths update as new data is read in, by storing the collections.Counter(), updating it for each new line, and updating self.column_width as needed.
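A sketch of what the running width statistics might look like, with one `collections.Counter` per column. `ColumnWidths` is a hypothetical helper, and it assumes width means `len(cell)`, which may not match how tabview measures display width.

```python
from collections import Counter


class ColumnWidths:
    """Hypothetical helper: one Counter of cell lengths per column."""

    def __init__(self):
        self.counters = []

    def update(self, row):
        """Fold one newly read row into the per-column counts."""
        # Grow the list of counters if this row has more columns than seen so far.
        while len(self.counters) < len(row):
            self.counters.append(Counter())
        for counter, cell in zip(self.counters, row):
            counter[len(cell)] += 1

    def max_widths(self):
        # The keys of each Counter are the lengths seen in that column.
        return [max(counter) for counter in self.counters]

    def mode_widths(self):
        # most_common(1) gives the most frequent length in that column.
        return [counter.most_common(1)[0][0] for counter in self.counters]


widths = ColumnWidths()
for row in [["a", "bbb"], ["aa", "bbb"], ["a", "b"]]:
    widths.update(row)
print(widths.max_widths(), widths.mode_widths())
```

Each new row costs one counter update per cell, so the widths can be refreshed without rescanning `csv_data`.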

The problems this introduces are:

- When `self.column_width_mode` is `max` or `mode`, the width won't reflect rows that haven't been read yet.
- Early lines could be legal as both utf8 and latin1. But maybe later lines would be illegal as utf8, meaning that the earlier ones should have been interpreted as latin1. But now we've already decoded them, looked at them, and forgotten the original binary data. Is this likely?
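The encoding problem can be reproduced with two short byte strings. The sketch below also shows one possible mitigation, which is an assumption and not part of the plan above: keep the raw bytes so that earlier lines can be decoded again. `decodes_as` and `decode_all` are illustrative names and do not come from tabview.

```python
def decodes_as(raw, enc):
    """Return True if raw is legal in the given encoding."""
    try:
        raw.decode(enc)
        return True
    except UnicodeDecodeError:
        return False


def decode_all(raw_lines):
    """Decode all retained lines with one encoding: UTF-8 if every line allows it."""
    if all(decodes_as(raw, "utf-8") for raw in raw_lines):
        enc = "utf-8"
    else:
        enc = "latin-1"
    return enc, [raw.decode(enc) for raw in raw_lines]


early = b"caf\xc3\xa9,1"  # legal as UTF-8 and as Latin-1
late = b"ni\xf1o,2"       # legal as Latin-1 only

print(decode_all([early]))        # only the early line seen: UTF-8 is chosen
print(decode_all([early, late]))  # the late line flips the choice for both lines
```

Without the raw bytes, the early line would stay decoded as UTF-8 even after the later line showed that the file was more likely Latin-1.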

If you consider these drawbacks quite bad, I'd be happy with a flag --stream.

If I start work on a PR, do you have any recommendations?

wavexx commented 7 years ago

On Wed, Nov 30 2016, Peter VandeHaar wrote:

> I'd like for stdin to only be read as needed. This could probably also let tabview work with iterators when used from Python, but I don't use that so I don't know.

This is hard with the current code.

> (Do you know of another tool that does this? I haven't found one.)

Never found one, even though that's something I'd like as well.

You could actually cheat with a buffering program that, besides buffering, sends EOF at regular intervals (so that you could just reload the file live in tabview), but the ones that I know don't do strictly that.

> The problems this introduces are:
>
> When self.column_width_mode is max or mode, the width won't reflect rows that haven't been read yet.

This wouldn't be a problem really, if you show what's going on. Triggering a recalculation is generally more user-friendly than auto-sizing the columns randomly.

> If I start work on a PR, do you have recommendations?

Godspeed? ;)

I'm not sure I fully understood the implementation details. As far as I understood, your plan is just to keep appending to csv_data directly.

In this case, I would keep an initial buffer in the generator to perform the encoding detection and padding, which is unrelated to what the viewer is doing.

The viewer shouldn't be concerned with any part of the reading process. Just provide it with a matrix to show. This way, as the data comes in, you can append to csv_data in chunks and update the internal state as little as possible.
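One way to read this suggestion, as a sketch only: the generator buffers an initial block of raw lines, picks the encoding from it, and then hands the viewer ready-made chunks of padded rows. `read_chunks`, `guess_encoding` and both size parameters are hypothetical; `guess_encoding` is a crude stand-in for tabview's `detect_encoding()`.

```python
import csv
import itertools


def guess_encoding(raw_lines):
    """Stand-in for detect_encoding(): try UTF-8, else fall back to Latin-1."""
    try:
        for raw in raw_lines:
            raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def read_chunks(raw_lines, detect_size=1000, chunk_size=500):
    """Yield lists of padded rows; the viewer never sees raw bytes."""
    raw_lines = iter(raw_lines)
    # Initial buffer used only for encoding detection.
    head = list(itertools.islice(raw_lines, detect_size))
    enc = guess_encoding(head)
    # Replay the buffered head, then continue with the rest of the stream.
    # A later line that is illegal in enc raises UnicodeDecodeError here.
    decoded = (raw.decode(enc) for raw in itertools.chain(head, raw_lines))
    reader = csv.reader(decoded)
    width = 0
    while True:
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            return
        # Pad to the widest row seen so far; earlier chunks may be narrower.
        width = max(width, max(len(row) for row in chunk))
        yield [row + [""] * (width - len(row)) for row in chunk]


chunks = list(read_chunks([b"a,b\n", b"c\n", b"d,e,f\n"], detect_size=2, chunk_size=2))
print(chunks)
```

With this split the viewer only does `csv_data.extend(chunk)`, and a caller that already has a matrix in memory can bypass `read_chunks` altogether.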

Looking at it the other way around: if you already have a data structure you want to show in tabview (when it is used as a module), you'd want to skip this whole process entirely.

pjvandehaar commented 7 years ago

1. You run `tabview myhugefile.csv`.
2. You press `c` to get mode-widths.
3. You scroll down a bunch.
4. You press `c` again.
- Now does tabview recalculate the mode width and show that?
- Or does it switch to max width, since it already showed mode-width? (the current behavior)

TabViewer / tabview

Stream from stdin rather than doing `stdin.readlines()` #138