CiscoCloud / distributive

Unit testing for the cloud
Apache License 2.0

Intensely good tabular data processing #34

Closed langston-barrett closed 9 years ago

langston-barrett commented 9 years ago

Distributive deals with a lot of tabular data. With methods like strIn, strContainedIn, reIn, commandColumnNoHeader, it's clear that this could be abstracted even further. My proposal is as follows:

We need one unified method for splitting a tabular string into a 2D slice. It should detect which regexp to use by judging the consistency of the row widths (assuming that a correct split produces even rows), perhaps by measuring the standard deviation of the widths. This would eliminate the need for separateString, stringToSlice, and stringToSliceMultispace. Possible regexps: for rows: "\n+"; for columns: "\\s{2,}", "\\s+", "\t+".

In conjunction, we need a method that fetches a column (sans header) by its header title. This should be super simple and will totally prettify the code.

For organization, this will all go into another Go package: tabular.go

langston-barrett commented 9 years ago

Ideas for an algorithm for splitting arbitrary data into a table: try different regexps, counting the length of each row. If one of them yields rows that are all the same length, use that. If not, toss one outlier and test again. If there still isn't one, just pick the regexp with the lowest standard deviation of row lengths.

Keep in mind that this algorithm must, at its heart, be probabilistic, which might be an issue for commands with widely varying outputs.
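The three passes described above could be sketched roughly as follows. This is only one reading of the idea, not the implementation that was merged: the function names are hypothetical, the separators are tried from strictest to loosest, and a candidate that leaves every row as a single cell is ignored (both assumptions, since otherwise a separator that never matches would always look perfectly even).

```go
package main

import (
	"math"
	"regexp"
	"strings"
)

var rowSeparator = regexp.MustCompile(`\n+`)

// Column separators from the proposal, ordered strictest first (assumed order).
var columnSeparators = []*regexp.Regexp{
	regexp.MustCompile(`\s{2,}`),
	regexp.MustCompile(`\t+`),
	regexp.MustCompile(`\s+`),
}

// splitWith splits s into rows, then splits each non-blank row with colSep.
func splitWith(s string, colSep *regexp.Regexp) [][]string {
	var table [][]string
	for _, row := range rowSeparator.Split(strings.TrimSpace(s), -1) {
		if row = strings.TrimSpace(row); row != "" {
			table = append(table, colSep.Split(row, -1))
		}
	}
	return table
}

// modeWidth returns the most common row width and the number of rows that
// deviate from it. Ties are broken in favor of the wider width.
func modeWidth(table [][]string) (mode, outliers int) {
	counts := map[int]int{}
	for _, row := range table {
		counts[len(row)]++
	}
	best := 0
	for w, c := range counts {
		if c > best || (c == best && w > mode) {
			mode, best = w, c
		}
	}
	return mode, len(table) - best
}

// widthStdDev returns the population standard deviation of the row widths.
func widthStdDev(table [][]string) float64 {
	if len(table) == 0 {
		return 0
	}
	var sum, sq float64
	for _, row := range table {
		sum += float64(len(row))
	}
	mean := sum / float64(len(table))
	for _, row := range table {
		d := float64(len(row)) - mean
		sq += d * d
	}
	return math.Sqrt(sq / float64(len(table)))
}

// SplitTable guesses the column separator for s and returns the 2D slice.
func SplitTable(s string) [][]string {
	candidates := make([][][]string, len(columnSeparators))
	for i, sep := range columnSeparators {
		candidates[i] = splitWith(s, sep)
	}
	// Pass 1: a candidate whose rows all have the same width (more than one cell).
	for _, t := range candidates {
		if mode, out := modeWidth(t); mode > 1 && out == 0 {
			return t
		}
	}
	// Pass 2: tolerate exactly one outlier row; the row is kept in the result.
	for _, t := range candidates {
		if mode, out := modeWidth(t); mode > 1 && out == 1 && len(t) > 2 {
			return t
		}
	}
	// Pass 3: fall back to the multi-column candidate with the lowest deviation.
	var best [][]string
	bestDev := math.Inf(1)
	for _, t := range candidates {
		if mode, _ := modeWidth(t); mode < 2 {
			continue
		}
		if dev := widthStdDev(t); dev < bestDev {
			best, bestDev = t, dev
		}
	}
	if best == nil {
		return candidates[0] // nothing split into columns; return rows as-is
	}
	return best
}
```

Because the choice rests on how even the rows look, the sketch can still guess wrong, for example when cells contain single spaces and the columns are also separated by single spaces.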

langston-barrett commented 9 years ago

Done with both; all that's left is to use them everywhere they apply.