c0defellas / enzo

core utilities
11 stars 2 forks source link

add uniq tool #28

Closed geyslan closed 4 years ago

geyslan commented 7 years ago

This PR is for comparison with #27

Signed-off-by: Geyslan G. Bem geyslan@gmail.com

i4ki commented 7 years ago

@geyslan I thought we had a O(n) solution in both complexity and space once in our last discussion. No?

Depending on sort.Ints annoys me. The stable sort solution is guaranteed to be logarithmic but ... very generic and could change in the future. https://golang.org/src/sort/sort.go?s=7612:7630#L487

The O(n) solution was broken in any way? Or it does not support some of the required features?

geyslan commented 7 years ago

@tiago4orion the first solution (the simple one) that we have discussed is not broken. This PR is just for comparison with the last #27, I really do not consider this as a useful merge too. It's here just for analysis. I should have mentioned it in the #27 instead.

The one that you did refer was the first. Its time complexity is O(n) but its space depends on the input size O(2 len(n)). I know that _2 _ in the big-O should be disregarded, but the length of all inputs being duplicated seemed to me a waste of space.

That duplicate space is due to the repetition of the string in the linesMap andlinesOrdered

LinesMap: = make(map[string]int)
LinesOrdered: = []string{}

To avoid that waste (e.g. a huge input stream) in the #27 I used three maps, but see that in only one inputMap of these We have the real string. The other two lineNumsMap lineMap works as pointers to the main map. I realized that we do not need an extra ordered list if we have a map with key int (line) and value pointing to the actual string.

InputMap: = make(map[string] *string)
LineNumsMap: = make(map[*string] []int)
LineMap: = make(map[int] *string)

You can tell me if their names (maps) reflect what they represent, I do not know.

Using that approach we can fill only one map with unique strings. I still have to think how to assure the allocation of the maps in the heap.

Things need to be changed, of course, let me know what you think.

first:

package main

import (
    "bufio"
    "io"
    "os"
)

type options struct {
    printDuplicates bool
    printEmptyLines bool
    stringDelimiter byte
}

func scanLines(input *os.File, opts options) (map[string]int, []string) {
    linesMap := make(map[string]int)
    linesOrdered := []string{}
    reader := bufio.NewReader(input)

    for {
        line, err := reader.ReadBytes(opts.stringDelimiter)
        if err == io.EOF {
            break
        }
        lineStr := string(line)
        linesOrdered = append(linesOrdered, lineStr)
        linesMap[lineStr]++
    }

    return linesMap, linesOrdered
}

func writeStdout(line string) {
    _, err := os.Stdout.WriteString(line)
    if err != nil {
        // what to do?
        os.Exit(1)
    }
}

func printLines(linesMap map[string]int, linesOrdered []string, opts options) {
    for _, line := range linesOrdered {
        if linesMap[line] == 0 {
            continue
        }
        if opts.printEmptyLines && line == string(opts.stringDelimiter) {
            writeStdout(line)
            continue
        }
        if opts.printDuplicates {
            if linesMap[line] < 2 {
                goto discardAndContinue
            }
        } else if linesMap[line] != 1 {
            goto discardAndContinue
        }
        writeStdout(line)
    discardAndContinue:
        linesMap[line] = 0
    }
}

func parseArgs(args []string) options {
    var opts options
    opts.stringDelimiter = byte('\n')

    for _, opt := range args[1:] {
        switch opt {
        case "-dup":
            opts.printDuplicates = true
        case "-empty":
            opts.printEmptyLines = true
        case "-nul":
            opts.stringDelimiter = 0x00
        default:
            writeStdout("Usage:\n")
            writeStdout("uniq [-dup | -empty | -nul]\n")
            os.Exit(1)
        }
    }
    return opts
}

func main() {
    opts := parseArgs(os.Args)
    linesMap, linesArr := scanLines(os.Stdin, opts)
    printLines(linesMap, linesArr, opts)
}
i4ki commented 7 years ago

@geyslan Regarding the O(n) implementation, I think we can change it to use string pointers in the linesOrdered array also, to avoid waste the double of memory. Using three maps instead of ordered slice requires sorting and complicates the logic.

I propose a third PR with the O(n) implementation and we can evaluate run time and memory usage.

Take a look in the rules of the Complexity section of Rob Pike C style: http://doc.cat-v.org/bell_labs/pikestyle

geyslan commented 7 years ago

@tiago4orion did you notice that using 3 maps we don't need to sort anything? The catch is to iterate from 1 until the line count and test if this number exists in the lineMap, using i as the key.

Take a look: https://github.com/c0defellas/enzo/pull/27/commits/06cc033e4ce7b554750a92f2fd541ee4c745a875#diff-d2c7a5b53893853781f228b0716c857fR68

Concerning your suggestion to use a slice with pointer to string in the map, I think it can be done, but using a structure to count occurrences.

type struct Line {
    Text *string
    Count int
}
...
inputMap := make(map[string]Line)
linesOrdered := []*Line

The inputMap is necessary only when scanning which lets the results ordered in the linesOrdered for later use.

The behavior (like GNU uniq) is not expected indeed. It was intentional. Take a look at https://github.com/c0defellas/enzo/issues/29#issuecomment-270253150.

About the new PR, NP. Just waiting more reviews/discussions.

geyslan commented 7 years ago

For keep track of the line numbers we could use:

type struct Line {
    Text *string
    Numbers []int
}
geyslan commented 7 years ago

Please, let's discuss this on https://github.com/c0defellas/enzo/issues/29 before any changes.

geyslan commented 7 years ago

Hello folks. :smiley:

I did a rebase and push some changes. If it's ok there's still a conundrum to solve https://github.com/c0defellas/enzo/issues/29#issuecomment-274259610 before merging.

i4ki commented 7 years ago

@geyslan Awesome =)

I'm very busy right now, sorry! But I'll try to take a look soon (maybe tomorrow).

i4ki commented 7 years ago

I have a real use case for uniq tomorrow. I'll give feedback soon.

:)