Closed geyslan closed 4 years ago
@geyslan I thought we had a O(n) solution in both complexity and space once in our last discussion. No?
Depending on sort.Ints annoys me. The stable sort solution is guaranteed to be logarithmic but ... very generic and could change in the future. https://golang.org/src/sort/sort.go?s=7612:7630#L487
The O(n) solution was broken in any way? Or it does not support some of the required features?
@tiago4orion the first solution (the simple one) that we have discussed is not broken. This PR is just for comparison with the last #27, I really do not consider this as a useful merge too. It's here just for analysis. I should have mentioned it in the #27 instead.
The one that you did refer was the first. Its time complexity is O(n) but its space depends on the input size O(2 len(n)). I know that _2 _ in the big-O should be disregarded, but the length of all inputs being duplicated seemed to me a waste of space.
That duplicate space is due to the repetition of the string in the linesMap
andlinesOrdered
LinesMap: = make(map[string]int)
LinesOrdered: = []string{}
To avoid that waste (e.g. a huge input stream) in the #27 I used three maps, but see that in only one inputMap
of these We have the real string. The other two lineNumsMap
lineMap
works as pointers to the main map. I realized that we do not need an extra ordered list if we have a map with key int (line) and value pointing to the actual string.
InputMap: = make(map[string] *string)
LineNumsMap: = make(map[*string] []int)
LineMap: = make(map[int] *string)
You can tell me if their names (maps) reflect what they represent, I do not know.
Using that approach we can fill only one map with unique strings. I still have to think how to assure the allocation of the maps in the heap.
Things need to be changed, of course, let me know what you think.
package main
import (
"bufio"
"io"
"os"
)
type options struct {
printDuplicates bool
printEmptyLines bool
stringDelimiter byte
}
func scanLines(input *os.File, opts options) (map[string]int, []string) {
linesMap := make(map[string]int)
linesOrdered := []string{}
reader := bufio.NewReader(input)
for {
line, err := reader.ReadBytes(opts.stringDelimiter)
if err == io.EOF {
break
}
lineStr := string(line)
linesOrdered = append(linesOrdered, lineStr)
linesMap[lineStr]++
}
return linesMap, linesOrdered
}
func writeStdout(line string) {
_, err := os.Stdout.WriteString(line)
if err != nil {
// what to do?
os.Exit(1)
}
}
func printLines(linesMap map[string]int, linesOrdered []string, opts options) {
for _, line := range linesOrdered {
if linesMap[line] == 0 {
continue
}
if opts.printEmptyLines && line == string(opts.stringDelimiter) {
writeStdout(line)
continue
}
if opts.printDuplicates {
if linesMap[line] < 2 {
goto discardAndContinue
}
} else if linesMap[line] != 1 {
goto discardAndContinue
}
writeStdout(line)
discardAndContinue:
linesMap[line] = 0
}
}
func parseArgs(args []string) options {
var opts options
opts.stringDelimiter = byte('\n')
for _, opt := range args[1:] {
switch opt {
case "-dup":
opts.printDuplicates = true
case "-empty":
opts.printEmptyLines = true
case "-nul":
opts.stringDelimiter = 0x00
default:
writeStdout("Usage:\n")
writeStdout("uniq [-dup | -empty | -nul]\n")
os.Exit(1)
}
}
return opts
}
func main() {
opts := parseArgs(os.Args)
linesMap, linesArr := scanLines(os.Stdin, opts)
printLines(linesMap, linesArr, opts)
}
@geyslan Regarding the O(n) implementation, I think we can change it to use string pointers in the linesOrdered array also, to avoid waste the double of memory. Using three maps instead of ordered slice requires sorting and complicates the logic.
I propose a third PR with the O(n) implementation and we can evaluate run time and memory usage.
Take a look in the rules of the Complexity section of Rob Pike C style: http://doc.cat-v.org/bell_labs/pikestyle
@tiago4orion did you notice that using 3 maps we don't need to sort anything? The catch is to iterate from 1 until the line count and test if this number exists in the lineMap
, using i
as the key.
Concerning your suggestion to use a slice with pointer to string in the map, I think it can be done, but using a structure to count occurrences.
type struct Line {
Text *string
Count int
}
...
inputMap := make(map[string]Line)
linesOrdered := []*Line
The inputMap
is necessary only when scanning which lets the results ordered in the linesOrdered
for later use.
The behavior (like GNU uniq) is not expected indeed. It was intentional. Take a look at https://github.com/c0defellas/enzo/issues/29#issuecomment-270253150.
About the new PR, NP. Just waiting more reviews/discussions.
For keep track of the line numbers we could use:
type struct Line {
Text *string
Numbers []int
}
Please, let's discuss this on https://github.com/c0defellas/enzo/issues/29 before any changes.
Hello folks. :smiley:
I did a rebase and push some changes. If it's ok there's still a conundrum to solve https://github.com/c0defellas/enzo/issues/29#issuecomment-274259610 before merging.
@geyslan Awesome =)
I'm very busy right now, sorry! But I'll try to take a look soon (maybe tomorrow).
I have a real use case for uniq tomorrow. I'll give feedback soon.
:)
This PR is for comparison with #27
Signed-off-by: Geyslan G. Bem geyslan@gmail.com