c0defellas / enzo

core utilities

Missing uniq command #29

Open i4ki opened 7 years ago

i4ki commented 7 years ago

We need a uniq command, but how should it behave?

@katcipis @geyslan @cadicallegari

Unlike Plan 9 and GNU uniq, it should deduplicate across the entire input buffer, not only adjacent input lines. @geyslan's current implementations already work this way (#27, #28).

Below are some test cases I expect to work:

$ cat > file.txt
1
1
1
2

1
3
3
1
# by default, print every string one time (output has unique entries).
# empty lines are ignored
$ cat file.txt | uniq
1
2
3
# -dup  print only lines that are duplicated in the input
$ cat file.txt | uniq -dup
1
3
# -empty print empty lines also
$ cat file.txt | uniq -empty
1
2

3
$ 

A third option to show line numbers could be added if it does not complicate the tool.

@geyslan The current implementation does not honor these cases. I know it's not like GNU uniq, but what do you think? Does it make sense?

geyslan commented 7 years ago

@tiago4orion @katcipis @cadicallegari

I've never really used GNU uniq as it is. Whenever I wanted to extract unique or duplicate lines, I used awk or grep.

From the GNU uniq manual:

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

Adjacent lines? I think that's misguided, though I could be the one who's wrong here. :-) For it to actually work you have to run sort first, and sort already has an option for unique. LOL.

From the GNU sort manual:

-u, --unique with -c, check for strict ordering; without -c, output only the first of an equal run

Note that in either case, whether we compare all occurrences or only adjacent lines, the complexity will be O(n). Let me know what you think.

i4ki commented 7 years ago

OK, but I did not use the GNU uniq behavior in my examples. Try file.txt with our uniq implementation.

geyslan commented 7 years ago

Ok, using gnu uniq:

$ cat file.txt | uniq
1
2

1
3
1

It seems to me like a getridofadjacents, not a unique output.

However, its duplicate option seems more logical.

cat file.txt | uniq -d
1
3

Only the -d option, though, since -D messes things up.

cat file.txt | uniq -D
1
1
1
3
3
geyslan commented 7 years ago

sort -u has similar behavior to gnu uniq

cat file.txt | sort -u

1
2
3

In that case sort prints all lines, getting rid of duplicates and then sorting, or vice versa.

geyslan commented 7 years ago

@tiago4orion

@geyslan The current implementation does not honor these cases. I know it's not like GNU uniq, but what do you think? Does it make sense?

I recompiled the branch geyslan/uniq2 and it actually doesn't honor those cases. It must have been a bad merge. #28 is broken; #27 seems to be OK. I'm sorry about that. Please consider #27 for testing for now. Can we discard #28 and discuss the better implementation? I'll open a new PR soon.

See below.

geyslan commented 7 years ago

@tiago4orion I've figured out now (I was sleepy yesterday) that you're expecting a showmeonlyoneoccurrence behavior, regardless of how many times the line occurs, whether once or more. That is similar to sort -u.

@katcipis @cadicallegari

What I think is: if the -dup option inverts the logic and shows only duplicates, why shouldn't the default behavior show only the actually unique lines? That enables a single-occurrence usage, like a showmeonlyuniquelines. Of course, we can add an option to output what you expect, or just make that the default and add what I'm doing as the option instead.

So, actually #27 is broken from my point of view and #28 is OK. I'm sorry for the mess. I suggest we discuss everything here before moving on to code. I'm postponing any PR fixes.

i4ki commented 7 years ago

@geyslan Sorry for the late reply..

Yes, I've used sort -u several times instead of uniq because the output is much saner (to me), but sometimes I do not want the output actually sorted.

Can you describe with examples how you would like the tool to behave? Show input and output, as I did in the issue description.

geyslan commented 7 years ago

@tiago4orion OK, it's better to give examples so we can get it right. But first I want to paste a few definitions to make my point.

unique from Oxford Dictionary:

1 Being the only one of its kind; unlike anything else: ‘the situation was unique in British politics’ ‘original and unique designs’

unique from Online Etymology Dictionary

c. 1600, "single, solitary," from Middle French unique (16c.), from Latin unicus "only, single, sole, alone of its kind," from unus "one" (see one). Meaning "forming the only one of its kind" is attested from 1610s; erroneous sense of "remarkable, uncommon" is attested from mid-19c. Related: Uniquely; uniqueness.

So we can agree that unique describes something that is the only one (alone) in a set, right?

# Same input as above
$ cat > file.txt
1
1
1
2

1
3
3
1
# by default, print only unique/single/alone entries.
# empty lines are ignored
$ cat file.txt | uniq
2
# -dup  print only lines that are duplicated in the input
$ cat file.txt | uniq -dup
1
3
# -every print one representation of each distinct string in the input.
$ cat file.txt | uniq -every
1
2
3
# -empty print empty lines also
$ cat file.txt | uniq -empty
2

$ 

It's also possible to make -every the default and turn my default example into a -single option. Anyway, I hope you got my point.

Cheers.

geyslan commented 7 years ago

Real-world examples that came to mind:

uniq (default case) could be used to retrieve lines that have no duplicates, highlighting one-off entries.

uniq -every could be used when a file wrongly contains duplicate lines (not adjacent lines, this usage isn't covered by Enzo's uniq), stripping out those duplicates.

uniq -dup could be used to identify wasted space or misleading repetitions.

i4ki commented 7 years ago

@tiago4orion OK, it's better to give examples so we can get it right. But first I want to paste a few definitions to make my point.

OK, I got your point, but we use a single word to describe a set of features related to that word; it doesn't need to be so strict.

I liked your examples except the first. I cannot think of one use case for that. Can you provide a real world example?

My point is that features that don't have real-world usage (right now) should be dropped from the first design and implementation. What do you think? I'm not against them being developed in the future if needed, but I think that adding complexity with no advantage at the beginning won't help.

geyslan commented 7 years ago

I liked your examples except the first. I cannot think of one use case for that. Can you provide a real world example?

Here, here and here.

gnu uniq -u already does that.

-u, --unique only print unique lines

People need that usage. I needed it too, but I can't remember now what for.

i4ki commented 7 years ago

Ok, no problem.

Then can we start the implementation? Or are we missing some detail?

geyslan commented 7 years ago

Nice,

Before going back to the implementation, I would like to hear your thoughts on this understanding.

https://github.com/c0defellas/enzo/pull/28#issuecomment-270254456 and https://github.com/c0defellas/enzo/pull/28#issuecomment-270253257

Last suggestion:

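// Line groups one distinct input line with every line number where it occurs.
type Line struct {
    Text    *string
    Numbers []int
}
...
inputMap := make(map[string]Line) // lookup by line text
linesOrdered := []*Line{}         // lines in the order they were first scanned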
geyslan commented 7 years ago

You might ask why the Numbers []int, right? Well, it lets us report line numbers when the user asks for them through the -num option, which I forgot to mention above. That way the user can trace the output lines back to the input.

$ cat file.txt | uniq -num
[4] 2

$ cat file.txt | uniq -dup -num
[1 2 3 6 9] 1
[7 8] 3
geyslan commented 7 years ago

@tiago4orion @katcipis

Hello guys,

I made the changes that we discussed and implemented uniq_test.go as well, though there's no commit yet. Right now, I have doubts about how the -empty option should behave. By default, all options disregard empty lines, but -empty prints them in the same order they were scanned, regardless of their occurrence count and of the other options' idiosyncrasies. E.g.:

Input

λ> cat input
hello
world

hello

世界
世界
世
1
3
4
日本語
4
1

Unique lines and all empty lines

λ> cat input | ./uniq -empty
world

世
3
日本語

Duplicate lines and all empty lines

λ> cat input | ./uniq -dup -empty
hello

世界
1
4

Every line representation and all empty lines

λ> cat input | ./uniq -every -empty
hello
world

世界
世
1
3
4
日本語

With -num

λ> cat input | ./uniq -dup -empty -num
1,4: hello
3,5: 
3,5: 
6,7: 世界
9,14: 1
11,13: 4

So, do you think that -empty is doing the right thing by printing all occurrences, or should it behave like the other options, which print only one specific representation?

i4ki commented 7 years ago

I think it should behave like any other character: one representation.

geyslan commented 7 years ago

@tiago4orion, tks. I'll change it soon. :+1:

geyslan commented 7 years ago

@tiago4orion Done!