data-lessons / library-shell-DEPRECATED

Unix shell lesson for librarians NOW MOVED > https://github.com/LibraryCarpentry/lc-shell
https://github.com/LibraryCarpentry/lc-shell
Other
9 stars 19 forks source link

Patch 1 #54

Open bpnx opened 7 years ago

bpnx commented 7 years ago

Hello, I think that 3 examples should be changed: after a "uniq -c" command a "-n" flag on sort command is very useful. About sed command: I think that removing spaces or normalizing them is a very common need and the example using sed could be appreciated, in my opinion

drjwbaker commented 7 years ago

FYI: I have a bigger PR coming on this (probably next week)

bpnx commented 7 years ago

Hello, sed is used because that tr command generates blank lines in case of repeated spaces, and normally it's not good; in a single command sed substitutes spaces and tabs, repeated and mixed a gives a clean ouput. Since basic regular expressions are covered on episode 02 and sed is used in this episode, I think that it's useful for people to see that example of sed usage. I don't think it's intimidating, no more than: grep -Eo '\d{4}-\d{4}' 2014-01_JA.tsv (taken fron episode 02)

weaverbel commented 7 years ago

OK, I suppose this makes sense. I can merge if other people agree. @drjwbaker ?

drjwbaker commented 7 years ago

Agreed. Looks a great change.

danmichaelo commented 7 years ago

The command looks intimidating, but agree that it's not worse than other stuff in this episode. It's also more realistic, and perhaps it's good to stress regexps rather than introducing yet another command (tr, which I at least never use in practice myself). So +1

Perhaps change "repeated space" to "repeated whitespace"?

Also, is it necessary/correct to escape + as \+ here?

bpnx commented 7 years ago

Hello, I'm changing "repeated space" to "repeated whitespace", thank you. Escaping + is necessary, otherwise sed treats it as a simple character (I'm sure someone knows why :-) )

danmichaelo commented 7 years ago

Ouch, \s doesn't work on BSD/Mac: https://superuser.com/a/112837/56154

Even if we use the [[:space:]] group instead, it still fails to insert newlines: https://stackoverflow.com/questions/10748453/replace-comma-with-newline-in-sed

Test:

$ echo "Hello world" | sed 's/[[:space:]]/\n/g'
Hellonworld

Perhaps we need to stick with tr anyways?

bpnx commented 7 years ago

Hello, it's because of different sed versions, the one I've used in example is GNU sed, installed on every linux distribution: it supports extended regular expressions and has GNU extensions (often supported) . Unfortunately I don't have a Mac to test; have you tried with "-E" or "-r" options (if available) ? Or you could install gnu-sed package and it should work. However, to be more portable, the sed part could be rewritten into: sed 's/[[:space:]]+/ /g' | tr ' ' '\n' | sed '/^$/d' First it changes (repeated) whitespaces and tabs into a single space, then tr changes them into newlines and then the final sed removes all blank lines (for rows with spaces/tabs at the beginning or at the end) Or tr ' \t' '\n\n' |sed '/^$/d' simple, it should work anywhere, even with older sed versions, but students don't see the use of sed to remove repeated spaces and tabs.

bpnx commented 7 years ago

Forgot to say: thank you for your analysis!

danmichaelo commented 7 years ago

I'm sorry, it's really annoying when you find a nice example command, just to learn that it doesn't work cross platform.. Anyways, -E is available, but didn't help. Unless we add GNU sed to our install requirements, we probably have to live with this.

Can confirm that this works:

sed -E 's/[[:space:]]+/ /g' | tr ' ' '\n' | sed '/^$/d'

(Had to include -E in the first sed)

The last example also, not sure which one is the best to teach..

bpnx commented 7 years ago

I forgot to set the "-E" flag, sorry. About the last example: in my environment also tr [:space:] '\n' works (note the single \n), I think in BSD/Mac should work too Please let me know what example you think it's better to teach, so I change the PR Thank you