FredHutch / wiki

SciWiki: Collective KnowledgeBase for Scientific Data and Use
https://sciwiki.fredhutch.org

ELI5: tr, sed and awk AKA: `coreutils` in our Linux page #455

Open vortexing opened 4 years ago

vortexing commented 4 years ago

Proposed Domain: Scientific computing - linux 101 page

Content Summary: Can someone provide some introductory material, from the web or wherever, about why you should stop and learn how to use these commands to do things such as manipulating files for data cleaning?

Local Content Expert(s): @atombaby or @k8hertweck if you happen to have any links handy on this!!!
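For anyone skimming this thread, here is a rough sketch of what "manipulating files for data cleaning" can look like with each of the three tools. The sample file, its contents, and the helper names (`upper`, `trim_fields`, `sum_counts`) are all made up for illustration; this is only one possible way to show the idea, not a recommended recipe.

```shell
# Hypothetical sample data: a tiny comma-separated file with messy casing
# and stray spaces (file name and contents are invented for this example).
sample=$(mktemp)
printf 'id,Gene,count\n1, BRCA1 ,10\n2,tp53,7\n' > "$sample"

# tr: translate characters one-for-one (here, lowercase to uppercase).
upper() { tr 'a-z' 'A-Z'; }

# sed: edit each line with a substitution (here, drop spaces around commas).
trim_fields() { sed 's/ *, */,/g'; }

# awk: treat each line as fields (here, sum column 3, skipping the header).
sum_counts() { awk -F',' 'NR > 1 { total += $3 } END { print total }'; }

# Clean up the whole file: tidy the separators, then uppercase everything.
trim_fields < "$sample" | upper

# Add up the count column (10 + 7).
sum_counts < "$sample"
```

The point of the sketch is the division of labor: `tr` works on characters, `sed` on lines, and `awk` on fields, and they chain together with pipes.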

bmcgough commented 4 years ago

Oooh, fun! I'd love to write something for this, but web links are probably a better/faster option, yes. I'm not a sed expert, but tr and awk are some of my best friends. This should include a link to explainshell as it covers these tools.

Actually, I have been wanting to do a page on coreutils - the answer to any and all shell-based data manipulation issues! (coreutils includes tr, cut, uniq, sort, head, tail, paste, comm, and one of my favorites, tac, among many, many others.)
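As a taste of how a few of these combine, here is a minimal sketch. The input labels and the function names (`count_labels`, `second_column`) are hypothetical, and `tac` is assumed to be the GNU coreutils version (it is not installed by default on every system).

```shell
# Hypothetical input: one sample label per line (invented for this example).
labels=$(printf 'tumor\nnormal\ntumor\ntumor\nnormal\ncontrol\n')

# Count how often each label appears, most frequent first.
# sort groups identical lines, uniq -c counts each group,
# and sort -rn orders the result by count, descending.
count_labels() { sort | uniq -c | sort -rn; }

# cut pulls one delimited column out of each line.
second_column() { cut -d',' -f2; }

# Show only the most common label with its count.
printf '%s\n' "$labels" | count_labels | head -n 1

# Extract the second column, then print its lines in reverse order with tac.
printf 'a,1\nb,2\nc,3\n' | second_column | tac
```

Each tool does one small job, and the pipe is what turns them into something useful.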

That being said, I have never found a really good single source of information on these tools... :(

vortexing commented 4 years ago

If you write it (slowly, in parts), they will read it. And by they, I certainly mean at least me. I know I can't be alone here. Also, I THOUGHT I put in the explainshell.com link.... hmmmm. Where did I put it (maybe in my deleted PR?)?

I'm in full support of a coreutils for dummies section in this markdown: https://github.com/FredHutch/wiki/blob/master/_scicomputing/software_linux101.md

Ping me for editing help if it's helpful!

MattJensenData commented 4 years ago

Peteris Krumins's explanations of one-liners (on catonmat.net) are great. He later published them in book form, too. awk: https://catonmat.net/awk-one-liners-explained-part-one sed: https://catonmat.net/sed-one-liners-explained-part-one

ptvan commented 4 years ago

On a related note, Data Science at the Command Line (https://www.datascienceatthecommandline.com/) is an excellent resource which has some coverage of sed/awk/tr, etc. Though likely more a Resource Library type thing...

atombaby commented 4 years ago

We had a discussion in the wiki-writers session about what this page might look like without duplicating too much that's already available on other sites.

The goal for these docs should be to give someone without experience with these command line text processing tools enough information to search the available documentation and external resources for answers to their specific questions. For example, information like:

You get the idea. Then we can link to useful sources after that brief introduction.

bmcgough commented 4 years ago

Good plan. I've reviewed a number of external resources suggested here and found elsewhere, and while there is a lot of good information out there, there are a lot of concepts taken for granted, such as:

While the link above from @ptvan is great and explains these things, it is not as accessible as a page in our wiki.

Is it even possible to teach these things quickly and concisely? Is it worth teaching more advanced pipelines without understanding these things?

I also think an upside-down (procedure-based) version of what @atombaby suggests would be useful:

ptvan commented 4 years ago

I agree that one of the wiki's main strengths is good introductions ("Core Utils 101") from which readers can jump off into more advanced resources. Thanks for all your continued work!