Pipelines show up in most walks of life: digital circuits, software, transportation, industry, sales. They are almost everywhere! A data analysis pipeline takes inputs and passes them through a number of processing steps that are chained together to produce the desired output. It is a chain of commands that can be run on one or more data sets, which is very helpful when an analysis has to be rerun, especially across multiple files. A Makefile describes a pipeline of shell commands and the interdependencies between the input and output files of those commands.
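To make the idea of file interdependencies concrete, here is a minimal Python sketch of the timestamp rule that make applies: a target is rebuilt when it is missing or older than any of the files it depends on. The function name `needs_rebuild` is hypothetical and is not part of this repo; real make handles much more (phony targets, pattern rules, and so on).

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    Hypothetical helper that mimics make's basic timestamp comparison.
    """
    if not os.path.exists(target):
        return True  # nothing built yet, so the step must run
    target_mtime = os.path.getmtime(target)
    # Rebuild if at least one input was modified after the target was written.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

Under these assumptions, a call such as `needs_rebuild("dataset_merge.txt", ["files/part1.txt"])` answers the same question make asks before rerunning a step (the input file name here is only illustrative).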
Problem at hand:
I have a number of text files that contain words. Unfortunately, the files are fragmented. They could have come from another pipeline as separate pieces of the same file, and I would like to combine all of them into one large master file that I can use as my dataset.
The Pipeline
The pipeline starts with a Python seed component, merger.py, that traverses the directory tree under the specified path and looks for files whose names match a certain pattern. The script then concatenates those files into one (assuming that all files from a "prior" pipeline were put in one directory).
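The traversal and concatenation step described above can be sketched roughly as follows. This is not the actual contents of merger.py; the function name `merge_files`, its parameters, and the choice to sort the matched paths are assumptions made for illustration.

```python
import fnmatch
import os


def merge_files(root, pattern, output_path):
    """Concatenate every file under `root` whose name matches `pattern`.

    Hypothetical sketch; returns the list of files that were merged.
    """
    output_abs = os.path.abspath(output_path)
    matches = []
    # os.walk visits `root` and every directory below it.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            # Skip a merged file left over from an earlier run.
            if os.path.abspath(path) != output_abs:
                matches.append(path)
    matches.sort()  # assumption: a fixed order keeps the output reproducible

    with open(output_path, "w", encoding="utf-8") as out:
        for path in matches:
            with open(path, encoding="utf-8") as src:
                text = src.read()
            out.write(text)
            # Keep pieces from running together when a file lacks a final newline.
            if text and not text.endswith("\n"):
                out.write("\n")
    return matches
```

With this sketch, something like `merge_files("files", "*.txt", "dataset_merge.txt")` would produce the merged dataset from the text files in the files directory.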
The output of merger.py is dataset_merge.txt. This output file is then fed into the trump_words.R script, which runs the required analyses on dataset_merge.txt and generates the required plots, as shown in the md file. The md file was generated purely for visualization.
Install Python using any of the methods specified here or here, depending on your operating system.
Change the path in the merger.py file, path="/Users/rasiimwe/hw09-rasiimwe/files/", to align with your system environment.
Install the required R packages:
install.packages("tm") # to support text mining
install.packages("SnowballC") # to support text stemming
install.packages("wordcloud") # word-cloud generator
Homework Files | Description |
---|---|
README.md | This README.md file gives an overview of the gist of this repo and provides useful pointers to key files in my homework-09 repo. It also links to past files that provide an introduction to data exploration and analysis |
Link to Makefile | This file describes the pipeline commands and the interdependencies of each of the input and output files |
Link to md file | This file was rendered purely for visualization |
Link to R script | R source code that runs the analyses on the merged dataset and produces the required pipeline plots |
Files | Directory that contains the base files (*.txt) that the Python script merges |
STAT 5457M notes on automating Data-analysis Pipelines
Text mining and word cloud fundamentals in R : 5 simple steps you should know