bernhold opened this issue 2 years ago
Not sure how much help this is, but here is the list of unique terms of at least 7 characters appearing in blog article titles...
Academic Accelerating Accepting Activity Addressing Adopting
Advanced Advances Afghanistan Analytics Anniversary Apollos
Applications Approach Approaches Architectures Article Assessing
Association Assurance Automated BSSw.io Banshees Bestiary
Better Bloodsuckers Brains Bringing Broadening Building
Careers Celebrating Challenge Challenges Citation Cleaning
Climate Collaboration Collegeville Communities Community Complexity
Computing Conceptualizing Conference” Confidence Connections Consciousness
Continuous Contribute Contributions Correctness Coupling Craftspeople
Creating Cultural Cyberinfrastructure DEI-Focused Data-driven Deployment
Develop Developers Developing Development Différence Discovery
Distributed Ecosystem Effective Effectively Encouraging Engineer
Engineering Engineers Environment Evaluation Exascale Experiences
Extreme-Scale FLOPSWatt Fellows Fellowship Flatten Forward
Generation Globally Guidance Hacking High-Quality Highlights
Implementing Improve Improved Improvement Improving Increasing
Insightful Institute Institution Integrating Integration Interns
Introducing Introduction Keeping Leading Learning Lifecycle
Locally Long-Term Long-Timescale Looking Machine Maintainers
Manager Materials Medium-Sized Metrics Modeling Molecular
Multiphysics NSF-Sponsored Navigating Non-Deterministic Optimize Outreach
Overcoming Package Participation Pending Performance Personal
Perspectives Planning Policies Portability Porting Practice
Practices Preparing Productive Productivity Program Programming
Programs Project Projects Promoting Publications RSE-HPC-2020
RSE-HPC-2021 Refactoring Reflecting Refresh Refreshment Registries
Remotely Replacing Repositories Research Resources Retrospective
Reusable Scaling Science Sciences Scientific Simulation
Software Software-Focused Special Spreading Stories Strategies
Streamlining Successes SuperLU Supercomputer Surfaces Sustainability
Sustainable Talking Teams Technology Terminology Test_doc
Testing Thanks Training Transition Tricks Trusted
Twitter UNPUBLISHED US-RSE Understanding Updates Verification
Visible Webinar Well-Documented Working Workshop Writing
Master Slave
Oh, and here's the command I used (from the Articles/Blog directory)...
grep '^# ' *.md | tr ' ' '\n' | grep -v '^.*\.md:#' | grep '.......' | tr -d ",:'\*?!()/" | sort | uniq | paste - - - - - -
Well, that's interesting; I'll have to take a closer look. Though I was actually thinking of "topics" specifically in the sense that we have in our metadata. Sorry for not being clear.
I am adding two scripts below.
SCRIPT1: This script recursively searches all files in a directory and its subdirectories and extracts topic names. The input is a directory; the output is a CSV file that includes the (1) resource type, (2) title, (3) file path, and (4) extracted topic list.
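For context, script1 assumes each source file carries a metadata line like the one below (a hypothetical example; the real topic names come from the site metadata) and emits one CSV row per matching file:

Topics: Testing, Continuous integration testing

which, for a file under Articles/Blog, would produce a row along the lines of:

Blog,"Some Article Title",/path-to-bssw.io-repo/Articles/Blog/SomeArticle.md,"Testing","Continuous integration testing"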
To run the script, make it executable with chmod +x script1.sh, then invoke it as ./script1.sh <input_directory> <output_file>.
---script 1 start----
#!/bin/bash
# Check if the user wants to display help message
if [[ "$1" == "-h" || "$1" == "--help" ]]; then
echo "Usage: $0 <input_directory> <output_file>"
echo "Prints resource type, title, path, and topics for all files under the given directory."
exit 0
fi
# Check that both the input directory and output file were provided as command line arguments
if [ $# -ne 2 ]; then
echo "Please provide input directory and output file as command line arguments."
echo "For more information, run '$0 -h'"
exit 1
fi
# Assign command line arguments to variables
input_dir="$1"
output_file="$2"
# Start with an empty output file so repeated runs do not append duplicate rows
: > "$output_file"
# Function to iterate over files recursively
function process_files {
# Loop over all files and subdirectories in the directory
for file in "$1"/*
do
if [ -f "$file" ] # Check if the current item is a file
then
# Find the line starting with "Topics:"
topics=$(grep "^Topics:" "$file") # Store the line in a variable called 'topics'
# If the line was found, extract the sequence of words
if [ ! -z "$topics" ] # If 'topics' variable is not empty
then
# Extract the sequence of words after the "Topics:" line
topics=$(echo "$topics" | sed 's/^[^:]*://') # Remove everything before the first colon
# Remove leading and trailing spaces
topics=$(echo "$topics" | sed 's/^[[:space:]]*//') # Remove leading spaces
topics=$(echo "$topics" | sed 's/[[:space:]]*$//') # Remove trailing spaces
# Convert the comma-separated topics to a CSV field list:
# trim spaces around the commas, then replace each comma with "," so
# every topic ends up individually quoted in the output row
output_topics=$(echo "$topics" | sed 's/^ *//; s/ *, */,/g; s/,/","/g; s/^ //')
# Extract the title from the file
# Extract the line starting with # and remove the hash and space characters
title=$(grep -m 1 "^#" "$file" | sed 's/^# //')
# Extract the required words from the file path
path=$(dirname "$file")
resource_type=$(echo "$path" | grep -ioE "(blog|howtos|whatis|shortarticle|curatedcontent|events)" | head -n 1) # Keep only the first keyword if the path matches several
# Print resource_type, title, path, and output_topics in CSV format
echo "$resource_type,\"$title\",$file,\"$output_topics\"" >> "$output_file"
fi
elif [ -d "$file" ] # Check if the current item is a directory
then
# If the file is a directory, recursively call the function on the directory
process_files "$file"
fi
done
}
# Call the function to process files in the directory and subdirectories
process_files "$input_dir"
---script 1 end----
SCRIPT2: The second script takes three command line arguments: the CSV file produced by script1, a file containing the topic list (one topic per line), and an output file name. The script reads each topic name from the topic-list file and searches for it in the CSV file. It then counts the number of occurrences of that topic in the CSV file and writes the matching rows plus a "topic name, count" summary to the output file.
---script 2 start----
#!/bin/bash
# Function to display usage/help message
usage() {
echo "Usage: $0 -c <csv_file> -s <search_text_file> -o <output_file>"
echo "Example: $0 -c /path/to/csv/file.csv -s /path/to/search/text/file.txt -o /path/to/output/results.txt"
exit 1
}
# Parse command line arguments
while getopts ":c:s:o:" opt; do
case ${opt} in
c )
csv_file=$OPTARG
;;
s )
search_text_file=$OPTARG
;;
o )
output_file=$OPTARG
;;
\? )
usage
;;
: )
echo "Invalid option: $OPTARG requires an argument" 1>&2
usage
;;
esac
done
shift $((OPTIND -1))
# Check if required command line arguments are provided
if [[ -z $csv_file ]] || [[ -z $search_text_file ]] || [[ -z $output_file ]]
then
usage
fi
# Start with an empty output file so repeated runs do not append duplicates
: > "$output_file"
# Create an empty array to store the counts of each search text
declare -a count_array
# Loop to read each line of the search text input file and search for each text in the CSV file
# Results are appended to the output file along with the search text and count of files for each result
# The count of files for each search text is added to the count_array
while read -r text
do
echo "Search text: $text" >> "$output_file"
# Search for the topic anywhere in the CSV file (as a fixed string,
# case-insensitive) and append the first three columns of each match
grep -iF "$text" "$csv_file" | cut -d "," -f 1,2,3 >> "$output_file"
count=$(grep -icF "$text" "$csv_file")
echo "Total Files: $count" >> "$output_file"
echo "" >> "$output_file"
count_array+=("$text:$count")
done < "$search_text_file"
# Append a table at the end of the file showing the count of files for each search text
echo "A table showing topic name, followed by article count (based on input csv file)" >> "$output_file"
for text_count in "${count_array[@]}"
do
text=$(echo "$text_count" | cut -d ":" -f 1)
count=$(echo "$text_count" | cut -d ":" -f 2)
echo "$text, $count" >> "$output_file"
done
---script 2 end----
Script2 also needs a list of all topics in a file (which we can call topics.txt). Note that the category header lines below will be picked up as search terms too; see the grep one-liner after the list if you want to strip them.
Better category
Software process improvement
Software engineering
Requirements
Design
Software interoperability
Software sustainability
Development category
Documentation
Configuration and builds
Revision control
Release and deployment
Issue tracking
Programming languages
Development tools
Refactoring
Performance category
High-performance computing (HPC)
Performance at leadership computing facilities (LCFs)
Performance portability
Cloud computing
Big data
Reliability category
Peer code review
Testing
Continuous integration testing
Reproducibility
Debugging
Collaboration category
Projects and organizations
Strategies for more effective teams
Inclusivity
Funding sources and programs
Software publishing and citation
Licensing
Discussion and question sites
Conferences and workshops
Research Software Engineers
Skills category
Online learning
In-Person learning
Personal productivity and sustainability
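Since the list above interleaves category headers with the actual topic names, the headers will also be searched and counted by script2. A minimal way to strip them first (a sketch, assuming every header line ends in the word "category" as above):

grep -v ' category$' topics.txt > topics-only.txt

Then pass topics-only.txt to script2 via -s instead of topics.txt.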
To run the scripts and get the list of topics and their counts for blogs, curated content, and events, do the following.
Download the repo to your laptop, open a terminal, and copy in the two scripts and the topics.txt file.
Run the commands below to get statistics for the Articles directory:
./script1.sh /path-to-bssw.io-repo/Articles results-articles.csv
./script2.sh -c results-articles.csv -s topics.txt -o results-articles-count.txt
Run the commands below to get statistics for the curated content directory:
./script1.sh /path-to-bssw.io-repo/curatedcontent results-curated.csv
./script2.sh -c results-curated.csv -s topics.txt -o results-curated-count.txt
Run the commands below to get statistics for the events directory:
./script1.sh /path-to-bssw.io-repo/events results-events.csv
./script2.sh -c results-events.csv -s topics.txt -o results-events-count.txt
Run the commands below to get statistics for everything under bssw.io (which includes the articles, curated content, and events directories):
./script1.sh /path-to-bssw.io-repo/ results-all.csv
./script2.sh -c results-all.csv -s topics.txt -o results-all-count.txt
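For reference, the output of script2 should look roughly like this (hypothetical titles and counts, shown only to illustrate the format):

Search text: Testing
Blog,"Some Article About Testing",/path-to-bssw.io-repo/Articles/Blog/SomeArticle.md
Total Files: 1

A table showing topic name, followed by article count (based on input csv file)
Software process improvement, 3
Testing, 1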
Awesome @rinkug 💪🏻
Would suggest adjusting the script to sort results in descending order of count 😉
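For example, the summary loop at the end of script2 could be piped through sort before appending (a minimal sketch; -t ',' selects the comma as field separator and -k2 -rn sorts numerically on the count column, descending):

for text_count in "${count_array[@]}"
do
    text=$(echo "$text_count" | cut -d ":" -f 1)
    count=$(echo "$text_count" | cut -d ":" -f 2)
    echo "$text, $count"
done | sort -t ',' -k2 -rn >> "$output_file"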
We might want to take a look at the distribution of topics in our blog articles versus the full list of topics we currently support. Can we try to solicit/create posts in topic areas where we have low or no coverage?
We could do a similar analysis across the whole site.
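One quick way to spot the gaps from the script2 output (a sketch, assuming the "topic, count" summary table that script2 appends after its "A table showing..." header line):

awk -F', ' 'found && $2 == 0 {print $1} /^A table showing/ {found=1}' results-articles-count.txt

This prints every topic with zero matching articles.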