betterscientificsoftware / bssw.io

Better Scientific Software Homepage
https://bssw.io

What topics are we posting blogs about #1155

Open bernhold opened 2 years ago

bernhold commented 2 years ago

We might want to take a look at the distribution of topics in our blog articles versus the full list of topics we currently support. Can we try to solicit/create posts in topic areas where we have low or no coverage?

We could do a similar analysis across the whole site.

markcmiller86 commented 1 year ago

Not sure how much help this is, but here is the list of unique terms of at least 7 characters appearing in blog article titles...


Academic    Accelerating    Accepting   Activity    Addressing  Adopting
Advanced    Advances    Afghanistan Analytics   Anniversary Apollos
Applications    Approach    Approaches  Architectures   Article Assessing
Association Assurance   Automated   BSSw.io Banshees    Bestiary
Better  Bloodsuckers    Brains  Bringing    Broadening  Building
Careers Celebrating Challenge   Challenges  Citation    Cleaning
Climate Collaboration   Collegeville    Communities Community   Complexity
Computing   Conceptualizing Conference” Confidence  Connections Consciousness
Continuous  Contribute  Contributions   Correctness Coupling    Craftspeople
Creating    Cultural    Cyberinfrastructure DEI-Focused Data-driven Deployment
Develop Developers  Developing  Development Différence  Discovery
Distributed Ecosystem   Effective   Effectively Encouraging Engineer
Engineering Engineers   Environment Evaluation  Exascale    Experiences
Extreme-Scale   FLOPSWatt   Fellows Fellowship  Flatten Forward
Generation  Globally    Guidance    Hacking High-Quality    Highlights
Implementing    Improve Improved    Improvement Improving   Increasing
Insightful  Institute   Institution Integrating Integration Interns
Introducing Introduction    Keeping Leading Learning    Lifecycle
Locally Long-Term   Long-Timescale  Looking Machine Maintainers
Manager Materials   Medium-Sized    Metrics Modeling    Molecular
Multiphysics    NSF-Sponsored   Navigating  Non-Deterministic   Optimize    Outreach
Overcoming  Package Participation   Pending Performance Personal
Perspectives    Planning    Policies    Portability Porting Practice
Practices   Preparing   Productive  Productivity    Program Programming
Programs    Project Projects    Promoting   Publications    RSE-HPC-2020
RSE-HPC-2021    Refactoring Reflecting  Refresh Refreshment Registries
Remotely    Replacing   Repositories    Research    Resources   Retrospective
Reusable    Scaling Science Sciences    Scientific  Simulation
Software    Software-Focused    Special Spreading   Stories Strategies
Streamlining    Successes   SuperLU Supercomputer   Surfaces    Sustainability
Sustainable Talking Teams   Technology  Terminology Test_doc
Testing Thanks  Training    Transition  Tricks  Trusted
Twitter UNPUBLISHED US-RSE  Understanding   Updates Verification
Visible Webinar Well-Documented Working Workshop    Writing
Master Slave
markcmiller86 commented 1 year ago

Oh, and here's the command I used (from the Articles/Blog directory)...

grep '^# ' *.md | tr ' ' '\n' | grep -v '^.*\.md:#' | grep '.......' | tr -d ",:'\*?!()/" | sort | uniq | paste - - - - - -
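
For readability, here is the same pipeline written out one stage per line (a sketch; the behavior should be unchanged):

grep '^# ' *.md |           # title lines from every markdown file (filename-prefixed)
    tr ' ' '\n' |           # one token per line
    grep -v '^.*\.md:#' |   # drop the "filename.md:#" tokens grep prepends
    grep '.......' |        # keep tokens of at least 7 characters
    tr -d ",:'\*?!()/" |    # strip common punctuation
    sort | uniq |           # de-duplicate
    paste - - - - - -       # lay out in six columns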
bernhold commented 1 year ago

Well, that's interesting; I'll have to take a closer look. Though I was actually thinking of "topics" specifically in the sense that we use in our metadata. Sorry for not being clear.

rinkug commented 1 year ago

EDIT: The scripts don't work well right now; just leaving them here until I come back to them.

I am adding two scripts below.

SCRIPT1: This script recursively searches all files in a directory and its subdirectories and extracts topic names. The input is a directory; the output is a CSV file whose rows contain (1) the resource type, (2) the title, (3) the file path, and (4) the extracted topic list.

To run the script, do the following: chmod +x script1.sh, then ./script1.sh <input_directory> <output_file>.

---script 1 start----

#!/bin/bash

# Check if the user wants to display help message
if [[ "$1" == "-h" || "$1" == "--help" ]]; then
    echo "Usage: $0 <input_directory> <output_file>"
    echo "Prints resource type, title, path, and topics for all files under the given directory."
    exit 0
fi

# Check that both the input directory and output file were provided as command line arguments
if [ $# -ne 2 ]; then
    echo "Please provide input directory and output file as command line arguments."
    echo "For more information, run '$0 -h'"
    exit 1
fi

# Assign command line arguments to variables
input_dir="$1"
output_file="$2"

# Function to iterate over files recursively
function process_files {
    # Loop over all files and subdirectories in the directory
    for file in "$1"/*
    do
        if [ -f "$file" ] # Check if the current item is a file
        then
            # Find the line starting with "Topics:"
            topics=$(grep "^Topics:" "$file") # Store the line in a variable called 'topics'

            # If the line was found, extract the sequence of words
            if [ -n "$topics" ] # If 'topics' variable is not empty
            then
                # Extract the sequence of words after the "Topics:" line
                topics=$(echo "$topics" | sed 's/^[^:]*://') # Remove everything before the first colon

                # Remove leading and trailing spaces
                topics=$(echo "$topics" | sed 's/^[[:space:]]*//') # Remove leading spaces
                topics=$(echo "$topics" | sed 's/[[:space:]]*$//') # Remove trailing spaces

                # Convert the sequence of words to a CSV list:
                # trim surrounding spaces, collapse spaces around commas,
                # then quote each topic individually
                output_topics=$(echo "$topics" | sed 's/^ *//; s/ *, */,/g; s/,/","/g; s/^ //')

                # Extract the title from the file
                # Extract the line starting with # and remove the hash and space characters
                title=$(grep -m 1 "^#" "$file" | sed 's/^# //')

                # Extract the required words from the file path
                path=$(dirname "$file")
                resource_type=$(echo "$path" | grep -ioE "(blog|howtos|whatis|shortarticle|curatedcontent|events)")

                # Print resource_type, title, path, and output_topics in CSV format
                echo "$resource_type,\"$title\",$file,\"$output_topics\"" >> "$output_file"
            fi
        elif [ -d "$file" ] # Check if the current item is a directory
        then
            # If the file is a directory, recursively call the function on the directory
            process_files "$file"
        fi
    done
}

# Call the function to process files in the directory and subdirectories
process_files "$input_dir"

---script 1 end----
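
For reference, each row that script 1 writes has the following shape (the values below are hypothetical, shown only to illustrate the format; note that the topics come out as several individually quoted trailing columns):

blog,"Some Article Title",/path/to/Articles/Blog/SomeArticle.md,"testing","continuous integration testing"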

SCRIPT2: The second script takes three command line arguments: the CSV file from script 1, a file containing the topic list, and an output file name. The script reads each topic name from the topic-list file and searches for it in the CSV file. It writes the matching rows to the output file, counts the number of occurrences of each topic, and appends a "topic name, count" summary table at the end.

---script 2 start----

#!/bin/bash

# Function to display usage/help message
usage() {
    echo "Usage: $0 -c <csv_file> -s <search_text_file> -o <output_file>"
    echo "Example: $0 -c /path/to/csv/file.csv -s /path/to/search/text/file.txt -o /path/to/output/results.txt"
    exit 1
}

# Parse command line arguments
while getopts ":c:s:o:" opt; do
  case ${opt} in
    c )
      csv_file=$OPTARG
      ;;
    s )
      search_text_file=$OPTARG
      ;;
    o )
      output_file=$OPTARG
      ;;
    \? )
      usage
      ;;
    : )
      echo "Invalid option: $OPTARG requires an argument" 1>&2
      usage
      ;;
  esac
done
shift $((OPTIND -1))

# Check if required command line arguments are provided
if [[ -z $csv_file ]] || [[ -z $search_text_file ]] || [[ -z $output_file ]]
then
    usage
fi

# Create an empty array to store the counts of each search text
declare -a count_array

# Loop to read each line of the search text input file and search for each text in the CSV file
# Results are appended to the output file along with the search text and count of files for each result
# The count of files for each search text is added to the count_array
while read -r text
do
    echo "Search text: $text" >> "$output_file"
    # Search for the text anywhere in the line and append the first three CSV columns of each match to the output file
    grep -i "$text" "$csv_file" | cut -d "," -f 1,2,3 >> "$output_file"
    count=$(grep -ic "$text" "$csv_file")
    echo "Total Files: $count" >> "$output_file"
    echo "" >> "$output_file"
    count_array+=("$text:$count")
done < "$search_text_file"

# Append a table at the end of the output file showing the count of files for each search text
echo "A table showing topic name, followed by article count (based on input csv file)" >> "$output_file"
for text_count in "${count_array[@]}"
do
    text=$(echo "$text_count" | cut -d ":" -f 1)
    count=$(echo "$text_count" | cut -d ":" -f 2)
    echo "$text,  $count" >> "$output_file"
done

---script 2 end----
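
One caveat on the counting in script 2: grep matches anywhere in the line, so a short topic such as "Testing" will also count rows tagged "Continuous integration testing". Since script 1 quotes each topic individually, one possible tightening (an untested sketch) is to match the quoted form as a fixed string:

count=$(grep -icF -- "\"$text\"" "$csv_file")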

Script 2 also needs a list of all topics in a file (which we can call topics.txt); note the caveat after the list about the category header lines:


Better category
Software process improvement
Software engineering
Requirements
Design
Software interoperability
Software sustainability
Development category
Documentation
Configuration and builds
Revision control
Release and deployment
Issue tracking
Programming languages
Development tools
Refactoring
Performance category
High-performance computing (HPC)
Performance at leadership computing facilities (LCFs)
Performance portability
Cloud computing
Big data
Reliability category
Peer code review
Testing
Continuous integration testing
Reproducibility
Debugging
Collaboration category
Projects and organizations
Strategies for more effective teams
Inclusivity
Funding sources and programs
Software publishing and citation
Licensing
Discussion and question sites
Conferences and workshops
Research Software Engineers
Skills category
Online learning
In-Person learning
Personal productivity and sustainability
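
One thing to watch: script 2 treats every line of topics.txt as a search term, including the "... category" header lines above. If you want only the actual topics counted, a one-liner like this (a sketch) filters them out first:

grep -v ' category$' topics.txt > topics-only.txt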

To run the scripts and get the list of topics and their counts for blogs, curated content, and events, do the following.

Download the repo to your laptop, open a terminal, and copy in the two scripts and the topics.txt file.
Run the following to get statistics for the Articles directory:
./script1.sh /path-to-bssw.io-repo/Articles results-articles.csv
./script2.sh -c results-articles.csv -s topics.txt -o results-articles-count.txt

Run the following to get statistics for the curated content directory:
./script1.sh /path-to-bssw.io-repo/curatedcontent results-curated.csv
./script2.sh -c results-curated.csv -s topics.txt -o results-curated-count.txt

Run the following to get statistics for the events directory:
./script1.sh /path-to-bssw.io-repo/events results-events.csv
./script2.sh -c results-events.csv -s topics.txt -o results-events-count.txt

Run the following to get statistics for everything under bssw.io (which includes the articles, curated content, and events directories):
./script1.sh /path-to-bssw.io-repo/ results-all.csv
./script2.sh -c results-all.csv -s topics.txt -o results-all-count.txt
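
If it helps, the per-directory runs can also be wrapped in a small loop (a sketch using the same paths as above; the output file names differ slightly from those listed):

for d in Articles curatedcontent events; do
    ./script1.sh "/path-to-bssw.io-repo/$d" "results-$d.csv"
    ./script2.sh -c "results-$d.csv" -s topics.txt -o "results-$d-count.txt"
done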
markcmiller86 commented 1 year ago

Awesome @rinkug 💪🏻

Would suggest adjusting the script to sort results by descending count 😉
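
One way to do that (an untested sketch) would be to replace the final loop in script 2 with something like:

# Emit the "topic,  count" table sorted by count, descending
for text_count in "${count_array[@]}"; do
    echo "${text_count%:*},  ${text_count##*:}"
done | sort -t',' -k2 -nr >> "$output_file"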