fabriziosalmi / blacklists

Hourly updated domains blacklist 🚫
https://fabriziosalmi.github.io/blacklists/
GNU General Public License v3.0

Sanitize it! #1

Closed fabriziosalmi closed 1 year ago

fabriziosalmi commented 1 year ago

To clean up the file as per your requirements, you can use the following bash script. This script will:

  1. Remove comments (lines starting with #).
  2. Remove any extra content after the FQDN in each line.
  3. Remove any lines that don't start with a valid FQDN.

#!/bin/bash

# Remove comments, extra content, and invalid lines.
# Writes the result to "<input>.cleaned" and leaves the input file untouched.
clean_fqdns() {
    local input_file="$1"

    # 1. Remove comments (lines starting with #)
    # 2. Keep only the first whitespace-delimited token on each line
    # 3. Keep only lines made up of characters allowed in an FQDN
    # [[:space:]] is used instead of \s, which is a GNU sed extension
    sed -e '/^[[:space:]]*#/d' \
        -e 's/^\([^[:space:]]*\).*/\1/' "$input_file" |
        grep -E '^[a-zA-Z0-9.-]+$' > "${input_file}.cleaned"
}

# Test the function with your file "input_file.txt"
clean_fqdns "input_file.txt"

echo "File cleaned and saved as 'input_file.txt.cleaned'"

Save the script in a file (e.g., clean_fqdns.sh) and make it executable with chmod +x clean_fqdns.sh. Place the script in the same directory as your input file (e.g., input_file.txt), and then run the script with ./clean_fqdns.sh.

The script will clean up the file and create a new file called input_file.txt.cleaned, which will contain only one FQDN per row without any comments or extra content.
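
As a rough illustration of what the three steps do, the sketch below applies them as a filter on standard input. The function name clean_stream and the sample lines are made up for this example and are not part of the repository.

```bash
#!/bin/bash

# Hypothetical helper: the three cleaning steps as a stdin-to-stdout filter.
clean_stream() {
    sed -e '/^[[:space:]]*#/d' \
        -e 's/^\([^[:space:]]*\).*/\1/' |
        grep -E '^[a-zA-Z0-9.-]+$'
}

# Sample input (invented): a comment, a line with trailing content,
# a clean line, and a line with characters outside the allowed set.
printf '%s\n' \
    '# ad servers' \
    'ads.example.com # tracker' \
    'cdn.example.net' \
    'bad_host!' |
    clean_stream
```

With this input, only ads.example.com and cdn.example.net are printed. Note that the character-class check is loose: it also accepts strings such as "..." or "-", which are not valid hostnames.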

fabriziosalmi commented 1 year ago

Let's add progress reporting with pv and split the cleaning into separate passes, so the final file is well formatted with one FQDN per row. The modified script runs each sed and grep step as its own pass.

If pv is not installed yet, install it with your system's package manager (e.g., apt, yum, brew). Then update the script as follows:

#!/bin/bash

# Function to remove comments, extra content, and invalid lines
clean_fqdns() {
    local input_file="$1"
    local temp_file="${input_file}.tmp"
    local cleaned_file="${input_file}.cleaned"

    # Remove comments (lines starting with #), showing progress with pv
    # [[:space:]] is used instead of \s, which is a GNU sed extension
    pv "$input_file" | sed '/^[[:space:]]*#/d' > "$temp_file"

    # Remove extra content after FQDN in each line
    pv "$temp_file" | sed 's/\([^[:space:]]*\).*/\1/' > "$cleaned_file"

    # Remove lines not starting with a valid FQDN
    pv "$cleaned_file" | grep -E '^[a-zA-Z0-9.-]+$' > "${input_file}.formatted"

    # Clean up temporary files
    rm "$temp_file" "$cleaned_file"
}

# Test the function with your file "input_file.txt"
clean_fqdns "input_file.txt"

echo "File cleaned and formatted as 'input_file.txt.formatted'"

Make the script executable with chmod +x clean_fqdns.sh and run it with ./clean_fqdns.sh.

Now the script will display progress using pv during each step of the cleaning process, and it will save the final well-formatted file with one FQDN per row as input_file.txt.formatted. The intermediate temporary files will be cleaned up automatically.

Note that pv only reports progress; it does not make the processing itself faster. The input file is left untouched, and only the .formatted file is kept at the end.
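
Since pv is an extra dependency, a small wrapper can fall back to cat when pv is not installed. This is only a sketch; the name with_progress is an assumption made for this example.

```bash
#!/bin/bash

# Hypothetical wrapper: stream a file through pv when it is installed,
# and fall back to plain cat otherwise. pv writes its progress bar to
# stderr, so stdout carries only the file content in both cases.
with_progress() {
    if command -v pv > /dev/null 2>&1; then
        pv "$1"
    else
        cat "$1"
    fi
}

# Usage (same first pass as in the script above):
#   with_progress input_file.txt | sed '/^[[:space:]]*#/d' > input_file.txt.tmp
```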

fabriziosalmi commented 1 year ago

#!/bin/bash

# Check whether a string looks like a valid FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'

    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}

# Counter for the number of valid FQDNs
valid_fqdns=0

# The file to check is the first argument to the script
input_file="$1"

# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi

# Counter for the total number of lines
total_lines=0

# Loop over each line of the file and check whether it is a valid FQDN
while IFS= read -r line; do
    ((total_lines++))
    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
    fi
done < "$input_file"

# Compute the percentage of valid FQDNs
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi

# Print the number and percentage of valid FQDNs
echo "Total number of lines: $total_lines"
echo "Number of valid FQDNs: $valid_fqdns"
echo "Percentage of valid FQDNs: $percentage%"

fabriziosalmi commented 1 year ago

#!/bin/bash

# Check whether a string looks like a valid FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'

    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}

# Counter for the number of valid FQDNs
valid_fqdns=0

# The file to check is the first argument to the script
input_file="$1"

# Name of the file where the valid FQDNs are saved
valid_fqdn_file="valid_fqdns.txt"

# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi

# Counter for the total number of lines
total_lines=0

# Start with an empty output file so reruns do not append duplicates
: > "$valid_fqdn_file"

# Loop over each line of the file and check whether it is a valid FQDN
while IFS= read -r line; do
    ((total_lines++))
    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
        echo "$line" >> "$valid_fqdn_file"
    fi
done < "$input_file"

# Compute the percentage of valid FQDNs
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi

# Print the number and percentage of valid FQDNs
echo "Total number of lines: $total_lines"
echo "Number of valid FQDNs: $valid_fqdns"
echo "Percentage of valid FQDNs: $percentage%"
echo "Valid FQDNs saved in: $valid_fqdn_file"
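
For large lists, a read loop in bash can be slow. As a sketch only, the same count-and-save logic can be done in a single awk pass; the function name fqdn_stats and its output format are assumptions made for this example.

```bash
#!/bin/bash

# Hypothetical helper: read "$1", write lines that pass the same
# character-class check to "$2", and print a one-line summary.
fqdn_stats() {
    awk -v out="$2" '
        BEGIN { printf "" > out }        # create or truncate the output file
        /^[a-zA-Z0-9.-]+$/ { valid++; print > out }
        END {
            pct = (NR > 0) ? int(valid * 100 / NR) : 0
            printf "total=%d valid=%d percent=%d\n", NR, valid, pct
        }
    ' "$1"
}

# Usage: fqdn_stats input_file.txt valid_fqdns.txt
```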
fabriziosalmi commented 1 year ago

#!/bin/bash

# Check whether a string looks like a valid FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'

    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}

# Counter for the number of valid FQDNs
valid_fqdns=0

# The file to check is the first argument to the script
input_file="$1"

# Name of the file where the valid FQDNs are saved
valid_fqdn_file="valid_fqdns.txt"

# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi

# Counter for candidate lines (everything that is not blank or a comment)
total_lines=0

# Matches blank lines and lines starting with a comment or separator character
skip_regex='^[[:space:]]*([#!|].*|$)'

# Start with an empty output file so reruns do not append duplicates
: > "$valid_fqdn_file"

# Loop over each line of the file, with pv showing progress.
# Process substitution keeps the loop in the current shell; with
# "pv ... | while", the counters would be updated in a subshell and lost.
while IFS= read -r line; do
    if [[ "$line" =~ $skip_regex ]]; then
        continue  # Skip this line
    fi

    ((total_lines++))

    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
        echo "$line" >> "$valid_fqdn_file"
    fi
done < <(pv "$input_file")

# Compute the percentage of valid FQDNs among the candidate lines
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi

# Print the number and percentage of valid FQDNs
echo "Total number of lines with an FQDN: $valid_fqdns"
echo "Percentage of lines with a valid FQDN: $percentage%"
echo "Valid FQDNs saved in: $valid_fqdn_file"
fabriziosalmi commented 1 year ago

#!/bin/bash

# Check whether a string looks like a valid FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'

    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}

# Counter for the number of valid FQDNs
valid_fqdns=0

# The file to check is the first argument to the script
input_file="$1"

# Name of the file where the valid FQDNs are saved
valid_fqdn_file="valid_fqdns.txt"

# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi

# Total number of lines in the input file
total_lines=$(wc -l < "$input_file")

# Start with an empty output file so reruns do not append duplicates
: > "$valid_fqdn_file"

# Loop over each line of the file, with pv showing progress.
# Process substitution keeps the loop in the current shell; with
# "pv ... | while", the counters would be updated in a subshell and lost.
while IFS= read -r line; do
    # Skip lines containing characters that are not valid in an FQDN
    if [[ "$line" =~ [^a-zA-Z0-9.-] ]]; then
        continue  # Skip this line
    fi

    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
        echo "$line" >> "$valid_fqdn_file"
    fi
done < <(pv "$input_file")

# Compute the percentage of valid FQDNs over all lines
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi

# Print the number and percentage of valid FQDNs
echo "Total number of lines with an FQDN: $valid_fqdns"
echo "Percentage of lines with a valid FQDN: $percentage%"
echo "Valid FQDNs saved in: $valid_fqdn_file"
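
One thing worth keeping in mind for any version of this loop that reads from pv: in bash, each part of a pipeline normally runs in a subshell, so counters incremented inside "pv file | while read ..." are lost when the loop ends. The sketch below shows the difference on three sample lines; the function names are made up for this example, and it assumes the lastpipe shell option is left at its default (off).

```bash
#!/bin/bash

# Loop fed by a pipe: the while loop runs in a subshell.
count_with_pipe() {
    local n=0
    printf '%s\n' a b c | while IFS= read -r line; do
        n=$((n + 1))
    done
    echo "$n"   # the increments happened in the subshell, so this prints 0
}

# Loop fed by process substitution: the loop runs in the current shell.
count_with_process_substitution() {
    local n=0
    while IFS= read -r line; do
        n=$((n + 1))
    done < <(printf '%s\n' a b c)
    echo "$n"   # prints 3
}

count_with_pipe
count_with_process_substitution
```

The same applies when printf is replaced by pv "$input_file": reading with "done < <(pv "$input_file")" keeps the progress bar and the counters.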
fabriziosalmi commented 1 year ago

Only IP addresses and/or IP ranges (IPv4/IPv6) are left to remove from the aggregated file.
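
A minimal sketch of such a filter, assuming one entry per line with no surrounding whitespace. It drops whole-line IPv4 addresses, IPv4 CIDR ranges, and any line containing a colon (an FQDN never contains one, while every IPv6 address or range does). The function name drop_ip_lines is hypothetical, and other range notations such as "a.b.c.d-e.f.g.h" are not handled.

```bash
#!/bin/bash

# Hypothetical filter: remove IP-looking lines from stdin.
# The IPv4 pattern is anchored at both ends, so a hostname that merely
# starts with four numeric labels (e.g. 1.2.3.4.example.com) is kept.
drop_ip_lines() {
    grep -Ev '^([0-9]{1,3}\.){3}[0-9]{1,3}(/[0-9]{1,2})?$|:'
}

# Usage: drop_ip_lines < all.fqdn.blacklist > all.fqdn.blacklist.noip
```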

fabriziosalmi commented 1 year ago

Certainly! We can further optimize the script to handle weird characters and cases by using more robust techniques for file handling and string manipulations. Additionally, we can improve the regular expressions used for FQDN validation. Here's the further optimized version:

#!/bin/bash

# Install pv only if it is missing (apt-get needs root privileges)
if ! command -v pv > /dev/null 2>&1 && ! apt-get install pv -y; then
  echo "Error: Failed to install 'pv' package."
  exit 1
fi

LISTS="blacklists.fqdn.urls"

# Download one list into a randomly named *.fqdn.list file
download_url() {
  local url="$1"
  local random_filename
  random_filename=$(uuidgen | tr -dc '[:alnum:]')

  if ! wget -q --progress=bar:force -O "$random_filename.fqdn.list" "$url"; then
    echo "Failed to download: $url"
  fi
}

echo "Download blacklists"

while IFS= read -r url; do
  download_url "$url"
done < "$LISTS"

echo "Aggregate blacklists"
cat ./*.fqdn.list > aggregated.fqdn.list

echo "Sanitize blacklists"
# Strip a leading 0.0.0.0 or 127.0.0.1 together with the whitespace after it
# (otherwise every hosts-style line would start with a space and fail the
# validation below), then sort and deduplicate the result
sed -E 's/^(0\.0\.0\.0|127\.0\.0\.1)[[:space:]]+//' aggregated.fqdn.list | sort -u > all.fqdn.blacklist
rm ./*.fqdn.list

# Accept only the allowed characters, with no hyphen at either end
is_valid_fqdn() {
  local input_string="$1"
  local fqdn_regex='^[a-zA-Z0-9.-]+$'

  if [[ "$input_string" =~ $fqdn_regex && "${input_string:0:1}" != "-" && "${input_string: -1}" != "-" ]]; then
    return 0
  else
    return 1
  fi
}

input_file="all.fqdn.blacklist"
valid_fqdn_file="all.fqdn.blacklist.tmp"

if [ ! -f "$input_file" ]; then
  echo "Error: File $input_file not found."
  exit 1
fi

total_lines=$(wc -l < "$input_file")
valid_fqdns=0

# Start with an empty temporary file so reruns do not append to old results
: > "$valid_fqdn_file"

# Process substitution keeps the loop in the current shell, so valid_fqdns
# still holds its value after the loop ends
while IFS= read -r line; do
  if ! is_valid_fqdn "$line"; then
    continue
  fi

  ((valid_fqdns++))
  echo "$line" >> "$valid_fqdn_file"
done < <(pv "$input_file")

mv "$valid_fqdn_file" "$input_file"
tar -czf all.fqdn.blacklist.tar.gz "$input_file"

echo "Total domains: $total_lines, $valid_fqdns appear to be valid"
echo "Blacklist plain: https://github.com/fabriziosalmi/blacklists/blob/main/all.fqdn.blacklist"
echo "Blacklist compressed: https://github.com/fabriziosalmi/blacklists/blob/main/all.fqdn.blacklist.tar.gz"

In this version, we've made the following optimizations:

  1. Changed the sed command to use the -E option for extended regular expressions, so a single alternation strips a leading 0.0.0.0 or 127.0.0.1 from hosts-style lines.
  2. Extended is_valid_fqdn to reject names that begin or end with a hyphen; the character-class regex itself is unchanged.
  3. Removed the unnecessary check for special characters in the while loop for FQDN validation, as it is now handled in the is_valid_fqdn function.

These optimizations should make the script more robust in handling various cases and weird characters in the domain names.
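
An alternative to stripping the address prefix with sed is to pick the hostname field with awk, which also copes with tabs and trailing comments. This is a sketch under the assumption that list entries are either bare hostnames or hosts-style "address hostname" pairs; the function name strip_sinkhole_prefix is hypothetical.

```bash
#!/bin/bash

# Hypothetical normalizer for hosts-style lines such as "0.0.0.0 example.com".
# If the first field is a sinkhole address, emit the second field;
# otherwise emit the first field. Comment and blank lines are skipped.
strip_sinkhole_prefix() {
    awk '
        /^[[:space:]]*(#|$)/ { next }
        ($1 == "0.0.0.0" || $1 == "127.0.0.1") { if (NF >= 2) print $2; next }
        { print $1 }
    '
}

# Usage: strip_sinkhole_prefix < aggregated.fqdn.list | sort -u > all.fqdn.blacklist
```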

fabriziosalmi commented 1 year ago

is_valid_fqdn improved

is_valid_fqdn() {
  local input_string="$1"
  local valid_tlds=("com" "org" "net" "edu" "gov" "mil" "int" "info" "biz" "name" "museum" "coop" "aero" "arpa" "asia" "cat" "jobs" "mobi" "post" "tel" "travel" "xxx" "ac" "ad" "ae" "af" "ag" "ai" "al" "am" "an" "ao" "aq" "ar" "as" "at" "au" "aw" "ax" "az" "ba" "bb" "bd" "be" "bf" "bg" "bh" "bi" "bj" "bm" "bn" "bo" "br" "bs" "bt" "bv" "bw" "by" "bz" "ca" "cc" "cd" "cf" "cg" "ch" "ci" "ck" "cl" "cm" "cn" "co" "cr" "cu" "cv" "cw" "cx" "cy" "cz" "de" "dj" "dk" "dm" "do" "dz" "ec" "ee" "eg" "er" "es" "et" "eu" "fi" "fj" "fk" "fm" "fo" "fr" "ga" "gd" "ge" "gf" "gg" "gh" "gi" "gl" "gm" "gn" "gp" "gq" "gr" "gs" "gt" "gu" "gw" "gy" "hk" "hm" "hn" "hr" "ht" "hu" "id" "ie" "il" "im" "in" "io" "iq" "ir" "is" "it" "je" "jm" "jo" "jp" "ke" "kg" "kh" "ki" "km" "kn" "kp" "kr" "kw" "ky" "kz" "la" "lb" "lc" "li" "lk" "lr" "ls" "lt" "lu" "lv" "ly" "ma" "mc" "md" "me" "mg" "mh" "mk" "ml" "mm" "mn" "mo" "mp" "mq" "mr" "ms" "mt" "mu" "mv" "mw" "mx" "my" "mz" "na" "nc" "ne" "nf" "ng" "ni" "nl" "no" "np" "nr" "nu" "nz" "om" "pa" "pe" "pf" "pg" "ph" "pk" "pl" "pm" "pn" "pr" "ps" "pt" "pw" "py" "qa" "re" "ro" "rs" "ru" "rw" "sa" "sb" "sc" "sd" "se" "sg" "sh" "si" "sj" "sk" "sl" "sm" "sn" "so" "sr" "ss" "st" "su" "sv" "sx" "sy" "sz" "tc" "td" "tf" "tg" "th" "tj" "tk" "tl" "tm" "tn" "to" "tp" "tr" "tt" "tv" "tw" "tz" "ua" "ug" "uk" "us" "uy" "uz" "va" "vc" "ve" "vg" "vi" "vn" "vu" "wf" "ws" "ye" "yt" "za" "zm" "zw")

  # One or more dot-separated labels (1-63 characters, no leading or trailing
  # hyphen) followed by an alphabetic TLD, so names with subdomains such as
  # www.example.com are accepted as well
  local fqdn_regex='^([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$'

  local tld="${input_string##*.}"

  if [[ "$input_string" =~ $fqdn_regex && ${#input_string} -le 255 && " ${valid_tlds[*]} " == *" $tld "* ]]; then
    return 0
  else
    return 1
  fi
}
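
Because the function is called once per line, rebuilding and scanning the TLD list on every call adds up on a list with millions of entries. A possible variation, sketched below, builds an associative array once and does a direct lookup. It needs bash 4 or later; the names VALID_TLDS and has_known_tld are assumptions, and only a handful of TLDs are listed to keep the example short.

```bash
#!/bin/bash

# Build the lookup table once, outside the function.
declare -A VALID_TLDS=()
for known_tld in com org net io it; do
    VALID_TLDS[$known_tld]=1
done

# Hypothetical helper: succeeds when the last label of "$1" is a known TLD.
has_known_tld() {
    local tld="${1##*.}"
    tld="${tld,,}"                  # compare case-insensitively (bash 4+)
    [[ -n "$tld" ]] || return 1     # e.g. "example." has an empty last label
    [[ -n "${VALID_TLDS[$tld]:-}" ]]
}

# Usage: has_known_tld "www.example.com" && echo "known TLD"
```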
fabriziosalmi commented 1 year ago
#!/bin/bash

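echo "Sanitize blacklists 1/5"
# Strip a leading 0.0.0.0 together with the whitespace that follows it.
# This must run before the bare-prefix step, otherwise a leading space is
# left behind and the line is later rejected as an invalid FQDN.
sed -i -E "s/^0\.0\.0\.0[[:space:]]+//" all.fqdn.blacklist

echo "Sanitize blacklists 2/5"
# Strip a leading 0.0.0.0 that has no separator after it
sed -i "s/^0\.0\.0\.0//" all.fqdn.blacklist

echo "Sanitize blacklists 3/5"
# Strip a leading 127.0.0.1 and any whitespace that follows it
sed -i -E "s/^127\.0\.0\.1[[:space:]]*//" all.fqdn.blacklist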

echo "Sanitize blacklists 4/5"
# Define the regular expressions to match IP addresses and IP ranges (IPv4 and IPv6)
ipv4_pattern='([0-9]{1,3}\.){3}[0-9]{1,3}'
ipv6_pattern='(([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){2}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){3}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){4}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,5})|:((:[0-9a-fA-F]{1,4}){1,6})|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))'
# Remove lines with IP addresses and IP ranges from the file
sed -i -E "/^$ipv4_pattern|^$ipv6_pattern/d" "all.fqdn.blacklist"

echo "IP addresses and IP ranges removed from the file: all.fqdn.blacklist"

echo "Sanitize blacklists 5/5"
# Define a function to check if the string is a valid FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'
    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}

# Count the number of valid FQDNs
valid_fqdns=0

# File to sanitize (hard-coded here rather than taken from the command line)
input_file="all.fqdn.blacklist"

# Name of the file to save the valid FQDNs
valid_fqdn_file="all.fqdn.blacklist.tmp"

# Check if the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi

# Read the total lines in the input file for progress tracking
total_lines=$(wc -l < "$input_file")
processed_lines=0

# Start with an empty temporary file so reruns do not append to old results
: > "$valid_fqdn_file"

# Loop through each line of the file and check if it is a valid FQDN
while IFS= read -r line; do
    # Progress report, refreshed every 10000 lines to keep the loop fast;
    # the total_lines check avoids a division by zero
    ((processed_lines++))
    if (( total_lines > 0 && processed_lines % 10000 == 0 )); then
        echo -ne "Progress: $((processed_lines * 100 / total_lines))% \r"
    fi

    # Save the line if it is a valid FQDN; this check also rejects lines
    # containing characters that are not allowed in an FQDN
    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
        echo "$line" >> "$valid_fqdn_file"
    fi
done < "$input_file"

mv "$valid_fqdn_file" "$input_file"
echo "Valid FQDNs extracted, cleaned, and saved in all.fqdn.blacklist"
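
Stripping address prefixes can create duplicates: "0.0.0.0 a.example.com" and a bare "a.example.com" end up as the same line. The script above has no sort step, so a final deduplication pass may be worth adding. A minimal sketch, with the function name dedupe_in_place chosen for this example:

```bash
#!/bin/bash

# Hypothetical final step: sort and deduplicate a list in place.
# LC_ALL=C gives byte-wise ordering that does not depend on the locale;
# sort reads all input before writing, so -o may name the input file.
dedupe_in_place() {
    LC_ALL=C sort -u -o "$1" "$1"
}

# Usage: dedupe_in_place all.fqdn.blacklist
```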
fabriziosalmi commented 1 year ago

[✓] Parsed 2342559 exact domains and 0 ABP-style domains (ignored 0 non-domain entries)