Closed fabriziosalmi closed 1 year ago
Let's add progress reporting with `pv` and more iterations to ensure a well-formatted final file with one FQDN per row. We'll modify the script to use multiple `sed` commands for better handling of the cleaning process.
Install `pv` if you haven't already, using your system's package manager (e.g., `apt`, `yum`, `brew`). Then update the script as follows:
#!/bin/bash
# Function to remove comments, extra content, and invalid lines
clean_fqdns() {
    local input_file="$1"
    local temp_file="${input_file}.tmp"
    local cleaned_file="${input_file}.cleaned"
    # Remove comments (lines starting with #), showing progress with pv
    pv "$input_file" | sed '/^\s*#/d' > "$temp_file"
    # Remove extra content after the FQDN in each line
    pv "$temp_file" | sed 's/\([^[:space:]]*\).*/\1/' > "$cleaned_file"
    # Keep only lines made up entirely of valid FQDN characters
    pv "$cleaned_file" | grep -E '^[a-zA-Z0-9.-]+$' > "${input_file}.formatted"
    # Clean up temporary files
    rm "$temp_file" "$cleaned_file"
}
# Test the function with your file "input_file.txt"
clean_fqdns "input_file.txt"
echo "File cleaned and formatted as 'input_file.txt.formatted'"
Make the script executable with `chmod +x clean_fqdns.sh` and run it with `./clean_fqdns.sh`.
Now the script will display progress using `pv` during each step of the cleaning process, and it will save the final well-formatted file with one FQDN per row as `input_file.txt.formatted`. The intermediate temporary files will be cleaned up automatically.
With this approach, the script makes three passes over the data, each with its own progress bar, and should produce the desired clean output.
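The three cleaning steps can also be folded into a single pass, so `pv` only needs to report progress once. The sketch below is an assumption about how that could look, not part of the original script: `clean_fqdn_stream` is a hypothetical helper that reads raw lines on stdin and writes one FQDN per line on stdout. Unlike the `sed` version, it takes the first whitespace-separated field, so lines with leading whitespace are kept rather than dropped.

```shell
#!/bin/bash
# Hypothetical helper: reads raw blocklist lines on stdin and
# writes one FQDN per line on stdout.
clean_fqdn_stream() {
    awk '
        /^[[:space:]]*#/ { next }             # skip comment lines
        NF == 0          { next }             # skip blank lines
        $1 ~ /^[A-Za-z0-9.-]+$/ { print $1 }  # keep the first field if it uses only FQDN characters
    '
}

# Possible usage with a progress bar (assumes pv is installed):
#   pv input_file.txt | clean_fqdn_stream > input_file.txt.formatted
```

Because the helper is a plain stdin-to-stdout filter, no temporary files are needed.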
#!/bin/bash
# Function to check whether a string contains only characters valid in an FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'
    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}
# Counter for the number of valid FQDNs
valid_fqdns=0
# Check the file given as the first argument of the script
input_file="$1"
# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi
# Start counting the lines and the number of valid FQDNs
total_lines=0
# Loop over each line of the file and check whether it is a valid FQDN
while IFS= read -r line; do
    ((total_lines++))
    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
    fi
done < "$input_file"
# Compute the percentage of valid FQDNs
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi
# Print the number and the percentage of valid FQDNs
echo "Total number of lines: $total_lines"
echo "Number of valid FQDNs: $valid_fqdns"
echo "Percentage of valid FQDNs: $percentage%"
#!/bin/bash
# Function to check whether a string contains only characters valid in an FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'
    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}
# Counter for the number of valid FQDNs
valid_fqdns=0
# Check the file given as the first argument of the script
input_file="$1"
# Name of the file where the valid FQDNs are saved
valid_fqdn_file="valid_fqdns.txt"
# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi
# Start from an empty output file so repeated runs do not append duplicates
: > "$valid_fqdn_file"
# Start counting the lines and the number of valid FQDNs
total_lines=0
# Loop over each line of the file and check whether it is a valid FQDN
while IFS= read -r line; do
    ((total_lines++))
    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
        echo "$line" >> "$valid_fqdn_file"
    fi
done < "$input_file"
# Compute the percentage of valid FQDNs
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi
# Print the number and the percentage of valid FQDNs
echo "Total number of lines: $total_lines"
echo "Number of valid FQDNs: $valid_fqdns"
echo "Percentage of valid FQDNs: $percentage%"
echo "Valid FQDNs saved in: $valid_fqdn_file"
#!/bin/bash
# Function to check whether a string contains only characters valid in an FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'
    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}
# Counter for the number of valid FQDNs
valid_fqdns=0
# Check the file given as the first argument of the script
input_file="$1"
# Name of the file where the valid FQDNs are saved
valid_fqdn_file="valid_fqdns.txt"
# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi
# Count only candidate lines (comments and blank lines are skipped below)
total_lines=0
# Lines that are blank or start with a comment or separator character
skip_regex='^[[:space:]]*([#!|].*|$)'
# Loop over each line while pv shows progress. Process substitution keeps the
# loop in the current shell, so the counters are still set after "done".
while IFS= read -r line; do
    if [[ $line =~ $skip_regex ]]; then
        continue # Skip this line
    fi
    ((total_lines++))
    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
        echo "$line" >> "$valid_fqdn_file"
    fi
done < <(pv "$input_file")
# Compute the percentage of valid FQDNs among the candidate lines
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi
# Print the number and the percentage of valid FQDNs
echo "Total number of lines with valid FQDNs: $valid_fqdns"
echo "Percentage of lines with valid FQDNs: $percentage%"
echo "Valid FQDNs saved in: $valid_fqdn_file"
#!/bin/bash
# Function to check whether a string contains only characters valid in an FQDN
is_valid_fqdn() {
    local input_string="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'
    if [[ $input_string =~ $fqdn_regex ]]; then
        return 0
    else
        return 1
    fi
}
# Counter for the number of valid FQDNs
valid_fqdns=0
# Check the file given as the first argument of the script
input_file="$1"
# Name of the file where the valid FQDNs are saved
valid_fqdn_file="valid_fqdns.txt"
# Check that the file exists
if [ ! -f "$input_file" ]; then
    echo "Error: File not found."
    exit 1
fi
# Total number of lines in the input file, used for the percentage
total_lines=$(wc -l < "$input_file")
# Loop over each line while pv shows progress. Process substitution keeps the
# loop in the current shell, so valid_fqdns is still set after "done".
while IFS= read -r line; do
    # Skip lines containing characters that are not valid in an FQDN
    if [[ "$line" =~ [^a-zA-Z0-9.-] ]]; then
        continue # Skip this line
    fi
    if is_valid_fqdn "$line"; then
        ((valid_fqdns++))
        echo "$line" >> "$valid_fqdn_file"
    fi
done < <(pv "$input_file")
# Compute the percentage of valid FQDNs
if [ "$total_lines" -gt 0 ]; then
    percentage=$((valid_fqdns * 100 / total_lines))
else
    percentage=0
fi
# Print the number and the percentage of valid FQDNs
echo "Total number of lines with valid FQDNs: $valid_fqdns"
echo "Percentage of lines with valid FQDNs: $percentage%"
echo "Valid FQDNs saved in: $valid_fqdn_file"
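A note on feeding a `while read` loop from a progress tool such as `pv`: in bash, each command of a pipeline runs in a subshell, so counters incremented inside a loop that sits at the end of a pipe are lost once the loop ends. A minimal sketch of one way around this, using process substitution, is shown below. The function name `count_valid_fqdns` is hypothetical, and `cat` stands in for `pv` so the example runs without extra packages.

```shell
#!/bin/bash
# Hypothetical helper: prints "<valid> <total>" for the lines of file "$1".
count_valid_fqdns() {
    local input_file="$1"
    local fqdn_regex='^[a-zA-Z0-9.-]+$'
    local total=0 valid=0 line

    # Process substitution keeps the loop in the current shell, so the
    # counters are still set after "done". Swap "cat" for "pv" to get a
    # progress bar.
    while IFS= read -r line; do
        total=$((total + 1))
        if [[ $line =~ $fqdn_regex ]]; then
            valid=$((valid + 1))
        fi
    done < <(cat "$input_file")

    echo "$valid $total"
}
```

Called as `count_valid_fqdns all.fqdn.blacklist`, it prints the two counts on one line, which can then be used for the percentage calculation.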
Only IP addresses and/or IP ranges (IPv4/IPv6) are left to remove from the aggregated file.
Certainly! We can further optimize the script to handle weird characters and cases by using more robust techniques for file handling and string manipulations. Additionally, we can improve the regular expressions used for FQDN validation. Here's the further optimized version:
#!/bin/bash
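# Install pv only if it is missing (assumes a Debian/Ubuntu system with apt-get)
if ! command -v pv > /dev/null && ! apt-get install -y pv; then
    echo "Error: Failed to install 'pv' package."
    exit 1
fi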
LISTS="blacklists.fqdn.urls"
download_url() {
local url="$1"
random_filename=$(uuidgen | tr -dc '[:alnum:]')
if ! wget -q --progress=bar:force -O "$random_filename.fqdn.list" "$url"; then
echo "Failed to download: $url"
fi
}
echo "Download blacklists"
while IFS= read -r url; do
download_url "$url"
done < "$LISTS"
echo "Aggregate blacklists"
cat *.fqdn.list > aggregated.fqdn.list
sort -u -o all.fqdn.blacklist aggregated.fqdn.list
rm *.fqdn.list
echo "Sanitize blacklists"
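# Strip the leading sinkhole address together with the whitespace after it,
# otherwise the leftover space makes every hosts-style entry fail validation
sed -i -E 's/^(0\.0\.0\.0|127\.0\.0\.1)[[:space:]]+//' all.fqdn.blacklist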
is_valid_fqdn() {
local input_string="$1"
local fqdn_regex='^[a-zA-Z0-9.-]+$'
if [[ "$input_string" =~ $fqdn_regex && "${input_string:0:1}" != "-" && "${input_string: -1}" != "-" ]]; then
return 0
else
return 1
fi
}
input_file="all.fqdn.blacklist"
valid_fqdn_file="all.fqdn.blacklist.tmp"
if [ ! -f "$input_file" ]; then
echo "Error: File $input_file not found."
exit 1
fi
total_lines=$(wc -l < "$input_file")
valid_fqdns=0
# Start from an empty output file so the later mv always has something to move
: > "$valid_fqdn_file"
# Feed the loop through process substitution so the counter survives the loop
while IFS= read -r line; do
    if ! is_valid_fqdn "$line"; then
        continue
    fi
    ((valid_fqdns++))
    echo "$line" >> "$valid_fqdn_file"
done < <(pv "$input_file")
mv "$valid_fqdn_file" "$input_file"
tar -czf all.fqdn.blacklist.tar.gz "$input_file"
echo "Total domains: $total_lines, $valid_fqdns appear to be valid"
echo "Blacklist plain: https://github.com/fabriziosalmi/blacklists/blob/main/all.fqdn.blacklist"
echo "Blacklist compressed: https://github.com/fabriziosalmi/blacklists/blob/main/all.fqdn.blacklist.tar.gz"
In this version, we've made the following optimizations:
- Updated the `sed` command to use the `-E` option for extended regular expressions, which simplifies the pattern for stripping the leading `0.0.0.0` and `127.0.0.1` entries from each line.
- Simplified the `while` loop for FQDN validation, as it is now handled in the `is_valid_fqdn` function.

These optimizations should make the script more robust in handling various cases and weird characters in the domain names.
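To see what the `-E` alternation does on hosts-style input, here is a small sketch. It assumes the blocklist lines look like `0.0.0.0 domain` or `127.0.0.1 domain`; the function name `strip_sinkhole_prefix` is hypothetical. It also removes the whitespace after the address, so the remaining text is a bare domain.

```shell
#!/bin/bash
# Hypothetical helper: strips a leading 0.0.0.0 or 127.0.0.1 plus the
# whitespace that follows it. Reads stdin, writes stdout.
strip_sinkhole_prefix() {
    sed -E 's/^(0\.0\.0\.0|127\.0\.0\.1)[[:space:]]+//'
}

# Possible usage:
#   strip_sinkhole_prefix < aggregated.fqdn.list | sort -u > all.fqdn.blacklist
```

Lines that do not start with one of the two addresses pass through unchanged.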
is_valid_fqdn improved
is_valid_fqdn() {
local input_string="$1"
local valid_tlds=("com" "org" "net" "edu" "gov" "mil" "int" "info" "biz" "name" "museum" "coop" "aero" "arpa" "asia" "cat" "jobs" "mobi" "post" "tel" "travel" "xxx" "ac" "ad" "ae" "af" "ag" "ai" "al" "am" "an" "ao" "aq" "ar" "as" "at" "au" "aw" "ax" "az" "ba" "bb" "bd" "be" "bf" "bg" "bh" "bi" "bj" "bm" "bn" "bo" "br" "bs" "bt" "bv" "bw" "by" "bz" "ca" "cc" "cd" "cf" "cg" "ch" "ci" "ck" "cl" "cm" "cn" "co" "cr" "cu" "cv" "cw" "cx" "cy" "cz" "de" "dj" "dk" "dm" "do" "dz" "ec" "ee" "eg" "er" "es" "et" "eu" "fi" "fj" "fk" "fm" "fo" "fr" "ga" "gd" "ge" "gf" "gg" "gh" "gi" "gl" "gm" "gn" "gp" "gq" "gr" "gs" "gt" "gu" "gw" "gy" "hk" "hm" "hn" "hr" "ht" "hu" "id" "ie" "il" "im" "in" "io" "iq" "ir" "is" "it" "je" "jm" "jo" "jp" "ke" "kg" "kh" "ki" "km" "kn" "kp" "kr" "kw" "ky" "kz" "la" "lb" "lc" "li" "lk" "lr" "ls" "lt" "lu" "lv" "ly" "ma" "mc" "md" "me" "mg" "mh" "mk" "ml" "mm" "mn" "mo" "mp" "mq" "mr" "ms" "mt" "mu" "mv" "mw" "mx" "my" "mz" "na" "nc" "ne" "nf" "ng" "ni" "nl" "no" "np" "nr" "nu" "nz" "om" "pa" "pe" "pf" "pg" "ph" "pk" "pl" "pm" "pn" "pr" "ps" "pt" "pw" "py" "qa" "re" "ro" "rs" "ru" "rw" "sa" "sb" "sc" "sd" "se" "sg" "sh" "si" "sj" "sk" "sl" "sm" "sn" "so" "sr" "ss" "st" "su" "sv" "sx" "sy" "sz" "tc" "td" "tf" "tg" "th" "tj" "tk" "tl" "tm" "tn" "to" "tp" "tr" "tt" "tv" "tw" "tz" "ua" "ug" "uk" "us" "uy" "uz" "va" "vc" "ve" "vg" "vi" "vn" "vu" "wf" "ws" "ye" "yt" "za" "zm" "zw")
# One or more dot-separated labels (1-63 characters, no leading or trailing
# hyphen) followed by an alphabetic TLD, so names with subdomains also match
local fqdn_regex='^([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$'
local tld="${input_string##*.}"
if [[ "$input_string" =~ $fqdn_regex && ${#input_string} -le 255 && " ${valid_tlds[*]} " == *" $tld "* ]]; then
return 0
else
return 1
fi
}
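A hard-coded TLD list goes stale as new TLDs are delegated, so a syntax-only check may be a useful complement. The sketch below is an assumption about how such a check could look, not the function used in this thread: `is_fqdn_syntax_ok` is a hypothetical name, and it validates each label separately (1 to 63 characters, letters, digits and hyphens, no leading or trailing hyphen) with a total length of at most 253 characters.

```shell
#!/bin/bash
# Hypothetical alternative: syntax-only check, with no TLD allow-list.
is_fqdn_syntax_ok() {
    local name="${1%.}"   # tolerate one trailing dot (root label)
    local label_regex='^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$'
    local -a labels
    local label

    # Total length limit for a textual domain name
    (( ${#name} >= 1 && ${#name} <= 253 )) || return 1

    # Reject empty labels such as "a..b", a leading dot, or a second trailing dot
    [[ $name == *..* || $name == .* || $name == *. ]] && return 1

    # Split the name into labels on dots
    IFS='.' read -r -a labels <<< "$name"
    (( ${#labels[@]} >= 2 )) || return 1   # require at least "label.tld"

    for label in "${labels[@]}"; do
        [[ $label =~ $label_regex ]] || return 1
    done

    # An all-digit last label would make the name look like an IPv4 address
    [[ ${labels[${#labels[@]}-1]} =~ ^[0-9]+$ ]] && return 1
    return 0
}
```

For example, `is_fqdn_syntax_ok "sub.example.co.uk"` succeeds, while `is_fqdn_syntax_ok "192.168.1.1"` and `is_fqdn_syntax_ok "-bad.example.com"` fail.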
#!/bin/bash
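echo "Sanitize blacklists 1/5"
# Strip a leading 0.0.0.0 together with the whitespace that follows it
sed -i -E 's/^0\.0\.0\.0[[:space:]]+//' all.fqdn.blacklist
echo "Sanitize blacklists 2/5"
# Strip a bare leading 0.0.0.0 that has no separator after it
sed -i 's/^0\.0\.0\.0//' all.fqdn.blacklist
echo "Sanitize blacklists 3/5"
# Strip a leading 127.0.0.1 and any whitespace that follows it
sed -i -E 's/^127\.0\.0\.1[[:space:]]*//' all.fqdn.blacklist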
echo "Sanitize blacklists 4/5"
# Define the regular expressions to match IP addresses and IP ranges (IPv4 and IPv6)
ipv4_pattern='([0-9]{1,3}\.){3}[0-9]{1,3}'
ipv6_pattern='(([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){2}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){3}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){4}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,5})|:((:[0-9a-fA-F]{1,4}){1,6})|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))'
# Remove lines with IP addresses and IP ranges from the file
sed -i -E "/^$ipv4_pattern|^$ipv6_pattern/d" "all.fqdn.blacklist"
echo "IP addresses and IP ranges removed from the file: all.fqdn.blacklist"
echo "Sanitize blacklists 5/5"
# Define a function to check if the string is a valid FQDN
is_valid_fqdn() {
local input_string="$1"
local fqdn_regex='^[a-zA-Z0-9.-]+$'
if [[ $input_string =~ $fqdn_regex ]]; then
return 0
else
return 1
fi
}
# Count the number of valid FQDNs
valid_fqdns=0
# Test the file specified as the first argument to the script
input_file="all.fqdn.blacklist"
# Name of the file to save the valid FQDNs
valid_fqdn_file="all.fqdn.blacklist.tmp"
# Check if the file exists
if [ ! -f "$input_file" ]; then
echo "Error: File not found."
exit 1
fi
# Read the total lines in the input file for progress tracking
total_lines=$(wc -l < "$input_file")
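processed_lines=0
# Start from an empty output file so a stale .tmp is not appended to
# and the final mv always has a file to move
: > "$valid_fqdn_file"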
# Loop through each line of the file and check if it is a valid FQDN
while IFS= read -r line; do
# Progress report
((processed_lines++))
echo -ne "Progress: $((processed_lines * 100 / total_lines))% \r"
# Check if the line contains invalid characters for an FQDN
if [[ "$line" =~ [^a-zA-Z0-9.-] ]]; then
continue # Ignore this line
fi
# Check if the line is a valid FQDN and save it to the valid FQDN file
if is_valid_fqdn "$line"; then
((valid_fqdns++))
echo "$line" >> "$valid_fqdn_file"
fi
done < "$input_file"
mv "$valid_fqdn_file" "$input_file"
echo "Valid FQDNs extracted, cleaned, and saved in all.fqdn.blacklist"
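The IP-removal step in the script above deletes any line that merely starts with something that looks like an IP address, which would also remove a name such as `1.2.3.4.example.com`. A stricter variant, sketched below under the assumption that IP entries sit alone on their line (optionally with a CIDR suffix), anchors the match to the whole line. The helper name `drop_ip_lines` and the deliberately loose IPv6 pattern are assumptions. Since a domain name can never contain a colon, any line made only of hex digits and colons can be dropped safely.

```shell
#!/bin/bash
# Hypothetical helper: drops lines that consist solely of an IPv4 or IPv6
# address, optionally followed by a CIDR suffix. Reads stdin, writes stdout.
drop_ip_lines() {
    local ipv4='([0-9]{1,3}\.){3}[0-9]{1,3}(/[0-9]{1,2})?'
    # Loose IPv6 match: hex digits and colons with at least one colon
    local ipv6='[0-9a-fA-F:]*:[0-9a-fA-F:.]*(/[0-9]{1,3})?'
    grep -Ev "^[[:space:]]*(${ipv4}|${ipv6})[[:space:]]*$"
}

# Possible usage:
#   drop_ip_lines < all.fqdn.blacklist > all.fqdn.blacklist.tmp
```

Note that `grep -v` exits with status 1 when every input line is filtered out, which matters if the script runs under `set -e`.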
[✓] Parsed 2342559 exact domains and 0 ABP-style domains (ignored 0 non-domain entries)
To clean up the file as per your requirements, you can use the following bash script. This script will:
- Remove comments (lines starting with `#`).

Save the script in a file (e.g., `clean_fqdns.sh`) and make it executable with `chmod +x clean_fqdns.sh`. Place the script in the same directory as your input file (e.g., `input_file.txt`), and then run the script with `./clean_fqdns.sh`.

The script will clean up the file and create a new file called `input_file.txt.cleaned`, which will contain only one FQDN per row without any comments or extra content.