Hi, I want to work on this. Assign it to me.
Done @milinddethe15, let me know if you have any doubts or need guidance.
Hi @edoardottt, what do you mean by 'devel branches'?
Sorry, I'm working on multiple repos. There's only the main branch here; sorry for the mistake.
Hi @edoardottt, there are already some duplicate links in README.md.
[ ERR ] DUPLICATE FOUND!
- [C99.nl](https://api.c99.nl/)
- [HackerTarget](https://hackertarget.com/ip-tools/)
- [IntelligenceX](https://intelx.io/)
- [PhoneBook](https://phonebook.cz/)
- [Rapid7 - DB](https://www.rapid7.com/db/)
- [RocketReach](https://rocketreach.co/)
- [SynapsInt](https://synapsint.com/)
- [Vulmon](https://vulmon.com/)
- [wannabe1337.xyz](https://wannabe1337.xyz/)
We need to fix this before running the script in the workflow.
Thanks @milinddethe15
1. How did you run the script?
If I run the script locally this is what I get:
$> ./scripts/check-dups.sh
[ OK! ] NO DUPLICATES FOUND.
350 links in README.
2. Those are clearly duplicate entries, but they are actually fine: those sites provide multiple services, so it's okay for a single service to appear under e.g. both DNS and domain results.
As an example:
cat README.md | grep Vulmon
- [Vulmon](https://vulmon.com/) - Vulnerability and exploit search engine
- [Vulmon](https://vulmon.com/) - Vulnerability and exploit search engine
There are two entries, but in different categories (one under vulnerabilities, the other under exploits).
3. The best solution would be to check for duplicates within each category; a duplicate within the same category is an error.
Updated script:
#!/bin/bash
readme="README.md"
pwd=$(pwd)
if [[ "${pwd: -7}" == "scripts" ]];
then
readme="../README.md"
fi
# Function to extract links from a section and check for duplicates
check_section() {
    section=$1
    # Print only the lines between the "### $section" heading and the next "### " heading
    section_content=$(awk -v section="$section" '/^### / {p=0} {if(p)print} /^### '"$section"'/ {p=1}' "$readme")
    # Extract the URL of every markdown link and keep only the repeated ones
    duplicate_links=$(echo "$section_content" | grep -oP '\[.*?\]\(\K[^)]+' | sort | uniq -d)
    if [[ -n $duplicate_links ]]; then
        echo "[ ERR ] DUPLICATE LINKS FOUND IN SECTION: $section"
        echo "$duplicate_links"
    else
        echo "[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: $section"
    fi
}
# Get all unique section headings from the README file and handle spaces and slashes
sections=$(grep '^### ' "$readme" | sed 's/^### //' | sed 's/[\/&]/\\&/g')
# Call the function for each section
for section in $sections; do
    check_section "$section"
done
$ ./scripts/check-dups.sh
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: General
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Search
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Engines
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Servers
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Vulnerabilities
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Exploits
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Attack
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Surface
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Code
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Mail
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Addresses
[ ERR ] DUPLICATE LINKS FOUND IN SECTION: Domains
https://spyonweb.com/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: URLs
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: DNS
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Certificates
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: WiFi
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Networks
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Device
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Information
[ ERR ] DUPLICATE LINKS FOUND IN SECTION: Credentials
https://bugmenot.com/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Leaks
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Hidden
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Services
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Social
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Networks
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Phone
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Numbers
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Images
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Threat
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Intelligence
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Web
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: History
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Surveillance
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: cameras
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Unclassified
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Not
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: working
awk: warning: escape sequence `\/' treated as plain `/'
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: \/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Paused
There are duplicate links in some categories. I will fix them. Should I finalise this updated script?
Amazing! Yes, you can create a new issue for deleting duplicates and open a PR removing them.
We should also fix this part:
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Not
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: working
awk: warning: escape sequence `\/' treated as plain `/'
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: \/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Paused
This should be treated as a single category: Not Working / Paused.
Also this:
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: General
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Search
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Engines
should be treated as a single category: General Search Engines.
IMO the script should always finish, but if duplicates are found it should exit with code 1.
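A minimal sketch of what that could look like, not the final script from the PR: it assumes GNU grep/awk as in the original, that it runs from the repository root, and a found flag named here only for illustration. Headings are read line by line so multi-word names stay intact, each heading is compared as a plain string inside awk instead of being spliced into a regex, and the script exits 1 at the end if any section had duplicates.
#!/bin/bash
# Sketch only: per-section duplicate check that keeps multi-word headings intact
# and exits 1 at the end if any duplicates were found.
readme="README.md"
found=0
# Read headings line by line so "General Search Engines" stays a single category
while IFS= read -r section; do
    # Compare the heading as a plain string inside awk (no regex splicing, no escaping)
    section_content=$(awk -v section="$section" \
        '/^### / { p = ($0 == "### " section) } p && !/^### / { print }' "$readme")
    duplicate_links=$(echo "$section_content" | grep -oP '\[.*?\]\(\K[^)]+' | sort | uniq -d)
    if [[ -n $duplicate_links ]]; then
        echo "[ ERR ] DUPLICATE LINKS FOUND IN SECTION: $section"
        echo "$duplicate_links"
        found=1
    fi
done < <(grep '^### ' "$readme" | sed 's/^### //')
exit $found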
Sorry @edoardottt, bugmenot is not a duplicate. I created a duplicate of bugmenot to test the script and forgot to discard it.
Super, there is only one error to correct :)
Hi @edoardottt, in the previous script I was not able to solve the issue you mentioned in your reply, where multi-word category names should be treated as a single category (please give me your input on this error). So I have updated the script so that if a duplicate link is found, it prints the duplicate link and exits with code 1.
readme="README.md"
pwd=$(pwd)
if [[ "${pwd: -7}" == "scripts" ]];
then
readme="../README.md"
fi
# Function to extract links from a section and check for duplicates
check_section() {
    section=$1
    section_escaped=$(sed 's/[&/\]/\\&/g' <<< "$section")
    section_content=$(awk -v section="$section" '/^### / {p=0} {if(p)print} /^### '"$section"'/ {p=1}' "$readme")
    duplicate_links=$(echo "$section_content" | grep -oP '\[.*?\]\(\K[^)]+' | sort | uniq -d)
    if [[ -n $duplicate_links ]]; then
        echo "[ ERR ] DUPLICATE LINKS FOUND"
        echo "$duplicate_links"
        exit 1
    fi
}
# Get all unique section headings from the README file and handle spaces and slashes
sections=$(grep '^### ' "$readme" | sed 's/^### //' | sed 's/[\/&]/\\&/g')
# Call the function for each section
for section in $sections; do
    check_section "$section"
done
Running this script:
$ ./scripts/check-dups.sh
awk: warning: escape sequence `\/' treated as plain `/'
gives this warning, and I am not able to resolve it.
Please give me your input on which script to use and on this error. Thanks.
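For reference, the warning and the split category names appear to share a cause: the sed in the sections= line escapes "/" as "\/", and the unquoted $sections in the for loop word-splits every heading, so "Not working / Paused" becomes the separate sections Not, working, \/ and Paused; when the lone \/ token is spliced into the awk pattern, gawk prints that escape-sequence warning. A tiny reproduction of the word-splitting, using a hypothetical heading value:
# Sketch: the unquoted expansion splits a multi-word heading into words
sections=$(printf 'Not working \\/ Paused\n')   # what the sed escaping produces
for section in $sections; do
    echo "section: <$section>"
done
# section: <Not>
# section: <working>
# section: <\/>
# section: <Paused>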
Sorry @milinddethe15, open a pull request with the code you are trying to push and we can discuss it better there. It's difficult to discuss this here in the comments.
Completed, thank you so much @milinddethe15 for your contribution!
Thank you @edoardottt ! I learned a lot about bash scripting and github actions in this issue.
For each PR/commit (both on main and devel branches) check if there are duplicate entries using a GitHub action.
Use scripts/check-dups.sh. You can edit it if needed.
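A rough sketch of the step such a workflow could run, assuming it has checked out the repository and runs from its root; this is only an illustration, not the actual action file in the repo:
# Hypothetical CI step body
set -e                          # any non-zero exit status fails the job
chmod +x scripts/check-dups.sh
./scripts/check-dups.sh         # exits 1 when duplicates are found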