Hi, I want to work on this. Assign it to me.
Done @milinddethe15, let me know if you have any doubts or need guidance.
Hi @edoardottt, what do you mean by 'devel branches'?
Sorry, I'm working on multiple repos. There's only the main branch here; sorry for the mistake.
Hi @edoardottt, there are already some duplicate links in README.md.
[ ERR ] DUPLICATE FOUND!
- [C99.nl](https://api.c99.nl/)
- [HackerTarget](https://hackertarget.com/ip-tools/)
- [IntelligenceX](https://intelx.io/)
- [PhoneBook](https://phonebook.cz/)
- [Rapid7 - DB](https://www.rapid7.com/db/)
- [RocketReach](https://rocketreach.co/)
- [SynapsInt](https://synapsint.com/)
- [Vulmon](https://vulmon.com/)
- [wannabe1337.xyz](https://wannabe1337.xyz/)
We need to fix this before running the script in the workflow.
Thanks @milinddethe15
1. How did you run the script?
If I run the script locally this is what I get:
$> ./scripts/check-dups.sh
[ OK! ] NO DUPLICATES FOUND.
350 links in README.
2. Those are clearly duplicate entries, but they are actually fine: those sites provide multiple services, so it's okay for a single service to appear under e.g. both DNS and domain results.
As an example:
cat README.md | grep Vulmon
- [Vulmon](https://vulmon.com/) - Vulnerability and exploit search engine
- [Vulmon](https://vulmon.com/) - Vulnerability and exploit search engine
There are two entries, but in different categories (one under vulnerabilities, the other under exploits).
3. The best solution would be to check for duplicates within each category; a duplicate within the same category is an error.
Updated script:
#!/bin/bash
readme="README.md"
pwd=$(pwd)
if [[ "${pwd: -7}" == "scripts" ]];
then
readme="../README.md"
fi
# Function to extract links from a section and check for duplicates
check_section() {
    section=$1
    # Print only the lines between the "### $section" heading and the next "### " heading
    section_content=$(awk -v section="$section" '/^### / {p=0} {if(p)print} /^### '"$section"'/ {p=1}' "$readme")
    # Extract the URL of every markdown link and keep only the repeated ones
    duplicate_links=$(echo "$section_content" | grep -oP '\[.*?\]\(\K[^)]+' | sort | uniq -d)
    if [[ -n $duplicate_links ]]; then
        echo "[ ERR ] DUPLICATE LINKS FOUND IN SECTION: $section"
        echo "$duplicate_links"
    else
        echo "[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: $section"
    fi
}
# Get all unique section headings from the README file and handle spaces and slashes
sections=$(grep '^### ' "$readme" | sed 's/^### //' | sed 's/[\/&]/\\&/g')
# Call the function for each section
for section in $sections; do
    check_section "$section"
done
$ ./scripts/check-dups.sh
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: General
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Search
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Engines
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Servers
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Vulnerabilities
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Exploits
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Attack
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Surface
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Code
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Mail
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Addresses
[ ERR ] DUPLICATE LINKS FOUND IN SECTION: Domains
https://spyonweb.com/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: URLs
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: DNS
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Certificates
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: WiFi
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Networks
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Device
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Information
[ ERR ] DUPLICATE LINKS FOUND IN SECTION: Credentials
https://bugmenot.com/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Leaks
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Hidden
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Services
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Social
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Networks
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Phone
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Numbers
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Images
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Threat
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Intelligence
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Web
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: History
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Surveillance
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: cameras
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Unclassified
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Not
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: working
awk: warning: escape sequence `\/' treated as plain `/'
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: \/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Paused
There are duplicate links in some categories. I will fix them. Should I finalise this updated script?
Amazing! Yes, you can create a new issue for deleting duplicates and open a PR removing them.
We should also fix this part:
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Not
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: working
awk: warning: escape sequence `\/' treated as plain `/'
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: \/
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Paused
This should be treated as a single category: Not Working / Paused.
Also this:
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: General
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Search
[ OK! ] NO DUPLICATE LINKS FOUND IN SECTION: Engines
should be treated as a single category: General Search Engines.
IMO the script should always finish, but if duplicates are found it should exit with code 1.
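A minimal sketch of what that could look like, not the final script from the PR: it assumes GNU grep/awk as in the original, that it runs from the repository root, and a found flag named here only for illustration. Headings are read line by line so multi-word names stay intact, each heading is compared as a plain string inside awk instead of being spliced into a regex, and the script exits 1 at the end if any section had duplicates.
#!/bin/bash
# Sketch only: per-section duplicate check that keeps multi-word headings intact
# and exits 1 at the end if any duplicates were found.
readme="README.md"
found=0
# Read headings line by line so "General Search Engines" stays a single category
while IFS= read -r section; do
    # Compare the heading as a plain string inside awk (no regex splicing, no escaping)
    section_content=$(awk -v section="$section" \
        '/^### / { p = ($0 == "### " section) } p && !/^### / { print }' "$readme")
    duplicate_links=$(echo "$section_content" | grep -oP '\[.*?\]\(\K[^)]+' | sort | uniq -d)
    if [[ -n $duplicate_links ]]; then
        echo "[ ERR ] DUPLICATE LINKS FOUND IN SECTION: $section"
        echo "$duplicate_links"
        found=1
    fi
done < <(grep '^### ' "$readme" | sed 's/^### //')
exit $found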
Sorry @edoardottt, bugmenot is not a duplicate. I created a duplicate of bugmenot to test the script and forgot to discard it.
Super, there is only one error to correct :)
Hi @edoardottt, in the previous script I was not able to solve the issue you mentioned in your reply, where multi-word category names should be treated as a single category (please give me your input on this error). So I have updated the script so that if a duplicate link is found, it prints the duplicate link and exits with code 1.
readme="README.md"
pwd=$(pwd)
if [[ "${pwd: -7}" == "scripts" ]];
then
readme="../README.md"
fi
# Function to extract links from a section and check for duplicates
check_section() {
    section=$1
    section_escaped=$(sed 's/[&/\]/\\&/g' <<< "$section")
    section_content=$(awk -v section="$section" '/^### / {p=0} {if(p)print} /^### '"$section"'/ {p=1}' "$readme")
    duplicate_links=$(echo "$section_content" | grep -oP '\[.*?\]\(\K[^)]+' | sort | uniq -d)
    if [[ -n $duplicate_links ]]; then
        echo "[ ERR ] DUPLICATE LINKS FOUND"
        echo "$duplicate_links"
        exit 1
    fi
}
# Get all unique section headings from the README file and handle spaces and slashes
sections=$(grep '^### ' "$readme" | sed 's/^### //' | sed 's/[\/&]/\\&/g')
# Call the function for each section
for section in $sections; do
    check_section "$section"
done
Running this script:
$ ./scripts/check-dups.sh
awk: warning: escape sequence `\/' treated as plain `/'
gives this warning, and I am not able to resolve it.
Please give me your input on which script to use and on this error. Thanks.
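For reference, the warning and the split category names appear to share a cause: the sed in the sections= line escapes "/" as "\/", and the unquoted $sections in the for loop word-splits every heading, so "Not working / Paused" becomes the separate sections Not, working, \/ and Paused; when the lone \/ token is spliced into the awk pattern, gawk prints that escape-sequence warning. A tiny reproduction of the word-splitting, using a hypothetical heading value:
# Sketch: the unquoted expansion splits a multi-word heading into words
sections=$(printf 'Not working \\/ Paused\n')   # what the sed escaping produces
for section in $sections; do
    echo "section: <$section>"
done
# section: <Not>
# section: <working>
# section: <\/>
# section: <Paused>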
Sorry @milinddethe15, open a pull request with the code you are trying to push and we can discuss it better there. It's difficult to discuss this here in the comments.
Completed, thank you so much @milinddethe15 for your contribution!
Thank you @edoardottt ! I learned a lot about bash scripting and github actions in this issue.
For each PR/commit (both on main and devel branches) check if there are duplicate entries using a GitHub action.
Use scripts/check-dups.sh. You can edit it if needed.
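A rough sketch of the step such a workflow could run, assuming it has checked out the repository and runs from its root; this is only an illustration, not the actual action file in the repo:
# Hypothetical CI step body
set -e                          # any non-zero exit status fails the job
chmod +x scripts/check-dups.sh
./scripts/check-dups.sh         # exits 1 when duplicates are found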