Zarcolio / uniqurl

Use uniqurl to filter only unique content from a list of URLs with stdin, making it usable within piped commands
GNU General Public License v2.0

Remove redundant urls #2

Closed adon90 closed 4 years ago

adon90 commented 4 years ago

Hello, let's say I have these two urls:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111

The unique content here would be http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111 but, for the moment, it keeps them both.

Imagine you run dalfox or another tool afterwards: you don't need http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view, because that would only retest the parameter "fuseaction".

Another case would be:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&productId=123
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111

In that case you would need to keep them both.
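
The filtering requested here can be sketched roughly as follows. This is a minimal illustration of the idea described above, not how uniqurl works: the function names `param_names` and `drop_param_subsets` are hypothetical, and the sketch assumes that two URLs refer to the same resource only when scheme, host and path match exactly (case-sensitive).

```python
from urllib.parse import urlsplit, parse_qsl


def param_names(url):
    """Return (resource, frozenset of query parameter names) for a URL."""
    parts = urlsplit(url)
    names = frozenset(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    return (parts.scheme, parts.netloc, parts.path), names


def drop_param_subsets(urls):
    """Drop URLs whose parameter names are a strict subset of another URL's
    parameter names on the same resource. Input order is preserved."""
    parsed = [(url, *param_names(url)) for url in urls]
    kept = []
    for url, resource, names in parsed:
        redundant = any(
            other_resource == resource and names < other_names
            for _, other_resource, other_names in parsed
        )
        if not redundant:
            kept.append(url)
    return kept
```

With the first pair of URLs above, only the `fuseaction=view&docID=111` URL would remain; with the second pair, both would be kept because neither parameter set contains the other. URLs with identical parameter names but different values are all kept in this sketch.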

Another thing: I have this input list:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10

I run cat list.txt | uniqurl

And I got this:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS

The line http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2 got deleted. I have lost the parameter "prod" in this case, because no other URL for that resource contains this parameter.
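
One way to check for this kind of loss is to compare the parameter names per resource before and after filtering. The sketch below is only an illustration: `params_by_resource` and `lost_params` are hypothetical helper names, and a "resource" is assumed to be the host plus path.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def params_by_resource(urls):
    """Map each (host, path) pair to the set of parameter names seen on it."""
    seen = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        for name, _ in parse_qsl(parts.query, keep_blank_values=True):
            seen[(parts.netloc, parts.path)].add(name)
    return seen


def lost_params(before, after):
    """Return {resource: parameter names present in `before` but not `after`}."""
    kept = params_by_resource(after)
    lost = {}
    for resource, names in params_by_resource(before).items():
        missing = names - kept.get(resource, set())
        if missing:
            lost[resource] = missing
    return lost
```

Called with the original list as `before` and the filtered list as `after`, it reports each resource whose parameter names no longer appear in any remaining URL.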

Regards!

Zarcolio commented 4 years ago

Hi adon90,

Thanks for using uniqurl and giving feedback about it :)

> http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view
> http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
>
> The unique content here would be http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111

This script leaves out URLs with duplicate content. In this example, both have different content: "This is not an valid article" & "39790: Illegal Characters on Various Operating Systems". So both URLs are returned by the script.

> The line http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2 got deleted. I have lost the parameter "prod" in this case, because no other URL for that resource contains this parameter.

The script keeps the shortest URL that is provided, because it's more likely that long URLs have useless parameters in the GET request, especially if the URLs come from public resources like waybackurls. It would be very hard, if not impossible, to check which parameters have a further impact on the usage of the website.
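
The behaviour described in this reply (one URL per distinct page content, preferring the shortest URL) can be sketched as below. This is not uniqurl's actual code, which is not shown in this thread. `unique_by_content` is a hypothetical name, and `fetch` is an assumed caller-supplied function that returns the response body for a URL as bytes, so the sketch makes no network calls itself.

```python
import hashlib


def unique_by_content(urls, fetch):
    """Keep one URL per distinct response body, preferring the shortest URL.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that maps a URL to its response body as bytes.
    """
    shortest = {}
    for url in urls:
        # Identical bodies hash to the same digest and count as duplicates.
        digest = hashlib.sha256(fetch(url)).hexdigest()
        current = shortest.get(digest)
        if current is None or len(url) < len(current):
            shortest[digest] = url
    # Sorted only to make the output order stable.
    return sorted(shortest.values())
```

When two URLs with the same content have the same length, the one seen first is kept. Comparing exact bytes is a simplification: pages that embed timestamps or tokens would never compare as equal under this approach.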

I hope this answers your question. If not, help me understand ;)

Zarcolio commented 4 years ago

Haven't heard from you in a while so I'll be closing this issue. Let me know if you have further questions/issues :)