edoardottt / cariddi

Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
https://edoardoottavianelli.it
GNU General Public License v3.0

Add JSON functionality and console reporting of secrets, errors #103

Closed: ocervell closed this issue 1 year ago

ocervell commented 1 year ago

Would be great if the tool could take a -json option to output JSON like similar tools (katana, gospider, gau).

It could output JSON Lines that way and have a 'type' key for secrets and regex matches etc... instead of putting everything in a folder.

Example output:

{"source":"href","type":"url","output":"https://example.com/path/","status":403,"length":140}
{"source":"body","type":"url","output":"https://example.com/path2/","status":403,"length":140}
{"source":"https://example.com/path3.js?v=1652869476","type":"secretfinder","output":"<AWS_SECRET_KEY>","status":0,"length":0}

Thoughts ?
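The record shape proposed above can be sketched as a Go struct; this is a hypothetical sketch based only on the example lines, not cariddi's actual code, and the field set is taken directly from the sample output:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors the proposed JSON Lines record. Field names and
// tags are taken from the example output above (hypothetical).
type Result struct {
	Source string `json:"source"` // where the match was found (href, body, a JS URL, ...)
	Type   string `json:"type"`   // url, secretfinder, ...
	Output string `json:"output"` // the matched URL or secret
	Status int    `json:"status"` // HTTP status (0 when not applicable)
	Length int    `json:"length"` // response length (0 when not applicable)
}

// jsonLine marshals one result as a single JSON line.
func jsonLine(r Result) string {
	b, _ := json.Marshal(r)
	return string(b)
}

func main() {
	fmt.Println(jsonLine(Result{
		Source: "href",
		Type:   "url",
		Output: "https://example.com/path/",
		Status: 403,
		Length: 140,
	}))
}
```

One struct per finding keeps every line independently parseable, which is what makes the stream consumable by line-oriented tools.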

edoardottt commented 1 year ago

Hi @ocervell! Thanks for your interest in the project.
Of course I can add JSON output to cariddi, it's not rocket science and I think it can be useful too.
Unfortunately I'm really busy right now with my thesis, work and other projects, so I don't know when I will be able to push this new feature. Anyway, thanks for pointing this out.

Anyone reading this: if you want to develop this feature, just create a PR

Have a nice day

0xt3j4s commented 1 year ago

Is this issue still open? If yes, I'd like to work on it.

edoardottt commented 1 year ago

Yes @Tezas-6174, it's open :)

I suggest continuing the work started in https://github.com/edoardottt/cariddi/pull/106. Maybe it would be better to add the flag -oj instead of -json, just to be consistent with the previous output methods:

-oh string
        Write the output into an HTML file.
-ot string
        Write the output into a TXT file.

The flag should take a string as input and output results like -oh and -ot do (separate files for results, secrets, etc.)

Ask me if you have any doubts
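A minimal sketch of how the proposed flag could sit next to -oh and -ot, using Go's stdlib flag package; the names and descriptions follow the usage text above, but cariddi's actual flag wiring may differ:

```go
package main

import (
	"flag"
	"fmt"
)

// newFlags builds a flag set with the existing output flags plus the
// proposed -oj. This is an illustrative sketch, not cariddi's code.
func newFlags() (*flag.FlagSet, *string) {
	fs := flag.NewFlagSet("cariddi", flag.ContinueOnError)
	// Existing output flags, reproduced from the usage text above.
	fs.String("oh", "", "Write the output into an HTML file.")
	fs.String("ot", "", "Write the output into a TXT file.")
	// Proposed addition, consistent with the -o* naming convention.
	oj := fs.String("oj", "", "Write the output into a JSON file.")
	return fs, oj
}

func main() {
	fs, oj := newFlags()
	_ = fs.Parse([]string{"-oj", "output.json"})
	fmt.Println("would write JSON to:", *oj)
}
```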

0xt3j4s commented 1 year ago

Alright, I'll start working on it and let you know if I get stuck somewhere.

0xt3j4s commented 1 year ago

I had a doubt:

[screenshot]

edoardottt commented 1 year ago

Have you forked the main or the devel branch? I think devel is some commits ahead; you should use that branch for new features

0xt3j4s commented 1 year ago

No, actually the devel branch is three commits behind main. I cloned this repo and pulled in the changes from the unmerged pull request #106.

edoardottt commented 1 year ago

Ok, got it. The problem there is that that development branch has conflicts that must be resolved (the PR was frozen for a long time, and the devel branch got updates in the meanwhile). It's better if you clone the devel branch and apply the changes on your own

ocervell commented 1 year ago

Sorry for the delay on this PR, I didn't have much time on my hands and the linter wouldn't work locally... I can take it over if you want; the conflicts with main shouldn't be very hard to fix.

edoardottt commented 1 year ago

Sorry Olivier, but time passed and Tezas wanted to work on this. You can try to fix the conflicts and realign the changes with the requests in the previous messages. However, in my opinion it's better to avoid "competition" on the same issue and to distribute the workload across more devs :)
There is a lot of work to be done in this repo, so I suggest choosing another issue or creating new ones (even more than one! a lot of changes can be added). I will be very happy to help you contribute here

ocervell commented 1 year ago

Sure, no worries.

@Tezas-6174 please note that although -oJ might be a good option for real JSON file output, imo it's a separate feature from this PR, as it should be distinguished from JSON Lines:

cariddi -oJ output.json would write a formatted JSON file, while cariddi -json would output JSON Lines in real time, so the stream is consumable on the fly by e.g. jq, similar to tools like httpx or subfinder.

The two options are not mutually exclusive either: we should be able to write cariddi -oJ output.json -json, which would emit JSON lines on the console and also save the output to a file (or files) at the end of the run.
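The combined behavior described above (stream while running, flush at the end) could be sketched with a hypothetical Emitter type; none of these names exist in cariddi, this is only an illustration of the two flags coexisting:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Emitter streams JSON lines as results arrive (the -json idea) and
// buffers them for a formatted file at the end of the run (the -oJ
// idea). Sketch of the behavior discussed above, not cariddi's API.
type Emitter struct {
	Stream bool                     // -json: print each result immediately
	buffer []map[string]interface{} // kept for the final -oJ file
}

func (e *Emitter) Emit(rec map[string]interface{}) {
	if e.Stream {
		line, _ := json.Marshal(rec)
		fmt.Println(string(line)) // real-time JSON Lines, pipeable to jq
	}
	e.buffer = append(e.buffer, rec)
}

// Flush writes the buffered results as one formatted JSON document.
func (e *Emitter) Flush(path string) error {
	out, err := json.MarshalIndent(e.buffer, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, out, 0o644)
}

func main() {
	e := &Emitter{Stream: true}
	e.Emit(map[string]interface{}{"source": "href", "type": "url", "output": "https://example.com/path/"})
	// -oJ would then flush the whole run to a formatted file:
	_ = e.Flush(filepath.Join(os.TempDir(), "cariddi-out.json"))
}
```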

edoardottt commented 1 year ago

You know what? You're right. However, there are some constraints/problems with this implementation.

  1. cariddi can't output JSON lines as you designed them (secrets and other stuff included) in real time, because it parses all that stuff at the end of the crawl (only URLs are printed in real time, since duplicates are impossible there).
  2. It would be better to have a single implementation of the JSON output; then I agree it can be used for both the -oj and -json options.

Regarding point 1: maybe it's possible, but there could be duplicates in the results.

cc @Tezas-6174

ocervell commented 1 year ago

I respectfully disagree ;)

For 1., correct me if I'm wrong, but we actually hunt errors, secrets and infos in the c.OnResponse function, so the data is accessible on a per-response basis and thus outputtable (is this even a valid word?) in real time! This branch has the implementation (and is up to date with devel, in case @Tezas-6174 wants to pull from it).

For 2., imo JSON and JSON Lines are two different output formats. That said, you could implement both with the same flag: when nothing is passed to the -oJ flag it would output JSON lines, and when a file path is passed it would save results to actual files instead.

I find JSON Lines to be a great feature, since being able to parse results in real time for further actions (piping into other tools like gf or jq, or saving results to a NoSQL db) is a time-saver for sure. If you look at similar tools (ffuf, gobuster, katana), they all have a way to output in real time.
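The per-response hunting mentioned above (colly's c.OnResponse hook) can be illustrated with a stdlib-only stand-in; the response type, regex, and emit callback here are simplified placeholders, not cariddi's real ones:

```go
package main

import (
	"fmt"
	"regexp"
)

// response is a minimal stand-in for colly's *colly.Response.
type response struct {
	URL        string
	StatusCode int
	Body       []byte
}

// A toy secret pattern; cariddi ships its own regex set.
var awsKeyRe = regexp.MustCompile(`AKIA[0-9A-Z]{16}`)

// onResponse is where per-response hunting can happen, so each
// finding is emittable as a JSON line the moment the response lands,
// rather than only at the end of the crawl.
func onResponse(r *response, emit func(source, typ, output string)) {
	for _, m := range awsKeyRe.FindAllString(string(r.Body), -1) {
		emit(r.URL, "secretfinder", m)
	}
}

func main() {
	r := &response{
		URL:        "https://example.com/app.js",
		StatusCode: 200,
		Body:       []byte(`var k = "AKIAABCDEFGHIJKLMNOP";`),
	}
	onResponse(r, func(source, typ, output string) {
		fmt.Printf(`{"source":%q,"type":%q,"output":%q}`+"\n", source, typ, output)
	})
}
```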

ocervell commented 1 year ago

I forgot to mention the eventual duplicates we could get: I didn't see any in the tests I ran, but if there were some, maybe we could keep a cache of visited URLs and avoid re-crawling them (that's probably another feature, actually).

edoardottt commented 1 year ago

My fault, you are correct on both points ahahaha.

  1. I was thinking of the "final" output, e.g. if I find the same API key at two different URLs I want to output it only once. Anyway, with -json it's correct to print it whenever it's seen.
  2. I was thinking of the JSON creation, not the actual structs. But thinking carefully about it, it's so little code that it doesn't matter. Or maybe I don't even have a clear high-level view of the implementation.
  3. I'm pretty sure colly (the core scraping library used) has a feature for avoiding re-crawling already-visited URLs; anyway, we can test that.
  4. For both -oj and -json we surely need tests.
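On points 1 and 3, the de-duplication side is small either way; a hypothetical "seen" cache for keeping the final output unique could look like the sketch below (colly also has an AllowURLRevisit option governing re-visits, but that behavior is worth verifying against the version in use):

```go
package main

import (
	"fmt"
	"sync"
)

// seen deduplicates findings so the final (-oj) output lists each
// secret once, even if -json printed it per occurrence. Sketch only;
// the name and shape are hypothetical, not cariddi's code.
type seen struct {
	mu sync.Mutex
	m  map[string]bool
}

func newSeen() *seen { return &seen{m: make(map[string]bool)} }

// First reports whether key is new, recording it as a side effect.
// The mutex makes it safe to call from concurrent crawl callbacks.
func (s *seen) First(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m[key] {
		return false
	}
	s.m[key] = true
	return true
}

func main() {
	s := newSeen()
	for _, k := range []string{"key-A", "key-A", "key-B"} {
		if s.First(k) {
			fmt.Println("new finding:", k)
		}
	}
}
```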

@ocervell you can start working on that; if you have any issue, ping me and I'll be very happy to discuss changes or anything else

ocervell commented 1 year ago

No problem! I have updated my PR to match devel and pass the linting tests.

@Tezas-6174, since this PR is pretty much done, maybe open a new PR for the -oj flag to save output to JSON files?

Or work on improving this one; there might be improvements to make that I haven't yet seen.