bellingcat / open-source-research-notebooks

Jupyter notebooks helping open source researchers, journalists, and fact-checkers use command line tools and code projects for digital investigations.
MIT License

Notebook: holehe #3

Closed msramalho closed 9 months ago

msramalho commented 10 months ago

Tool: https://github.com/megadose/holehe Goals:

Any other ideas that seem relevant are welcome.

amithr commented 10 months ago

I'm working on a notebook for the holehe tool and I think I may have found a bug. The .csv functionality doesn't seem to work. I tried to export the results to a .csv using the "--csv" flag, both in Jupyter notebooks and on my local machine. This is the result:

[Screenshot: Screen Shot 2023-08-21 at 10 03 45 AM]

The results are successfully printed in the terminal, but the script throws an error when it comes to actually exporting a .csv. Can you confirm that this is actually a bug?

msramalho commented 10 months ago

I tested it locally without an issue, and it's the same version that is visible in your traceback, 1.61.

My guess is it will be an issue specific to your OS or the way the installation was made, but the good thing about using the Colab (etc) environments is that these types of errors should not happen there.

msramalho commented 10 months ago

One note on exporting data as files: you need to use Colab's filesystem to download them.

In future, if you want to download a folder's contents, you need to zip it first and then download the zip.
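
For reference, a minimal sketch of the zip-then-download step described above. It assumes a Colab runtime for the download part (which relies on google.colab.files.download); the function names zip_folder and download_in_colab and the default archive name "results" are illustrative and not taken from any notebook in this repository.

```python
import shutil
from pathlib import Path


def zip_folder(folder, archive_stem="results"):
    """Zip the contents of `folder` into `<archive_stem>.zip` and return its path."""
    # Keep the archive outside `folder` so it does not end up inside itself.
    archive = shutil.make_archive(archive_stem, "zip", root_dir=folder)
    return Path(archive)


def download_in_colab(path):
    """Trigger a browser download when running in Colab; return False elsewhere."""
    try:
        from google.colab import files  # only importable inside a Colab runtime
    except ImportError:
        return False
    files.download(str(path))
    return True
```

In a notebook cell this would be used as download_in_colab(zip_folder("output")), where "output" is a hypothetical folder holding the exported files. Outside Colab the second call simply returns False and the archive stays on disk.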

amithr commented 10 months ago

So strangely I receive the same error in Colab. I can see the correct terminal output, but then I receive this error related to the .csv functionality. This isn't the full error message, but more or less the same as what I saw on my Mac.

[image]

msramalho commented 10 months ago

Can you check this colab notebook?

Try it and also add the code that is not working for you, it might be that there is indeed an edge case in holehe and we can open an issue there if that's the case.

amithr commented 10 months ago

For some reason, when I ran this code, the .csv wasn't generated:

    %%shell
    holehe --csv test@gmail.com

as opposed to the code below:

    !holehe --csv test@gmail.com

Both result in the same error message, but the second block still generates the .csv. It's weird as theoretically they should result in the same output.

I realized that it did also generate a .csv on my local machine, the error message just made me assume that it wasn't working.

The output is a bit misleading, but it seems that the functionality still works.
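
Since the error message can hide a successful export, one way to check is to look for freshly written .csv files instead of trusting the terminal output. The sketch below assumes the export lands in the directory being inspected; csv_files_since and its slack_seconds parameter are hypothetical names introduced for illustration.

```python
import time
from pathlib import Path


def csv_files_since(directory, started_at, slack_seconds=2.0):
    """Return .csv files in `directory` modified at or after `started_at`.

    `started_at` is a timestamp from time.time(). `slack_seconds` allows for
    filesystems whose modification times are coarser than the system clock.
    """
    cutoff = started_at - slack_seconds
    return sorted(
        path
        for path in Path(directory).glob("*.csv")
        if path.stat().st_mtime >= cutoff
    )
```

Typical use in a notebook: store started = time.time() in one cell, run !holehe --csv test@gmail.com in the next, then call csv_files_since(".", started). A non-empty list means a file was written even though an error was printed.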

msramalho commented 10 months ago

Ah, thanks for the update. I actually wasn't aware of the %%shell option, so it's curious that it doesn't have the exact same meaning as !.

I'd say let's proceed with that. If users need help understanding what's going on, we can add a note or open an issue on the holehe repository.

amithr commented 10 months ago

Two questions: 1) How would someone get results for a list of email addresses and store them in a .csv? Should I be looking for a shell-based solution? I can't see this functionality in the codebase for the app or the documentation.

2) Should I be documenting/demonstrating all of the following options? I'm wondering how relevant some of these would be to the average not-so-technical user.

[image]

msramalho commented 10 months ago

  1. Holehe does not seem to accept multiple emails, so we can use something like !echo "email1@example.com email2@example.com" | xargs -n 1 holehe to call it once per address. Adding --only-used may make the output clearer, but a possible challenge will be merging the output CSVs.
  2. Let's be pragmatic rather than comprehensive. I think most of those options are not that relevant for someone who only wants to check emails; maybe cover only --only-used (for readability) and --csv. If you feel like it, --timeout could also be useful for more savvy users.
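
A Python alternative to the xargs approach, sketched under several assumptions: holehe is installed and on the PATH, each run with --csv writes one .csv with a header row into the working directory, and all exports share compatible columns. The helper names (holehe_command, run_for_emails, merge_csvs) and the added source_file column are illustrative, not part of holehe.

```python
import csv
import subprocess
from pathlib import Path


def holehe_command(email, only_used=True, csv_output=True):
    """Build the argument list for a single holehe run."""
    cmd = ["holehe", email]
    if only_used:
        cmd.append("--only-used")
    if csv_output:
        cmd.append("--csv")
    return cmd


def run_for_emails(emails, workdir="."):
    """Call holehe once per address, with `workdir` as the working directory."""
    for email in emails:
        # check=False: a non-zero exit code does not stop the loop.
        subprocess.run(holehe_command(email), cwd=workdir, check=False)


def merge_csvs(paths, merged_path):
    """Concatenate CSV files into one, tagging each row with its source file."""
    fieldnames = ["source_file"]
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                rows.append({"source_file": Path(path).name, **row})
    with open(merged_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, run_for_emails(["email1@example.com", "email2@example.com"], workdir="out") followed by merge_csvs(sorted(Path("out").glob("*.csv")), "merged.csv") would produce one combined file. The "out" folder must already exist, and writing merged.csv outside it keeps the merged file from being picked up on a later merge.
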
amithr commented 10 months ago

A few more questions: 1) With regards to how the application itself works, I understand that the password recovery functionality is used to figure out whether there is an account associated with the target email address. Given that that's the case, why is the --no-password-recovery option present?

2) I absolutely cannot figure out how to use the --timeout flag. Here's an example of how I've been trying to use it and what the output has been. Basically, almost all websites are shown as being rate-limited, despite the fact that the results look very different when the --timeout flag isn't used.

[image]

3) I am mentioning that rate limiting is an issue and that one solution to this issue, on a local machine, is to use a VPN. Are there any possible solutions you can think of for Google Colab? In general, I don't see how it's possible to "change" the IP address from which you are making a request on a hosted service.

4) Is there anything to say about rate-limiting besides the fact it exists and what a solution is? I think the number of requests that can be made before getting rate-limited is fairly arbitrary and depends on the website in question.

msramalho commented 10 months ago

Hey Amith,

  1. Looking at the code, it seems that when --no-password-recovery is set, some platforms are excluded, such as the ones containing: adobe, mail_ru, odnoklassniki, samsung. My guess is that the other platforms are checked in other ways, such as trying to create a new account. A password recovery attempt probably triggers an email to the user, so this option should prevent the user from being notified.
  2. I looked at the code and the X should indeed be triggered only on a rate limit, so unless you can dig further and find a justification there, I'd exclude it for now. I imagine there's a bug that confuses timeouts with rate limits.
  3. You can still use a VPN in a hosted service, or, probably more efficiently, a residential proxy. Neither solution is cheap and both need a specific vendor, so I'd prefer that we mention that those solutions exist, and maybe welcome people contributing new code cells if they have them, but not do that now, as it's not so trivial and can consume a lot of time.
  4. Not much more indeed, perhaps some external links? You are correct that it is website-dependent, and I strongly assume it depends on the IP as well, but getting around that is outside the scope of this notebook. Perhaps we can think of a new one in the future that is dedicated to demonstrating those circumvention strategies.
amithr commented 10 months ago

Thanks for the information. That was really helpful.

I think I have a very rough draft of the Jupyter Notebook for holehe ready. I'm not sure if it's what you're looking for, so any and all feedback you could give me would be great.

Here's a link. It's also in the folder you created.

msramalho commented 10 months ago

Hi Amith, just had a look and it looks great! :tada:

Here are some notes:

  1. The descriptions read easily and you made good use of markdown features.
  2. I like how you included an external service for performing a similar task!
  3. When I try to run the cells I get a warning about the content creator. This is normal, but maybe we can create an alias email to avoid revealing our personal/work emails to anyone who accesses the documents later. That can be done at a later stage; here's what I mean:

[image]

  4. Try not to include your personal email. I see you used sometest@gmail.com in some examples, so probably standardize those, or even suggest some very visible ones like trump@gmail.com (not sure it's his :])

  5. For the text that says "In the case of this notebook, just click on the folder icon to get access to your .csv, which can be downloaded and opened using Microsoft Excel, or any other spreadsheet-based program." and similar passages, I'd add a note or tag saying it refers specifically to Colab and not "this notebook", since other Jupyter environments may be different and ideally we will only maintain one notebook that works for all of them. Unless I'm guessing wrong and the main ones are like that.

  6. I fixed a couple of typos and added some info to the rate-limit FAQ.

Nothing to report/ask apart from the above, I think after these minor updates we can consider this notebook done and commit it to the repo to close this issue :)
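
One way to keep a single notebook usable across environments, as suggested in the note about Colab-specific instructions, is to detect Colab at runtime and show the matching hint. This is a sketch only: it relies on the common convention that the google.colab module is already loaded in a Colab kernel, and the in_colab and download_hint helpers, as well as the wording of the non-Colab hint, are assumptions made for illustration.

```python
import sys


def in_colab(modules=None):
    """Return True when the google.colab module is loaded (i.e. likely Colab)."""
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


def download_hint(modules=None):
    """Pick the instruction text that matches the current environment."""
    if in_colab(modules):
        return "Colab: click the folder icon in the left sidebar to find your .csv."
    # Assumption: elsewhere the export is written next to the notebook.
    return "Jupyter: look for the .csv in the notebook's working directory."
```

Calling print(download_hint()) in a cell then shows the Colab wording only where it applies. The optional modules argument exists so the check can be exercised outside Colab.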

amithr commented 10 months ago

Hi Miguel,

Thanks for the feedback! I fixed the issues that you mentioned (3 & 4, excluding setting up an alias :)).

My plan is to work on maigret next, as it has some similarities to holehe. I'd like to get a few of these notebooks under my belt before contributing to the overall guide on how to actually use the Jupyter Notebooks.

msramalho commented 10 months ago

Hi Amith, thanks for that. I've opened this issue so we don't forget to fix the alias detail: https://github.com/bellingcat/open-source-research-notebooks/issues/8

Maigret sounds like a good follow-up here.

I planned to start the draft for the article this week, but I will defer it to next week. I'll first take a stab at it myself, and then maybe you can see my big-picture perspective and complement it with yours.

msramalho commented 10 months ago

So final step to close this issue is to add the notebook to the github repository and to the README.

amithr commented 10 months ago

> So final step to close this issue is to add the notebook to the github repository and to the README.

This may be a stupid question, but are we adding the notebook to the holehe repository or to the open-source-research-notebooks repository?

msramalho commented 10 months ago

To this one :)

We may at some point open an issue in these other repos to see if they want to include a link or the notebook itself, but only after it's public.