bellingcat / open-source-research-notebooks

Jupyter notebooks helping open source researchers, journalists, and fact-checkers use command line tools and code projects for digital investigations.
MIT License

Notebook: maigret #4

Closed (msramalho closed this 9 months ago)

msramalho commented 10 months ago

Tool: https://github.com/soxoj/maigret

Goals:

Any other ideas that seem relevant are welcome.

amithr commented 10 months ago

Hi Miguel,

I've made good progress on the Maigret Notebook and I have some questions.

1) How can I get access to the possible tags related to site genre? There's a link in the documentation saying that they are available in the source code, but after scanning the linked code, I couldn't find them.

2) Is there anywhere I can find a list of site engines that can be used as tags?

3) Is there any content that you think should be added to the FAQ (which is a work in progress) or to the notebook in general?

Here's a link .

Overall, I still have some proofreading and testing to do, but things are looking good so far.

msramalho commented 10 months ago

Hi Amith,

That's great to read!

Regarding 1. and 2.: the tags are documented with a "Warning: tags markup is not stable now." notice, but it's unclear what they mean by unstable.

This seems to be the list of sites and the tags they have: https://github.com/soxoj/maigret/blob/main/sites.md. It's not grouped by tags, though. You can also parse them from here: https://github.com/soxoj/maigret/blob/main/maigret/resources/data.json
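
For anyone who wants the tags grouped rather than listed per site, here is a minimal sketch of the parsing idea. It assumes that data.json has a top-level "sites" object whose entries may carry an optional "tags" list; that layout is an assumption worth checking against the actual file, and the site names in the sample are made up.

```python
import json
from collections import defaultdict


def group_sites_by_tag(data):
    """Map each tag to the sorted list of site names that carry it."""
    by_tag = defaultdict(list)
    # Assumption: site entries live under a top-level "sites" key and
    # each entry may have an optional "tags" list of strings.
    for name, info in data.get("sites", {}).items():
        for tag in info.get("tags", []):
            by_tag[tag].append(name)
    # Sort tags and site names so the output is stable between runs.
    return {tag: sorted(names) for tag, names in sorted(by_tag.items())}


# Tiny made-up sample standing in for the real data.json contents.
sample = json.loads("""
{
  "sites": {
    "ExamplePhotos": {"tags": ["photo", "us"]},
    "ExampleForum": {"tags": ["forum"]},
    "ExampleGallery": {"tags": ["photo"]},
    "ExampleNoTags": {}
  }
}
""")

grouped = group_sites_by_tag(sample)
for tag, names in grouped.items():
    print(f"{tag}: {', '.join(names)}")
```

To run it against the real file, download data.json and replace the sample with `json.load(open("data.json", encoding="utf-8"))`.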

  3. FAQs look good to me; rate-limiting is the most relevant challenge, in a similar fashion to holehe.

amithr commented 10 months ago

Hi Miguel,

I have another question. I have a section on querying emails and multiple emails. However, when I execute these queries, I often get the following message:

image

Given that the majority of websites return errors in such cases, do you think it's still worth mentioning this functionality?

msramalho commented 10 months ago

Ah! I had thought that maigret would only work with usernames and not emails, but seeing your notebook I assumed otherwise. I just went through their README and docs again, and there's no mention of using it for emails; it will only work for usernames. So that part of the notebook needs to be cut, and that should be why you are getting this error. Even if it works for 1/2 of the entries, let's not include it, since that's also the use case for holehe.
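
If a notebook accepts a mixed list of identifiers, one way to keep emails away from maigret is to split the input first. This is only a rough sketch: `split_usernames_and_emails` is a hypothetical helper, not part of maigret, and the "contains an @" test is a deliberately crude stand-in for real email validation.

```python
def split_usernames_and_emails(identifiers):
    """Separate email-looking strings from plain usernames."""
    usernames, emails = [], []
    for raw in identifiers:
        value = raw.strip()
        if not value:
            continue  # ignore blank entries
        # Crude heuristic (an assumption): anything with "@" is an email.
        if "@" in value:
            emails.append(value)
        else:
            usernames.append(value)
    return usernames, emails


usernames, emails = split_usernames_and_emails(
    ["alice", " bob@example.com ", "", "carol_99"]
)
print("usernames:", usernames)
print("emails:", emails)
```

The usernames can then go to maigret, while the emails are left for a tool built for them, such as holehe.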

amithr commented 10 months ago

Hi Miguel,

I think I'm finished with the maigret notebook. Could you take a look and let me know if there are any changes I should make?

I removed any references to querying email addresses.

msramalho commented 9 months ago

Just went through the notebook, I like it and found it quite clear :100:

Some areas like the "Site engine tags" are not so clear to me, but I think those are the deeper details that people can search on their own. My changes:

  1. made minor text edits across the document
  2. changed pip3 to pip. Any reason to go back on this? The first time I ran it I had to restart the environment, but later on it worked quite well.

Looks ready for a PR!

amithr commented 9 months ago

Sounds good!

To be honest, I don't really understand why searching by site engine tags would be a useful feature, but I added it for completeness. Perhaps there's a niche use case where it comes in handy.

msramalho commented 9 months ago

Fair assessment. I think it can help when you get 200 valid results that are hard to verify individually and then decide to filter by a tag that can help you prioritize them (either by filtering in or by filtering out), but it's not something I've done :)
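
The filtering in/out idea can be sketched as a small post-processing step over the results. Everything here is illustrative: the result records, their "site", "url" and "tags" fields, and the `filter_by_tags` helper are assumptions for the example, not maigret's actual output format. The maigret README also shows a `--tags` option for restricting the search itself; the sketch below filters afterwards instead.

```python
def filter_by_tags(results, include=None, exclude=None):
    """Keep results matching any `include` tag and none of the `exclude` tags."""
    include = set(include or [])
    exclude = set(exclude or [])
    kept = []
    for item in results:
        tags = set(item.get("tags", []))
        # Filtering in: when include tags are given, require an overlap.
        if include and not (tags & include):
            continue
        # Filtering out: drop anything carrying an excluded tag.
        if tags & exclude:
            continue
        kept.append(item)
    return kept


# Made-up records standing in for a long list of valid hits.
results = [
    {"site": "ExamplePhotos", "url": "https://photos.example/alice", "tags": ["photo"]},
    {"site": "ExampleForum", "url": "https://forum.example/alice", "tags": ["forum", "gaming"]},
    {"site": "ExampleDating", "url": "https://dating.example/alice", "tags": ["dating"]},
    {"site": "ExampleNoTags", "url": "https://other.example/alice"},
]

priority = filter_by_tags(results, include=["photo", "dating"])
without_gaming = filter_by_tags(results, exclude=["gaming"])
print([r["site"] for r in priority])
print([r["site"] for r in without_gaming])
```

Filtering in gives a short list to verify first, while filtering out removes categories that are unlikely to matter for a given investigation.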