botlabio / deny-hosting-IP

A system for anyone to easily identify / block over 130,000,000 cloud and hosting company IP addresses.

source of list is opaque #1

Open jimdigriz opened 7 years ago

jimdigriz commented 7 years ago

A list of ~3000 CIDRs with no way to reconstruct it, and no explanation of where it comes from, is almost unusable. Someone would have to re-validate all the IPs, which is as much effort as building your own datacenter/hosting list from scratch.

Without knowing this, an ad-tech company deploying this would be blindly harming their advertising clients' reach (https://en.wikipedia.org/wiki/Reach_(advertising)); especially as there are no recommendations, such as that this should be coupled to other signals (for example http://geocar.sdf1.org/browser-verification.html), before concluding "BOT BOT BOT" :)

Can an explanation be provided of where this comes from, and of how to reconstruct and realistically re-evaluate the list regularly, as IPv4 address reassignments are happening in the Real Life(tm)?

Another list out there is github:client9/ipcat which though not perfect, does try to state where/why things are on the list.

My concern, having seen and consulted for various adtech companies, is that this list was bootstrapped from a heatmap/statistics/machine-learning model, which would make for a large number of false positives in the list.

mikkokotila commented 7 years ago

In terms of reconstructing the list, the README has detailed instructions, tested many times, for how a person with minimal dev skills can expand it into a searchable MySQL database of roughly 130 million IP addresses.

It is definitely recommended that datacenter IP filtering be used as a complement to other methods. In a typical scenario, datacenter traffic represents a low single-digit percentage of a given dataset. There are far more substantial segments of invalid traffic present in most datasets.

When it comes to concluding “BOT”, we tried to make sure that the README and particularly the FAQ section would provide objective information covering the topic. It should be very clear that the fact that the visit comes from a datacenter does not mean it’s a “BOT”. It just means it’s from a datacenter.

The list was compiled using a simple method:

1) identify largest hosting companies in the world (at the moment top50)
2) remove companies that also provide ISP services
3) use public sources to identify the IP ranges of the companies
4) collate those IP ranges in to one list of CIDRs

There is zero use of algorithms.
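Step 4 above could look something like the following. This is only a minimal sketch and not the code used to build the list: the `collate_cidrs` helper and the company names are hypothetical, and the input ranges are documentation-reserved prefixes rather than real hosting data. It relies on Python's standard `ipaddress` module to merge adjacent and nested ranges.

```python
import ipaddress


def collate_cidrs(ranges_by_company):
    """Merge per-company CIDR strings into one collapsed list of CIDRs."""
    networks = []
    for cidrs in ranges_by_company.values():
        for cidr in cidrs:
            # strict=False tolerates entries with host bits set, e.g. 192.0.2.1/24
            networks.append(ipaddress.ip_network(cidr, strict=False))

    # collapse_addresses() cannot mix IP versions, so handle them separately
    v4 = [n for n in networks if n.version == 4]
    v6 = [n for n in networks if n.version == 6]
    merged = list(ipaddress.collapse_addresses(v4))
    merged += list(ipaddress.collapse_addresses(v6))
    return [str(n) for n in merged]


# Hypothetical input: documentation ranges (RFC 5737), not real hosting data
example = {
    "hosting-a": ["192.0.2.0/25", "192.0.2.128/25"],      # two adjacent halves
    "hosting-b": ["198.51.100.0/24", "198.51.100.64/26"],  # nested range
}
collated = collate_cidrs(example)
print(collated)  # ['192.0.2.0/24', '198.51.100.0/24']
```

Collapsing keeps the published list short without changing which addresses it covers, which also makes diffs between list revisions easier to review.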

I appreciate you pointing out the information missing from the README. Based on your feedback I will also include a “known weaknesses” section at the top of the page.

It’s great that there are multiple efforts to create a list and mechanisms to deploy those lists easily without headaches. Actually we’ve used github:client9/ipcat https://github.com/client9/ipcat/ in our research as well.

@manigandham also has a list, but I don’t think he has it on GitHub. As I understood it, his approach seemed the most robust, i.e. automatically syncing with the lists the big hosting companies themselves publish.
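As an illustration of what syncing with a provider-published list might involve, here is a rough sketch. It is not @manigandham's tool, which is not shown in this thread. It assumes a JSON document shaped like the `ip-ranges.json` file AWS publishes (keys `prefixes`/`ip_prefix` and `ipv6_prefixes`/`ipv6_prefix`); the `extract_prefixes` helper is hypothetical, and the embedded sample uses documentation prefixes, not real AWS data.

```python
import ipaddress
import json

# Trimmed sample in the shape of AWS's ip-ranges.json (assumed format);
# the prefixes are documentation ranges, not real provider data.
SAMPLE = """
{
  "prefixes": [
    {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "EC2"},
    {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "AMAZON"}
  ],
  "ipv6_prefixes": [
    {"ipv6_prefix": "2001:db8::/32", "region": "us-east-1", "service": "EC2"}
  ]
}
"""


def extract_prefixes(document):
    """Return the unique CIDRs found in an ip-ranges.json-shaped document."""
    data = json.loads(document)
    found = set()
    for entry in data.get("prefixes", []):
        found.add(ipaddress.ip_network(entry["ip_prefix"]))
    for entry in data.get("ipv6_prefixes", []):
        found.add(ipaddress.ip_network(entry["ipv6_prefix"]))
    # IPv4 first, then IPv6, each ordered by address
    ordered = sorted(found, key=lambda n: (n.version, n))
    return [str(n) for n in ordered]


prefixes = extract_prefixes(SAMPLE)
print(prefixes)  # ['192.0.2.0/24', '2001:db8::/32']
```

In a real sync job the document would be downloaded on a schedule instead of embedded, and other providers publish their ranges in different formats, so each source would need its own small parser.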



jimdigriz commented 7 years ago

This is great to know, thanks!

If it is not too much work, can you also add:

manigandham commented 7 years ago

Thanks @mikkokotila - we use our own tool for the major cloud IPs and combine it with the top ranges from the ipcat list. Can share code if necessary but it's not really open-source ready. The ipcat repo already has similar code too.

Let me know if I can help with anything else.

jimdigriz commented 7 years ago

'ready' rarely matters :-)

I have some Erlang to do AWS, Google, Microsoft and to also pull in some bogon lists...happy to share. I am more interested in picking out the sources your code pulls from than in the actual implementation; much in the same way that I suspect folks will not care much for our Erlang implementation :-)

What might interest folks is that we store each range in a balanced tree[1], using the upper value as the key, so you can check in one lookup whether there is a match. Helpfully, it also Just Works(TM) for IPv6.

[1] http://erlang.org/doc/man/gb_trees.html
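For anyone who does not read Erlang, the idea can be sketched roughly as follows. This is not our implementation: it swaps the `gb_trees` balanced tree for a sorted Python list searched with `bisect`, the `RangeIndex` class is hypothetical, and it assumes the ranges do not overlap (for example because they were collapsed first). Keys are `(version, integer address)` tuples so IPv4 and IPv6 can share one index.

```python
import bisect
import ipaddress


class RangeIndex:
    """Non-overlapping IP ranges, indexed by each range's upper bound."""

    def __init__(self, cidrs):
        entries = []
        for cidr in cidrs:
            net = ipaddress.ip_network(cidr)
            upper = (net.version, int(net.broadcast_address))
            lower = (net.version, int(net.network_address))
            entries.append((upper, lower, cidr))
        entries.sort()
        self._entries = entries
        self._uppers = [entry[0] for entry in entries]

    def lookup(self, address):
        """Return the CIDR containing `address`, or None if there is no match."""
        ip = ipaddress.ip_address(address)
        key = (ip.version, int(ip))
        # First range whose upper bound is >= the address: one search
        i = bisect.bisect_left(self._uppers, key)
        if i == len(self._uppers):
            return None
        _, lower, cidr = self._entries[i]
        # It is a match only if the address is also >= the lower bound
        return cidr if lower <= key else None


index = RangeIndex(["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"])
print(index.lookup("192.0.2.77"))   # 192.0.2.0/24
print(index.lookup("203.0.113.5"))  # None
print(index.lookup("2001:db8::1"))  # 2001:db8::/32
```

Because the upper bound is the key, the first entry at or above the address is the only possible candidate, so a single search plus one comparison settles the match.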



mikkokotila commented 7 years ago

Thanks a lot, this is really great :) @jimdigriz sorry again for the delay, I had to take some time to think about where this project should go. I really appreciate that you invested in initiating this dialogue, and I will add the things you suggested.

I guess the usefulness and adoption of such a solution will come down, as you pointed out, to the transparency and reliability of the method, and then to how easy it is to roll out, whether for an individual researcher doing a one-off project or for a leading adtech company implementing it as part of their stack.

One criticism we hear at the trade body level is fragmentation. Because there are so many options, none has really got the attention required for "maturity". IMO maturity is exactly what the industry needs for wide adoption. What if we start a new repo under a new random org to address this, first by creating a README covering everything that is already out there: the methods we know for identifying datacenter IPs, a reference table to literature about the topic, etc. Then maybe that could lead naturally to a wider "the cloudbot project". What do you think?