Closed sbenthall closed 2 years ago
See #88 for signature detection issue
see also #20
Now we have ietfdata as an alternative data source for affiliations. This can be used as training data for machine learning or other entity resolution techniques.
These are the top 50 most frequent email domains for the senders to the working group httpbisa:
gmail.com 6530
mnot.net 4117
gmx.de 3606
phk.fre 1193
treenet.co 1006
qbik.com 843
henriknordstrom.net 790
google.com 732
chromium.org 700
gbiv.com 697
ietf.org 629
microsoft.com 614
intalio.com 420
apple.com 391
shareable.org 375
opera.com 366
greenbytes.de 329
mozilla.com 329
adambarth.com 313
ducksong.com 305
gmx.net 282
kerwin.net 266
briansmith.org 256
laposte.net 256
xpasc.com 255
cisco.com 246
w3.org 243
belshe.com 237
redhat.com 206
iaea.org 203
lukasa.co 203
cs.tcd 202
it.aoy 202
twitter.com 190
haxx.se 183
ericsson.com 161
crf.can 159
xyzzy.cla 150
cryptonector.com 146
zinks.de 134
elisanet.fi 127
rtfm.com 122
isode.com 120
computer.org 115
osafoundation.org 109
acm.org 108
bisonsystems.net 104
bbc.co 99
textuality.com 99
nygren.org 97
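For reference, a count like the one above can be produced with a few lines of Python. This is only a minimal sketch, not the code that generated these numbers: it assumes the input is one From header per message, and the names `domain_of`, `top_domains` and the sample addresses are made up for illustration.

```python
from collections import Counter
from email.utils import parseaddr


def domain_of(from_header):
    """Return the lowercased domain of a From header, or None if there is none."""
    # parseaddr handles both "Name <user@host>" and bare "user@host" forms.
    address = parseaddr(from_header)[1]
    _, sep, domain = address.strip().rpartition("@")
    return domain.lower() if sep and domain else None


def top_domains(from_headers, n=50):
    """Count messages per sender domain and return the n most frequent."""
    counts = Counter(d for d in map(domain_of, from_headers) if d)
    return counts.most_common(n)


# Hypothetical sample: one entry per message sent to the list.
sample = [
    "Alice <alice@example.com>",
    "bob@example.com",
    "Carol <carol@Example.org>",
    "not-an-address",
]
print(top_domains(sample))
```

Entries without an `@` are dropped rather than counted, so malformed headers do not end up as a fake domain.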
Obviously IETF is an uncommon space compared to many in that several of these domain names are individuals I recognize, rather than companies or email services. (It's notable that Mark, one of the chairs, has almost as many emails as all gmail users combined!)
That some prominent individuals use their own domains in their email addresses also suggests that that limitation can be addressed by accessing their affiliation through other sources (affiliation listed in authored documents, or through a specific profile in the IETF datatracker).
I wonder if we can actually leverage the semantics of TLDs -- .com, .org, .net -- for our analysis.
It's a good point that there's a 1-to-many relationship between domains and people, and that the mapping is very uneven. That could be a useful cue for whether a domain belongs to an organization, vs. a generic email host or a single person.
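One rough way to look at that cue is to count how many distinct sender addresses appear under each domain. The sketch below assumes bare email addresses as input and treats the local part (before the `@`) as a stand-in for a person, which will be wrong for anyone who uses several addresses. The function name `senders_per_domain` and the sample data are hypothetical.

```python
from collections import defaultdict


def senders_per_domain(addresses):
    """Map each domain to the number of distinct local parts seen under it."""
    people = defaultdict(set)
    for address in addresses:
        local, sep, domain = address.strip().lower().rpartition("@")
        if sep and local and domain:
            people[domain].add(local)
    return {domain: len(names) for domain, names in people.items()}


# Hypothetical sample: a personal domain vs. a domain with several senders.
sample = [
    "me@personal.example",
    "me@personal.example",
    "ann@corp.example",
    "bo@corp.example",
    "cy@corp.example",
]
print(senders_per_domain(sample))
```

A domain with exactly one sender is a candidate personal domain. Telling an organization apart from a generic email host needs something more, since both have many senders.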
Extracting affiliation <-> email mappings from the datatracker would certainly help with this. The only issue is that it is a benefit that's restricted to the IETF domain. It could be used for training some other classifier though.
@npdoty @nllz I wonder if you could give me your gut check on this.
I've been working on a metric to guess whether a domain name corresponds to a significant organization vs. an individual.
When you eyeball this plot, do you see an intuitive difference between the domain names at the top of the plot vs. at the bottom of the plot?
What is that intuitive difference?
Hi Seb,
Thanks for this. The way I interpret this graph is that it shows the number of different people who email from each domain.
For example: anvilwalrusden.com is the personal domain of Andrew Sullivan, former IAB chair and a very active long-time IETF participant. Google, Oracle, Microsoft, Nokia, etc., on the other end of the spectrum, have a lot of different people working in the IETF, and thus many different email addresses contributing to discussions from that domain.
Hope this helps.
Best,
Niels
On 03-03-2021 14:57, Sebastian Benthall wrote:
image https://user-images.githubusercontent.com/68752/109816144-67e67c00-7bfe-11eb-9725-aca78f1c26b5.png
Thank you. Yes, that's what I was hoping it would get at.
For these people like Andrew Sullivan and Mark Nottingham, do they use a personal domain to signify that they are participating in their personal capacity, or in their capacity as IETF chairs? I.e., do they not work for other organizations while performing these roles?
I was hoping we could use this metric to distinguish email domains that are associated with organizations with institutional involvement in IETF (Google, Oracle, Microsoft) from those of individuals who hold leadership positions. It does seem to be getting there.
One thing I find confusing is that the gmail.com and hotmail.com domains have a lower entropy (this metric) than google.com. I'm confused by this. It suggests that either google.com is used by a much more general population than gmail.com, or else that the company Google has more people working in this working group (httpbisa, in this case) than people with gmail addresses. Which is it?
These are the top 20 domains by this metric of involvement:
google.com 3.140787
gmail.com 2.984459
hotmail.com 2.553237
oracle.com 2.058373
w3.org 1.758979
yahoo.com 1.744146
microsoft.com 1.699436
akamai.com 1.585919
ericsson.com 1.512764
nokia.com 1.475076
us.ibm.com 1.430280
opera.com 1.395972
maebashi-it.org 1.273028
fb.com 1.241696
apple.com 1.192743
cisco.com 1.187929
ietf.org 1.159670
chromium.org 1.112767
cloudflare.com 1.088900
csail.mit.edu 1.054920
It looks like this may be because the gmail.com emails are actually dominated by a few individuals, making it a much more skewed distribution than google.com, which appears to spread responsibility for engaging the working group more evenly across its team. It's quite interesting, actually. Who is Martin Thompson?
This is the notebook where I've defined the 'domain entropy' metric and computed these results.
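To make the skew argument concrete, here is a minimal sketch of an entropy-style metric. It is an assumption about the general shape of such a metric, not the notebook's actual definition: it computes the Shannon entropy (natural log) of how a domain's messages are distributed across its senders. The function name `domain_entropy` and the sample addresses are hypothetical.

```python
import math
from collections import Counter, defaultdict


def domain_entropy(addresses):
    """Shannon entropy (in nats) of message counts across senders, per domain.

    `addresses` holds one sender address per message. This is an assumed
    definition for illustration; the notebook may normalize differently.
    """
    per_domain = defaultdict(Counter)
    for address in addresses:
        local, sep, domain = address.strip().lower().rpartition("@")
        if sep and local and domain:
            per_domain[domain][local] += 1

    entropy = {}
    for domain, counts in per_domain.items():
        total = sum(counts.values())
        entropy[domain] = -sum(
            (c / total) * math.log(c / total) for c in counts.values()
        )
    return entropy


# Hypothetical sample: four senders with one message each, vs. four senders
# where one of them wrote 97 of the 100 messages.
even = ["a@even.example", "b@even.example", "c@even.example", "d@even.example"]
skewed = ["a@skewed.example"] * 97 + [
    "b@skewed.example",
    "c@skewed.example",
    "d@skewed.example",
]
scores = domain_entropy(even + skewed)
print(scores["even.example"] > scores["skewed.example"])  # True
```

Under this definition, a domain whose traffic is spread evenly over four senders scores ln 4, while a domain with the same number of senders but one dominant poster scores much lower. That matches the pattern described above, where a few heavy posters pull a domain's score down.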
Yeah, while gmail.com may tend to be lots of different people with different affiliations when considered across all of IETF, say, it could be dominated by prominent individuals in one particular working group. Martin is the editor of the HTTP/2 spec and works for Mozilla: https://lowentropy.net/about/
I'm currently looking at three data sources to match up organizational affiliations to an email address:
At the 111 hackathon, I started using this repo that collects email providers into a CSV: https://github.com/edwin-zvs/email-providers (And I think there are going to be many slightly similar data collection projects in this category, but that one at least has been updated in the last year and has a process for adding more domains, etc.)
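A list like that could be used to flag generic email hosts before guessing at affiliations. The sketch below assumes the list has been reduced to one domain per line. The actual layout of that repository's CSV is not checked here, and the helper names and sample domains are hypothetical.

```python
import io


def load_providers(lines):
    """Build a set of provider domains from an iterable of text lines.

    Assumes one domain per line; blank lines and '#' comments are skipped.
    """
    providers = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            providers.add(entry)
    return providers


def is_generic_provider(domain, providers):
    """True if the domain is a known email provider rather than an employer."""
    return domain.strip().lower() in providers


# Hypothetical stand-in for the downloaded provider list.
provider_file = io.StringIO("# sample list\ngmail.com\nhotmail.com\n\ngmx.de\n")
providers = load_providers(provider_file)

print(is_generic_provider("Gmail.com", providers))
print(is_generic_provider("mozilla.com", providers))
```

Domains on the list would then be excluded from any domain-to-organization mapping and resolved through another source instead.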
But yes, it definitely shouldn't be done manually!
The gitdm tool seems to have designed a nice text-based format for representing company/employer relations.
It's two kinds of files: the dev.txt files with entries of the form:
username: email, email, email organization from (start time) until (end time)
and then co.txt with entries:
company_name: username: email, email, email from (start_time) until (end time)
This seems pretty good.
In general, it seems like the data we need is a bipartite graph relating organizations and individuals, where each relation has a start time and an optional end time.
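A minimal sketch of that structure, under some assumptions: each edge of the bipartite graph is a record with a person, an organization, a start date and an optional end date, and the end date is treated as exclusive. The names `Affiliation` and `organizations_for` and the sample records are hypothetical, not part of gitdm or any existing code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Affiliation:
    """One edge between a person and an organization, valid for an interval."""

    person: str
    organization: str
    start: date
    end: Optional[date] = None  # None means the affiliation is still current

    def active_on(self, day):
        """True if this affiliation covers the given day (end is exclusive)."""
        return self.start <= day and (self.end is None or day < self.end)


def organizations_for(affiliations, person, day):
    """Return the sorted organizations a person was affiliated with on a day."""
    return sorted(
        {
            a.organization
            for a in affiliations
            if a.person == person and a.active_on(day)
        }
    )


# Hypothetical records for illustration only.
records = [
    Affiliation("alice", "Example Corp", date(2010, 1, 1), date(2015, 6, 1)),
    Affiliation("alice", "Example Labs", date(2015, 6, 1)),
    Affiliation("bob", "Example Corp", date(2012, 3, 1)),
]

print(organizations_for(records, "alice", date(2014, 5, 20)))
print(organizations_for(records, "alice", date(2020, 1, 1)))
```

Keying the lookup on a message's date means the same sender can resolve to different organizations over the life of an archive. Returning a list also allows for overlapping affiliations.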
For a mail sender, have an automated way of identifying their institutional affiliation.
Possible sources of information: