identify institutional affiliation of mail senders

sbenthall commented 10 years ago

For a mail sender, have an automated way of identifying their institutional affiliation.

Possible sources of information:

the email domain
email signatures
LinkedIn or other external data lookup

sbenthall commented 9 years ago

See #88 for signature detection issue

sbenthall commented 3 years ago

see also #20

sbenthall commented 3 years ago

Now we have ietfdata as an alternative data source for affiliations. This can be used for training data for machine learning or other entity resolution techniques.

sbenthall commented 3 years ago

These are the top 50 most frequent email domains for the senders to the working group httpbisa:

gmail.com              6530
mnot.net               4117
gmx.de                 3606
phk.fre                1193
treenet.co             1006
qbik.com                843
henriknordstrom.net     790
google.com              732
chromium.org            700
gbiv.com                697
ietf.org                629
microsoft.com           614
intalio.com             420
apple.com               391
shareable.org           375
opera.com               366
greenbytes.de           329
mozilla.com             329
adambarth.com           313
ducksong.com            305
gmx.net                 282
kerwin.net              266
briansmith.org          256
laposte.net             256
xpasc.com               255
cisco.com               246
w3.org                  243
belshe.com              237
redhat.com              206
iaea.org                203
lukasa.co               203
cs.tcd                  202
it.aoy                  202
twitter.com             190
haxx.se                 183
ericsson.com            161
crf.can                 159
xyzzy.cla               150
cryptonector.com        146
zinks.de                134
elisanet.fi             127
rtfm.com                122
isode.com               120
computer.org            115
osafoundation.org       109
acm.org                 108
bisonsystems.net        104
bbc.co                   99
textuality.com           99
nygren.org               97
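For reference, a count like this can be reproduced with a few lines of Python. This is only a minimal sketch, not the code that produced the numbers above: it assumes the raw `From` headers are available as a list of strings, and the helper names `sender_domain` and `top_domains` are made up for illustration.

```python
from collections import Counter
from email.utils import parseaddr


def sender_domain(from_header: str) -> str:
    """Return the lowercased domain of a From header, or '' if none is found."""
    _, address = parseaddr(from_header)
    if "@" not in address:
        return ""
    return address.rsplit("@", 1)[1].lower()


def top_domains(from_headers, n=50):
    """Count messages per sender domain and return the n most frequent."""
    counts = Counter(sender_domain(h) for h in from_headers)
    counts.pop("", None)  # drop headers with no parseable address
    return counts.most_common(n)


# Made-up headers, for illustration only.
headers = [
    "Alice Example <alice@example.org>",
    "bob@Example.ORG",
    "Carol <carol@example.com>",
]
print(top_domains(headers, n=2))
```

Lowercasing the domain matters here because the same domain often shows up with different capitalization across messages.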

npdoty commented 3 years ago

Obviously IETF is an uncommon space compared to many in that several of these domain names are individuals I recognize, rather than companies or email services. (It's notable that Mark, one of the chairs, has almost as many emails as all gmail users combined!)

That some prominent individuals use their own domains in their email addresses also suggests that that limitation can be addressed by accessing their affiliation through other sources (affiliation listed in authored documents, or through a specific profile in the IETF datatracker).

sbenthall commented 3 years ago

I wonder if we can actually leverage the semantics of TLDs -- .com, .org, .net -- for our analysis.

It's a good point that there's a 1-to-many relationship between domains and people, but that it's a very uneven mapping. That could be a useful cue for whether a domain is for an organization, vs. a generic email host or a single person.
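To make that cue concrete, here is a rough sketch of how the unevenness could be measured per domain. It assumes one bare email address per message; `domain_profile` is a hypothetical helper, not something that exists in bigbang.

```python
from collections import Counter, defaultdict


def domain_profile(addresses):
    """Map each domain to (distinct senders, share of messages from its top sender).

    `addresses` holds one bare email address per message.
    """
    per_domain = defaultdict(Counter)
    for address in addresses:
        normalized = address.lower()
        local, _, domain = normalized.rpartition("@")
        if local and domain:
            per_domain[domain][normalized] += 1

    profile = {}
    for domain, senders in per_domain.items():
        total = sum(senders.values())
        top_share = senders.most_common(1)[0][1] / total
        profile[domain] = (len(senders), top_share)
    return profile


# Made-up addresses: one single-person domain, one with several senders.
addresses = ["a@solo.example"] * 3 + [
    "x@corp.example",
    "y@corp.example",
    "z@corp.example",
    "x@corp.example",
]
profile = domain_profile(addresses)
print(profile)
```

A domain with one sender and a top-sender share of 1.0 looks like a personal domain; many senders with a low top-sender share looks more like an organization or a generic email host, which this cue alone cannot tell apart.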

Extracting affiliation <-> email mappings from the datatracker would certainly help with this. The only issue is that it is a benefit that's restricted to the IETF domain. It could be used for training some other classifier though.

sbenthall commented 3 years ago

@npdoty @nllz I wonder if you could give me your gut check on this.

I've been working on a metric to guess whether a domain name corresponds to a significant organization vs. an individual.

When you eyeball this plot, do you see an intuitive difference between the domain names at the top of the plot vs. at the bottom of the plot?

What is that intuitive difference?

nllz commented 3 years ago

Hi Seb,

Thanks for this. The way I interpret this graph is that it shows the number of different people who email from each domain.

For example: anviwalrusden.com is the personal domain of Andrew Sullivan: former IAB chair and very active long time IETF participant. Google, Oracle, Microsoft, Nokia, etc on the other end of the spectrum have a lot of different people working in IETF, and thus many different email addresses contributing to discussions from that domain.

Hope this helps.

Best,

Niels

Plot under discussion: https://user-images.githubusercontent.com/68752/109816144-67e67c00-7bfe-11eb-9725-aca78f1c26b5.png


sbenthall commented 3 years ago

Thank you. Yes, that's what I was hoping it would get at.

For these people like Andrew Sullivan and Mark Nottingham, do they use a personal domain to signify that they are participating in their personal capacity, or in their capacity as IETF chairs? I.e., do they not work for other organizations while performing these roles?

I was hoping we could use this metric to distinguish email domains that belong to organizations with institutional involvement in IETF (Google, Oracle, Microsoft) from the domains of individuals who hold leadership positions. It does seem to be getting there.

One thing I find confusing is that the gmail.com and hotmail.com domains have a lower entropy (this metric) than google.com. It suggests that either google.com is used by a much more general population than gmail.com, or that the company Google has more people working in this working group (httpbisa, in this case) than there are people with gmail addresses. Which is it?

These are the top 20 domains by this metric of involvement:

google.com         3.140787
gmail.com          2.984459
hotmail.com        2.553237
oracle.com         2.058373
w3.org             1.758979
yahoo.com          1.744146
microsoft.com      1.699436
akamai.com         1.585919
ericsson.com       1.512764
nokia.com          1.475076
us.ibm.com         1.430280
opera.com          1.395972
maebashi-it.org    1.273028
fb.com             1.241696
apple.com          1.192743
cisco.com          1.187929
ietf.org           1.159670
chromium.org       1.112767
cloudflare.com     1.088900
csail.mit.edu      1.054920

sbenthall commented 3 years ago

It looks like this may be because the gmail.com emails are actually dominated by a few individuals, making it a much more skewed distribution than google.com, which appears to spread responsibility for engaging the working group more evenly across its team. It's quite interesting, actually. Who is Martin Thompson?
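A small sketch of why a skewed domain scores lower. It assumes the 'domain entropy' is the Shannon entropy of per-sender message counts within a domain; the notebook is the authoritative definition, and the log base used there is not stated in this thread (natural log is used below). The function name `sender_entropy` and the counts are made up for illustration.

```python
import math


def sender_entropy(message_counts):
    """Shannon entropy (in nats) of the message distribution across senders."""
    total = sum(message_counts)
    if total == 0:
        return 0.0
    probs = [count / total for count in message_counts if count > 0]
    return -sum(p * math.log(p) for p in probs)


# Same number of senders and messages, different spread across senders.
skewed = [90, 5, 5]   # one person sends almost everything
even = [10, 10, 10]   # messages spread evenly across the team

print(sender_entropy(skewed))
print(sender_entropy(even))
```

With the same three senders, the even distribution reaches the maximum of ln(3), while the skewed one falls well below it. That is the pattern that would put a domain dominated by a few heavy posters below a domain whose traffic is spread across a team.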

sbenthall commented 3 years ago

This is the notebook where I've defined the 'domain entropy' metric and computed these results.

https://github.com/datactive/bigbang/blob/master/examples/organizations/Using%20Domain%20Entropy%20to%20Identify%20Organizations.ipynb

npdoty commented 3 years ago

> It looks like this may be because the gmail.com emails are actually dominated by a few individuals, making it a much more skewed distribution than google.com, which appears to spread responsibility for engaging the working group more evenly across its team. It's quite interesting, actually. Who is Martin Thompson?

Yeah, while gmail.com may tend to be lots of different people with different affiliations when considered across all of IETF, say, it could be dominated by prominent individuals in one particular working group. Martin is the editor of the HTTP/2 spec and works for Mozilla: https://lowentropy.net/about/

npdoty commented 3 years ago

I'm currently looking at three data sources to match up organizational affiliations to an email address:

IETF meeting attendance records (useful, though not exhaustive, at least from recent meetings)
GitHub profiles (mostly not useful so far)
RFC/drafts authored by the person, which list addresses and affiliations (promising, at least for the most significant contributors, but haven't coded it yet)

Christovis commented 2 years ago

I was wondering whether this might be helpful to filter out personal contributions, and this to find affiliations (as suggested during the IAB-AID workshop).

npdoty commented 2 years ago

At the 111 hackathon, I started using this repo that collects email providers into a CSV: https://github.com/edwin-zvs/email-providers (And I think there are going to be many slightly similar data collection projects in this category, but that one at least has been updated in the last year and has a process for adding more domains, etc.)

But yes, it definitely shouldn't be done manually!
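Roughly how such a list could be applied, as a sketch only. The CSV layout assumed here (one domain per row, in the first column) is an assumption and should be checked against the actual file in that repo; `load_provider_domains` and `is_generic_provider` are hypothetical names.

```python
import csv
import io


def load_provider_domains(csv_file):
    """Read a set of provider domains from a CSV whose first column is the domain."""
    return {
        row[0].strip().lower()
        for row in csv.reader(csv_file)
        if row and row[0].strip()
    }


def is_generic_provider(address, providers):
    """True if the address's domain is in the set of generic email providers."""
    domain = address.rpartition("@")[2].lower()
    return domain in providers


# Stand-in for the real CSV file: assumed layout, made-up contents.
sample_csv = io.StringIO("gmail.com\nhotmail.com\n")
providers = load_provider_domains(sample_csv)

print(is_generic_provider("someone@Gmail.com", providers))
print(is_generic_provider("someone@example.org", providers))
```

Addresses flagged this way carry no affiliation signal in the domain itself, so they would need one of the other sources (datatracker, authored documents).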

sbenthall commented 2 years ago

That gitdm tool seems to have designed a nice text based format for representing company/employer relations.

It's two kinds of files: the dev.txt files with entries of the form:

username: email, email, email organization from (start time) until (end time)

and then co.txt with entries:

company_name: username: email, email, email from (start_time) until (end time)

This seems pretty good.

In general, it seems like the data we need is a bipartite graph relating organizations and individuals, where each relation has a start time and an optional end time.
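One way to sketch that structure in Python, with each edge of the bipartite graph as a record. Everything here is hypothetical (the `Affiliation` class, the field names, the example people and organizations), and treating the end date as exclusive is an assumption, not something gitdm specifies in this thread.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Affiliation:
    """One edge in the person <-> organization graph."""
    person: str
    organization: str
    start: date
    end: Optional[date] = None  # None means the affiliation is ongoing

    def active_on(self, day: date) -> bool:
        # Assumption: the end date is exclusive.
        return self.start <= day and (self.end is None or day < self.end)


def organizations_for(person, day, affiliations):
    """Return the organizations a person was affiliated with on a given day."""
    return sorted(
        {a.organization for a in affiliations if a.person == person and a.active_on(day)}
    )


# Made-up records, for illustration only.
records = [
    Affiliation("alice", "Example Corp", date(2015, 1, 1), date(2019, 6, 1)),
    Affiliation("alice", "Example Labs", date(2019, 6, 1)),
]

print(organizations_for("alice", date(2018, 3, 1), records))
print(organizations_for("alice", date(2021, 3, 1), records))
```

Looking up affiliation by message date, rather than by person alone, is what lets the same sender count toward different organizations at different points in the archive.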

nllz commented 2 years ago

https://github.com/datactive/bigbang/pull/506/commits/54ef608e104693a942d2e8d03be915bb9e1eff2d

datactive / bigbang

identify institutional affiliation of mail senders #25