datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

add name/email/affiliation timed relation extraction to notebook, see … #536

Closed sbenthall closed 2 years ago

sbenthall commented 2 years ago

This PR addresses #25 and #367

Currently just an implementation in a notebook, but it can be moved to the library if it looks more generally useful.

The main insight here is with respect to the data format for this relational data as per #367

The IETF attendance data is mined for three tables, with the following columns:

Table B can then be used directly to solve issue #25: for each email/timestamp in the email data, look up the email in table B for its affiliation. If there is more than one row, find the row with the best matching time duration.

[Writing this matching script is another TODO for this PR]
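
As a rough sketch of what that lookup might look like (not the notebook's implementation): the row layout `AffiliationSpan` and its field names `email`, `affiliation`, `min_date`, and `max_date` are assumptions made for illustration, and "best matching" is interpreted here as the row whose date span contains the timestamp or, failing that, lies closest to it.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class AffiliationSpan:
    # One row of Table B. Field names are assumed for this sketch.
    email: str
    affiliation: str
    min_date: date
    max_date: date


def gap(span: AffiliationSpan, when: date) -> timedelta:
    """Distance from `when` to the span: zero inside it, else to the nearer end."""
    if when < span.min_date:
        return span.min_date - when
    if when > span.max_date:
        return when - span.max_date
    return timedelta(0)


def lookup_affiliation(
    table_b: List[AffiliationSpan], email: str, when: date
) -> Optional[str]:
    """Return the affiliation for `email` whose span best matches `when`."""
    rows = [row for row in table_b if row.email == email]
    if not rows:
        return None  # email never seen in the attendance data
    return min(rows, key=lambda row: gap(row, when)).affiliation


# Made-up example rows, only to exercise the function.
table_b = [
    AffiliationSpan("a@example.org", "Acme", date(2015, 1, 1), date(2016, 12, 31)),
    AffiliationSpan("a@example.org", "Globex", date(2018, 1, 1), date(2019, 12, 31)),
    AffiliationSpan("b@example.org", "Initech", date(2017, 1, 1), date(2017, 12, 31)),
]
```

With this interpretation, an email sent between two known spans is assigned the affiliation of whichever span ends or starts nearest to it; a stricter variant could return `None` when the gap exceeds some threshold.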

Of course, a similar analysis could be done using (A) and matching on full names, though this is likely to have more noise than the free text name entry. The other data is extracted for completeness and to support further downstream entity resolution on names, email addresses, and affiliations.

If this approach gets approval, then there are a few remaining issues for this PR:

codecov-commenter commented 2 years ago

Codecov Report

Merging #536 (915e0bc) into main (359efa5) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #536   +/-   ##
=======================================
  Coverage   74.83%   74.83%           
=======================================
  Files          22       22           
  Lines        3052     3052           
=======================================
  Hits         2284     2284           
  Misses        768      768           
Flag Coverage Δ
unittests 74.83% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 359efa5...915e0bc.

Christovis commented 2 years ago

I think this is great work and we should follow this approach. The implemented inference, which extends the available data points to dates that are not explicitly represented in the data, is fine for now. It seems to follow the same method as described in Justus Baron & Olia Kanevskaia 2021. We can improve it once we see there is a need to do so...

sbenthall commented 2 years ago

I'm not sure how to go about automated tests for functionality that wraps the IETF DataTracker, as that pings an external service and can run rather slowly.
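
One option, sketched here under assumptions, would be to keep the network call behind a small wrapper and swap it for a stub in tests using the standard library's `unittest.mock`. The names `AttendanceSource`, `fetch`, and `affiliations_for` are hypothetical stand-ins, not part of BigBang or the DataTracker tool.

```python
from unittest import mock


class AttendanceSource:
    """Hypothetical wrapper around the code that calls the external service."""

    def fetch(self, meeting):
        # In real code this would query the remote service.
        raise RuntimeError("would hit the network")


def affiliations_for(source, meeting):
    """Map each attendee email to its affiliation for one meeting."""
    return {row["email"]: row["affiliation"] for row in source.fetch(meeting)}


def test_affiliations_for_uses_canned_rows():
    # The autospec'd stub has the same method signature but never touches the network.
    source = mock.create_autospec(AttendanceSource, instance=True)
    source.fetch.return_value = [
        {"email": "a@example.org", "affiliation": "Acme"},
    ]

    result = affiliations_for(source, 110)

    assert result == {"a@example.org": "Acme"}
    source.fetch.assert_called_once_with(110)
    return result
```

This only checks the logic layered on top of the service. A slower test that actually calls the service could still be kept separately and run less often.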

sbenthall commented 2 years ago

Tests are passing. As I mentioned, I'm not sure of the best way to proceed with testing of functionality that wraps the Glasgow IETF datatracker tool...

The notebook shows how to use the IETF attendance data to fill in affiliations for an archive. It has some functions in it that could potentially be moved to the library.

For now, I'll request a review of this current work.

sbenthall commented 2 years ago

Automated test for merging affiliation into Archive data is in. Request review from @Christovis