datactive / bigbang

Scientific analysis of collaborative communities

http://datactive.github.io/bigbang/

MIT License

149 stars 52 forks source link

moving domaine entropy calculation from notebooks to library. fixes #529 #535

Closed sbenthall closed 2 years ago

sbenthall commented 2 years ago

Addresses #529

@Christovis among other things, this PR adds some functionality to the Archive class in bigbang.archive

It does this because it is caching some preprocessed variation of the data to make multiple retrieval less burdensome.

I wonder what you think of it and whether it it compatible with the new design proposed in #534

codecov-commenter commented 2 years ago

Codecov Report

Merging #535 (a371f1d) into main (359efa5) will decrease coverage by 0.50%. The diff coverage is 29.41%.

@@            Coverage Diff             @@
##             main     #535      +/-   ##
==========================================
- Coverage   74.83%   74.33%   -0.51%     
==========================================
  Files          22       22              
  Lines        3052     3086      +34     
==========================================
+ Hits         2284     2294      +10     
- Misses        768      792      +24

Flag	Coverage Δ
unittests	`74.33% <29.41%> (-0.51%)`	:arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
bigbang/analysis/utils.py	`72.05% <25.00%> (-25.67%)`	:arrow_down:
bigbang/archive.py	`69.58% <40.00%> (-1.43%)`	:arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 359efa5...a371f1d. Read the comment docs.

sbenthall commented 2 years ago

Latest commit addresses @Christovis 's comments.

I didn't work on the runtime because 2 * n is still O(n).

Christovis commented 2 years ago

I added a file containing samples of possible header address formats on which we can test the extract_email() and extract_domain() methods. If possible we could include these test in this PR too.

sbenthall commented 2 years ago

I've tried building an automated test suite for these functions based on this provided dataset.

As written, these functions will not pass tests based on that dataset.

In some cases, the extracted domain (with the utils function) disagrees in text case (ie. ALLCAPS.COM vs. lowercase.com) from the domain field in the test data.

I don't think we should count this as a failed case because email addresses are not case sensitive. So this would be a false negative.

It would be wiser, in my view, to normalize all email addresses to lower-case when they go through whatever minimal preprocessing we are using.

Also, the case where there are multiple email addresses in a single header is not going to be handled well by these functions yet.

As these are complexities that are important but somewhat orthogonal to the main thrust of this PR, I suggest we move them to a separate issue and merge this PR as is.