datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License
149 stars 52 forks source link

Idea on abstraction of scraping classes #534

Closed Christovis closed 2 years ago

Christovis commented 2 years ago

This PR focuses on the scraping of mailing archives from W3C (bigbang/w3c.py) and Listserv (bigbang/listserv.py). As they have multiple methods in common, they can be abstracted out into a new bigbang/abstract.py file containing AbstractMessageParser, AbstractList, and AbstractArchive.

As a consequence:

This is part of the conversation in #512.

codecov-commenter commented 2 years ago

Codecov Report

Merging #534 (3ebffc7) into main (2c41b5c) will increase coverage by 0.76%. The diff coverage is 84.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #534      +/-   ##
==========================================
+ Coverage   74.24%   75.00%   +0.76%     
==========================================
  Files          22       27       +5     
  Lines        3075     3305     +230     
==========================================
+ Hits         2283     2479     +196     
- Misses        792      826      +34     
Flag Coverage Δ
unittests 75.00% <84.04%> (+0.76%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tests/unit/test_bigbang.py 100.00% <ø> (+2.11%) :arrow_up:
bigbang/ingress/mailman.py 61.61% <66.66%> (+0.19%) :arrow_up:
bigbang/ingress/utils.py 70.51% <70.51%> (ø)
bigbang/ingress/w3c.py 74.27% <74.27%> (ø)
bigbang/bigbang_io.py 82.11% <84.37%> (-0.29%) :arrow_down:
bigbang/ingress/abstract.py 85.55% <85.55%> (ø)
bigbang/ingress/listserv.py 83.28% <87.50%> (ø)
tests/ingress/test_listserv.py 90.62% <88.88%> (ø)
bigbang/analysis/listserv.py 63.71% <90.90%> (+1.01%) :arrow_up:
bigbang/__init__.py 100.00% <100.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2c41b5c...3ebffc7. Read the comment docs.

sbenthall commented 2 years ago

Just skimming for now -- this looks like fantastic work, @Christovis . Thank you so much !

Since @npdoty wrote the original W3C scraper code and has used it in his own work, I'd suggest he review this.

sbenthall commented 2 years ago

Please add the ingress directory for the scraping code in this PR

sbenthall commented 2 years ago

@Christovis Please see inline comments for my review.

Excellent work! I had a few minor recommendations, mainly around naming and documentation.

sbenthall commented 2 years ago

I've engaged on a couple remaining points:

Happy to jump on a call to work this out if that's helpful.

Christovis commented 2 years ago

This is going to be a great PR! I have addressed the issues with words -- let me know what you think. If it's ok, then I will just polish it off a bit more and merge.

sbenthall commented 2 years ago

Yes, excellent! Thank you, this is awesome. Please feel free to merge when you like.