datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

W3C mailman ingress fails due to HTML change #600

Closed laurenmarietta closed 1 year ago

laurenmarietta commented 1 year ago

I was trying to run collect-mail on the provided list of W3C mailing lists and was confused as to why almost all of the URLs weren't returning files.

It seems that, sometime between 2020 and now, W3C has subtly changed the HTML for its mailing list archives. From this Internet Archive page, they used to be listed in a div with class=messages-list, but today's version of the same URL lists them within a main tag under class=messages-list. Because the W3CMailList class explicitly looks for email records under div.messages-list, the scans are coming up empty.

Not sure if it's as simple a fix as replacing div with main or if there are more complicated things to consider here!
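
To illustrate the mismatch, here is a minimal standard-library sketch that matches the container by class alone, so it finds the message links whether the wrapper is a div or a main tag. This is not BigBang's actual W3CMailList code. The names MessageLinkParser and message_links and the two HTML fragments are hypothetical, made up for illustration, and the real archive markup may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MessageLinkParser(HTMLParser):
    """Collect hrefs inside any element whose class list includes a given class.

    Hypothetical helper; assumes the container's children are properly nested
    with explicit closing tags.
    """

    def __init__(self, container_class="messages-list"):
        super().__init__()
        self.container_class = container_class
        self.depth = 0  # > 0 while we are inside the container element
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth > 0:
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif self.container_class in classes:
            # Match on the class only, so <div> and <main> both qualify.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1


def message_links(html):
    """Return the hrefs found inside the messages-list container."""
    parser = MessageLinkParser()
    parser.feed(html)
    return parser.links


# Made-up fragments standing in for the old and new archive layouts.
OLD_HTML = ('<div class="messages-list"><ul><li>'
            '<a href="0001.html">Re: test</a></li></ul></div>'
            '<a href="other.html">elsewhere</a>')
NEW_HTML = ('<main class="messages-list"><ul><li>'
            '<a href="0001.html">Re: test</a></li></ul></main>'
            '<a href="other.html">elsewhere</a>')

print(message_links(OLD_HTML))
print(message_links(NEW_HTML))
```

A selector tied to the tag name breaks whenever the wrapper element changes. Matching on the class alone survives that kind of change, at the cost of being slightly less specific.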

sbenthall commented 1 year ago

Thanks for filing this issue!

sbenthall commented 1 year ago

I haven't worked much with the W3C ingress scripts before.

I was able to confirm that the change you mention makes the W3C collection work better, and have committed it: 6a0ec35

I've also updated the docs to fix the example W3C ingest script: https://github.com/datactive/bigbang/blob/main/docs/datasets/mailinglists.rst

However, the W3C records that I've been able to ingest are missing the email body, and I haven't yet figured out how to retrieve them.

Can you please confirm that you can get past this issue with the main branch, close this issue, and file any other problems or requests in separate issues?

Thanks for kicking the tires.

sbenthall commented 1 year ago

Actually, I've confirmed both that it works better (passing more tests) and that the missing body issue is new.