laurenmarietta closed this issue 1 year ago
Thanks for filing this issue!
I haven't worked much with the W3C ingest scripts before.
I was able to confirm that the change you mention makes the W3C collection work better, and have committed it: 6a0ec35
I've also updated the docs to fix the example W3C ingest script: https://github.com/datactive/bigbang/blob/main/docs/datasets/mailinglists.rst
However, the W3C records that I've been able to ingest are missing the email body, and I haven't yet figured out how to retrieve them.
Can you please confirm that you can get past this issue with the main branch, close this issue, and file any other problems or requests in separate issues?
Thanks for kicking the tires.
Actually, confirmed both that it works better (passing more tests) and that the missing body issue is new.
I was trying to run `collect-mail` on the provided list of W3C mailing lists and was confused as to why almost all of the URLs weren't returning files.

It seems that, sometime between 2020 and now, W3C has subtly changed the HTML for its mailing list archives. From this Internet Archive page, they used to be listed in a `div` with `class=messages-list`, but today's version of the same URL lists them within a `main` tag under `class=messages-list`. Because the `W3CMailList` class explicitly looks for email records under `div.messages-list`, the scans are coming up empty.

Not sure if it's as simple a fix as replacing `div` with `main` or if there are more complicated things to consider here!
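
For what it's worth, one way to sidestep the `div` vs. `main` question might be to match on the `messages-list` class alone and ignore the tag name. Below is a minimal sketch of that idea using only the Python standard library. It is not BigBang's actual parsing code: `MessageLinkParser`, `message_links`, and the two sample markup strings are hypothetical names and data made up for illustration, and the real archive pages are certainly more complex than these snippets.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MessageLinkParser(HTMLParser):
    """Collect hrefs found inside any element carrying a given class.

    Hypothetical helper: it matches on the class name only, so it does not
    care whether the container is a <div> or a <main>.
    """

    def __init__(self, container_class="messages-list"):
        super().__init__()
        self.container_class = container_class
        self.depth = 0  # 0 means we are outside the container
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth == 0:
            # Enter the container on a class match, whatever the tag is.
            classes = (attrs.get("class") or "").split()
            if self.container_class in classes:
                self.depth = 1
            return
        self.depth += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def message_links(html):
    """Return the hrefs nested under the first-matching container(s)."""
    parser = MessageLinkParser()
    parser.feed(html)
    return parser.links


# Made-up markup imitating the old and new container tags.
OLD_MARKUP = ('<div class="messages-list"><ul>'
              '<li><a href="0001.html">Hi</a></li></ul></div>')
NEW_MARKUP = ('<main class="messages-list"><ul>'
              '<li><a href="0001.html">Hi</a></li></ul></main>')

print(message_links(OLD_MARKUP))
print(message_links(NEW_MARKUP))
```

Both calls should find the same link, since only the class is checked. One caveat: the depth counting assumes every non-void tag is explicitly closed, so pages that omit closing tags (for example `</li>`) would need a more forgiving parser. If the project already uses an HTML library with CSS selectors, a class-only selector such as `.messages-list` would presumably express the same thing more compactly.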