datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

Refine reading of LISTSERV 16.5 mailing lists #478

Open Christovis opened 3 years ago

Christovis commented 3 years ago

A message within a LISTSERV 16.5 mailing list has a header similar to:

From MAILER-DAEMON Wed Jun  2 22:45:36 2021
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
subject: EoM: TDoc List Update
from: blabla
reply-to: blabla
date: Fri, 28 May 2021 00:11:37 +0000
Content-Type: text/plain; charset="utf-8"; Content-Type="multipart/alternative"
Message-ID: wa.exe?A2=ind2105D&L=3GPP_TSG_SA_WG4&O=D&P=5708
Archived-At: <https://list.etsi.org/scripts/wa.exe?A2=ind2105D&L=3GPP_TSG_SA_WG4&O=D&P=5708>

A message with such a header can, however, contain nested messages from the reply chain. These nested messages can have a header of the form:

From: 3gpp_tsg_sa_wg4: tsg sa codec <3GPP_TSG_SA_WG4@LIST.ETSI.ORG> On Behalf=
 Of blabla
Sent: Thursday, May 27, 2021 9:55 PM
To: 3GPP_TSG_SA_WG4 <3GPP_TSG_SA_WG4@list.etsi.org>
Subject: TDoc List: Update

These nested messages are not captured when reading the .mbox file with mailbox.mbox(filepath, create=False).
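To make the behaviour reproducible, here is a minimal sketch that writes a one-message .mbox file modelled on the headers above and reads it back. The sample text, the helper names `write_sample_mbox` and `read_bodies`, and the simplified Content-Type header are assumptions for illustration, not BigBang code.

```python
import mailbox
import os
import tempfile

# Hypothetical sample modelled on the headers quoted in this issue.
# The Content-Type header is simplified to plain text/plain.
RAW = (
    "From MAILER-DAEMON Wed Jun  2 22:45:36 2021\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "MIME-Version: 1.0\n"
    "subject: EoM: TDoc List Update\n"
    "from: blabla\n"
    "date: Fri, 28 May 2021 00:11:37 +0000\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Thanks, see below.\n"
    "\n"
    "From: 3gpp_tsg_sa_wg4: tsg sa codec <3GPP_TSG_SA_WG4@LIST.ETSI.ORG> On Behalf=\n"
    " Of blabla\n"
    "Sent: Thursday, May 27, 2021 9:55 PM\n"
    "To: 3GPP_TSG_SA_WG4 <3GPP_TSG_SA_WG4@list.etsi.org>\n"
    "Subject: TDoc List: Update\n"
    "\n"
    "Original text.\n"
)


def write_sample_mbox(path):
    """Write the sample message to `path` in mbox format."""
    with open(path, "wb") as handle:
        handle.write(RAW.encode("utf-8"))


def read_bodies(path):
    """Return the decoded body of every top-level message in the mbox."""
    box = mailbox.mbox(path, create=False)
    try:
        return [
            # decode=True undoes the quoted-printable transfer encoding
            msg.get_payload(decode=True).decode("utf-8", errors="replace")
            for msg in box
        ]
    finally:
        box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = os.path.join(tmp, "sample.mbox")
        write_sample_mbox(sample_path)
        bodies = read_bodies(sample_path)
    print(len(bodies), "top-level message(s) found")
    print(bodies[0])
```

The mailbox yields a single message, and the nested `From:`/`Sent:`/`To:`/`Subject:` block only shows up as plain text inside that message's decoded body.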

Thus we need to think about a way to capture these nested messages.
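One possible direction, sketched under assumptions: after decoding the body of each top-level message, split it on header blocks of the `From:`/`Sent:`/`To:`/`Subject:` form shown above. The pattern `NESTED_HEADER` and the function `split_nested` are hypothetical names. The sketch assumes the fields always appear in this order, with an optional `Cc:` line, so quoting styles from other mail clients would not be matched.

```python
import re

# Assumed layout of a nested header block, taken from the example in this
# issue: From / Sent / To / optional Cc / Subject on consecutive lines.
NESTED_HEADER = re.compile(
    r"^From:[ \t]*(?P<sender>.+)\n"
    r"Sent:[ \t]*(?P<sent>.+)\n"
    r"To:[ \t]*(?P<to>.+)\n"
    r"(?:Cc:[ \t]*.+\n)?"
    r"Subject:[ \t]*(?P<subject>.*)$",
    re.MULTILINE,
)


def split_nested(body):
    """Split a decoded body into its top-level text and nested messages.

    Returns (top_text, nested), where nested is a list of dicts with the
    keys 'from', 'sent', 'to', 'subject' and 'body'.
    """
    body = body.replace("\r\n", "\n")
    matches = list(NESTED_HEADER.finditer(body))
    nested = []
    for index, match in enumerate(matches):
        # A nested message runs until the next header block or the end.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(body)
        nested.append(
            {
                "from": match["sender"].strip(),
                "sent": match["sent"].strip(),
                "to": match["to"].strip(),
                "subject": match["subject"].strip(),
                "body": body[match.end():end].strip(),
            }
        )
    top_text = body[: matches[0].start()] if matches else body
    return top_text.strip(), nested


if __name__ == "__main__":
    example = (
        "Thanks, see below.\n\n"
        "From: 3gpp_tsg_sa_wg4: tsg sa codec <3GPP_TSG_SA_WG4@LIST.ETSI.ORG>"
        " On Behalf Of blabla\n"
        "Sent: Thursday, May 27, 2021 9:55 PM\n"
        "To: 3GPP_TSG_SA_WG4 <3GPP_TSG_SA_WG4@list.etsi.org>\n"
        "Subject: TDoc List: Update\n\n"
        "Original text.\n"
    )
    top, parts = split_nested(example)
    print(top)
    for part in parts:
        print(part["sent"], "|", part["subject"], "|", part["body"])
```

The body has to be decoded first (for example with `get_payload(decode=True)`), because quoted-printable soft line breaks such as `On Behalf=` would otherwise split a header line in two and the pattern would not match.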