As a developer on the Fedora mailing list analysis project, I want to develop a web scraping tool that is consistent with the available mailing list archive. The OSPO team provided me with the data in the Mailman format, and I built a web scraping tool around it. As per the comment in operate-first/continuous-deployment#38-comment, I want to determine whether that format is consistent with how the emails are currently archived.
UPDATE: Analysis of the data showed that the OSPO team's initial scrape from the archives into the originally provided CSV corrupted the format of the text. The resulting inconsistency in format made it very difficult to distinguish thread text from user text. Now, the correct way to scrape this data
Acceptance Criteria:
[x] Investigate/discuss with the OSPO team on
[x] the correct format of email archives (Mailman vs. HyperKitty) and the reason for providing one vs. the other
[x] Determine if this is the cause of the inconsistent format of email text
[x] Research and write a script to scrape the data where the thread text is not included in the email body
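
For illustration only, here is a minimal sketch of what separating user text from thread text could look like once the body is available as clean plain text. It is not the project's script. The function name `strip_thread_text` is hypothetical, and it assumes quoted thread text follows the common plain-text convention: lines prefixed with `>` and an "On ... wrote:" attribution line. Bodies from the actual archives may not follow this convention.

```python
import re

# Assumption: an attribution line looks like "On <date>, <name> wrote:".
# Real archives may use other wordings, so this pattern is only illustrative.
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_thread_text(body: str) -> str:
    """Return only the lines the sender wrote, dropping quoted thread text.

    Hypothetical helper: assumes quoted lines start with ">".
    """
    kept = []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            continue  # quoted text from an earlier message in the thread
        if ATTRIBUTION.match(stripped):
            continue  # the line introducing the quoted text
        kept.append(line)
    return "\n".join(kept).strip()


# Made-up sample body, not taken from the Fedora archives.
example = """Thanks, that fixed it.

On Mon, 1 Jan 2021, Jane Doe wrote:
> Did you try rebuilding the package?
> It worked for me.

I'll close the bug."""

print(strip_thread_text(example))
```

A line-based filter like this only works when the quoting markers survive scraping, which is why a corrupted text format makes the distinction so hard.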