Sotera / newman

Quickly analyze and explore email with advanced analytics and visualization.
http://sotera.github.io/newman/
Apache License 2.0

Issues with Attachments #85

Closed by smahoney58 8 years ago

smahoney58 commented 8 years ago

Step through the following analysis to see all the issues.

  1. Select yipsusan.gmail.com dataset
  2. Select ic-world@myway.com email address
  3. Record results of Email tab for subject semiconductor purchase.
  4. Select Email Attachments tab and capture records for subject semiconductor purchase.
  5. As far as I can tell there are no real attachments. All the attach_0 and attach_1 files are just the email conversation in text format (attach_0) and html format (attach_1).
  6. Select each email in turn, from the first to the last. The 4th and 7th emails are corrupted in the email view pane.
  7. If you then go to the attachments tab and open attach_0 in Word, the email is not corrupted. If you save attach_1 as attach_1.html and then open it, the email is also not corrupted.
  8. The paired attach_0 and attach_1 have the same content.
  9. There are seven records listed in the Email tab and eight records in the Attachments tab. For this conversation alone, it looks like there should be:
     • 1 email sent from ic-world@myway.com to yipsusan@gmail.com
     • 1 reply from yipsusan@gmail.com to ic-world@myway.com
     • 1 reply back from ic-world@myway.com to yipsusan@gmail.com
     • 1 reply back from yipsusan@gmail.com to ic-world@myway.com
     • 1 reply back from ic-world@myway.com to yipsusan@gmail.com
     • 1 reply back from yipsusan@gmail.com to ic-world@myway.com
     • 1 last email from yipsusan@gmail.com to ic-world@myway.com

     This last email was different in that its attach_1 did not include the rolled-up conversation; it just had a nice table. Unfortunately, this email was shown as corrupted in the email view pane. The associated attach_1, saved in HTML format and opened, had a very nice table of the parts bought. Maybe it was a new email that just reused the Subject “Re: semiconductor purchase”.

Potential Issues:

  1. There probably shouldn’t be any attachments for this conversation. I'm not sure about the last attachment, though, since the email was corrupted. If it were somehow uncorrupted and the table listing the parts sold were included in the email body, then there should not be any attachments for this conversation.
  2. The corrupted emails are most likely not a Big 5 encoding problem. Possibly embedded tables?
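
The attach_0/attach_1 pairs described above look like the text/plain and text/html alternatives of the message body rather than real attachments. Newman's ingest code is not shown in this thread, so the following is only a minimal sketch, assuming Python's standard `email` package, of one way to tell body parts from real attachments. The function names `split_bodies_and_attachments` and `build_sample`, and the sample message itself, are hypothetical and for illustration only.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_bodies_and_attachments(raw: bytes):
    """Return (body content types, attachment names) for a raw message.

    Assumption: a text/plain or text/html leaf part with no filename and
    no 'attachment' disposition is a body alternative, not an attachment.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    bodies, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers such as multipart/alternative hold no content themselves
        disposition = part.get_content_disposition()  # 'attachment', 'inline', or None
        filename = part.get_filename()
        is_body = (
            part.get_content_type() in ("text/plain", "text/html")
            and disposition != "attachment"
            and filename is None
        )
        if is_body:
            bodies.append(part.get_content_type())
        else:
            attachments.append(filename or part.get_content_type())
    return bodies, attachments


def build_sample() -> bytes:
    """Build a made-up message: plain + HTML body and one PDF attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Re: sample thread"
    msg.set_content("Plain-text body")
    msg.add_alternative("<p>HTML body</p>", subtype="html")
    msg.add_attachment(
        b"%PDF-1.4 placeholder bytes",
        maintype="application",
        subtype="pdf",
        filename="parts.pdf",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    bodies, attachments = split_bodies_and_attachments(build_sample())
    print("bodies:", bodies)
    print("attachments:", attachments)
```

Under this rule, the two body alternatives would not be listed on the Attachments tab, and only the file with an explicit filename would be.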

Other issues seen with attachments:

• sales@empowerrf.com has four files with the generic indicator. Two of the four were identical and were either corrupted, encrypted, or in a different encoding. The other two were pdf files that should have resolved as file type pdf.
• The attachment results for the search phrase “A small request to Henry” show a corrupted Subject for three of the emails in this chain. If you view the last email in the chain, none of the Subject text is corrupted in the Email view pane; the Subject is corrupted only in the Email list. The second issue is the same as in the initial attachment analysis: the attach_0 and attach_1 attachments probably should not be shown as attachments. They contain the contents of the email (attach_0 in plain text and attach_1 in HTML). I doubt that the user wrote the email and then attached the email to itself.
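
For the pdf files shown with the generic indicator, one possible fallback is to check the first bytes of the file when the declared type is generic. This is a hedged sketch only, not how Newman resolves file types: the function name `sniff_file_type` and the default type string are assumptions. The signatures themselves are the standard leading bytes of PDF, PNG, and ZIP files.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
)


def sniff_file_type(data: bytes, declared: str = "application/octet-stream") -> str:
    """Return a type guessed from the leading bytes, else the declared type."""
    for magic, mime_type in SIGNATURES:
        if data.startswith(magic):
            return mime_type
    return declared  # unknown content: keep whatever the message declared


if __name__ == "__main__":
    print(sniff_file_type(b"%PDF-1.4\n..."))
    print(sniff_file_type(b"\x00\x01\x02"))
```

A file that matches none of the signatures keeps its declared type, so corrupted or encrypted files like the two identical ones above would still show as generic.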

smahoney58 commented 8 years ago

Closing issue. Tested this on Newman v2.1.1 with the Shiavo dataset. No longer seeing the attach_0 and attach_1 attachments. pdf attachments all seem to be shown as pdf. This issue had a lot of parts, with the main one being all the attach_0 and attach_1 files. Will re-open additional issues as new datasets are ingested. If the yipsusan dataset is ever re-ingested, I'll revisit this issue to verify that all the issues logged here are fixed.