Sotera / newman

Quickly analyze and explore email with advanced analytics and visualization.
http://sotera.github.io/newman/
Apache License 2.0

Cannot ingest using GUI with Newman 2.1.3 #114

Closed abh3hu closed 6 years ago

abh3hu commented 7 years ago

I am trying to ingest .pst files from the Enron data set. I put the files in /home/vagrant/newman-ingester/TestPST/pst/TestPST1/.

I go to the GUI and click on "New Dataset..." but a window does not pop up. Upon inspecting the page, two errors appear:

  1. Uncaught ReferenceError: app_ingest_email is not defined
  2. Uncaught ReferenceError: newman_domain_email is not defined

newman_ingest_error

smahoney58 commented 7 years ago

The pst files have to go in /vagrant/newman-ingester/TestPST/pst/TestPST1/ (not /home/vagrant/...). Let me know if this fixes your problem. Another option is to try ingesting from the command line.

abh3hu commented 7 years ago

I tried ingesting from the command line. The datasets appear, but none of the widgets have updated.

newman no widgets

smahoney58 commented 7 years ago

Below are a few troubleshooting steps you can try. We will probably need the log files to determine the problem. Another option is to attach a pst file you're trying to ingest.

  1. Sometimes elasticsearch doesn't start properly. From the command line, run: sudo service elasticsearch status (if it's not running, run: sudo service elasticsearch restart).
  2. Depending on the size of the pst files (or mbox/emls), there may not be enough drive space available. From the command line, enter: df -h (this shows how much drive space has been used).
  3. The log files for ingest are stored at: /srv/software/newman/work_dir/. We will probably need at least one set of these (i.e. .ingester.log and .status.log) to determine the problem. You can cat the .status.log since it's small. grep ERROR .ingester.log also gives good information.
  4. You may need to re-install the VM from the start. Repeated installations can sometimes confuse VirtualBox so that it forwards the ports incorrectly (below is what you should be seeing after doing a vagrant up). If you're not seeing these ports, then exit, run vagrant halt, and delete the C:\Users\John.Doe\VirtualBoxVMs\.vagrant folder and the C:\Users\Scott.Mahoney\VirtualBoxVMs\vagrantfile. Then start from the beginning with vagrant init.

==> default: Forwarding ports...
    default: 80 (guest) => 80 (host) (adapter 1)
    default: 443 (guest) => 443 (host) (adapter 1)
    default: 8787 (guest) => 8787 (host) (adapter 1)
    default: 9200 (guest) => 9200 (host) (adapter 1)
    default: 4040 (guest) => 4040 (host) (adapter 1)
    default: 3000 (guest) => 3000 (host) (adapter 1)
    default: 5984 (guest) => 5984 (host) (adapter 1)
    default: 5601 (guest) => 5601 (host) (adapter 1)
    default: 5000 (guest) => 5000 (host) (adapter 1)
    default: 22 (guest) => 2222 (host) (adapter 1)
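To make the port comparison in step 4 less error-prone, here is a minimal Python sketch that extracts the forwarded ports from captured `vagrant up` output and reports which expected ports are missing. It is not part of Newman or Vagrant: the names `EXPECTED_PORTS`, `forwarded_ports`, and `compare_ports` are hypothetical, and the expected set is simply copied from the output quoted above.

```python
import re

# Expected (guest, host) pairs, copied from the `vagrant up` output quoted above.
EXPECTED_PORTS = {
    (80, 80), (443, 443), (8787, 8787), (9200, 9200), (4040, 4040),
    (3000, 3000), (5984, 5984), (5601, 5601), (5000, 5000), (22, 2222),
}

# Matches entries such as "8787 (guest) => 8787 (host)".
PORT_PATTERN = re.compile(r"(\d+) \(guest\) => (\d+) \(host\)")


def forwarded_ports(vagrant_output):
    """Return the set of (guest, host) port pairs found in `vagrant up` output."""
    return {(int(g), int(h)) for g, h in PORT_PATTERN.findall(vagrant_output)}


def compare_ports(vagrant_output, expected=EXPECTED_PORTS):
    """Return (missing, unexpected) port pairs relative to the expected set."""
    actual = forwarded_ports(vagrant_output)
    return sorted(expected - actual), sorted(actual - expected)
```

Paste the text printed by `vagrant up` into a string and pass it to `compare_ports`. A non-empty `missing` list suggests the port forwarding differs from the list above, which is the situation step 4 describes.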

abh3hu commented 7 years ago

I am trying to ingest a single PST file from the enron dataset. I have attached it.

swerzbin-m.zip

I am reinstalling the VM now

Here is my current status before reinstalling the VM:

  1. vagrant@vagrant-ubuntu-trusty-64:~$ sudo service elasticsearch status
     * elasticsearch is running
  2. I should have enough space for the pst file

vagrant@vagrant-ubuntu-trusty-64:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.5G   12K  2.5G   1% /dev
tmpfs           502M  368K  502M   1% /run
/dev/sda1        40G   26G   13G  68% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            2.5G     0  2.5G   0% /run/shm
none            100M     0  100M   0% /run/user

  3. I do not have any files in that folder

  4. Here are my ports:

$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 8787 (guest) => 8787 (host) (adapter 1)
    default: 1337 (guest) => 1337 (host) (adapter 1)
    default: 9200 (guest) => 9200 (host) (adapter 1)
    default: 5601 (guest) => 5601 (host) (adapter 1)
    default: 4040 (guest) => 4040 (host) (adapter 1)
    default: 5984 (guest) => 5984 (host) (adapter 1)
    default: 22 (guest) => 2222 (host) (adapter 1)

smahoney58 commented 7 years ago

I'll try ingesting the pst file you zipped. The problem could be one of two things now.

First, the ENRON data is old and doesn't follow the email format standard very well; also, most of the ENRON data I found publicly has been scrubbed, with a lot of added EDRM statements. You could follow the steps to create a personal pst file and try ingesting it. I'll also try ingesting the attached pst to rule this situation out.

Second, I just ran into an issue very similar to this. I could start the ingest and the ingest name would show up, but there were no contents. On my system, for whatever reason, docker was inactive. After you ssh into the VM, use the command "sudo service docker status" to see if it's inactive. Use "sudo service docker start" to make it active.

smahoney58 commented 7 years ago

I started from scratch (i.e. downloaded newman-vm-v2.1.3.box from the link). I then followed the steps in the Quick Start guide (http://sotera.github.io/newman/quick-start/). I copied the specific Enron pst file you posted to the correct location, C:\Users\jsmith\VirtualBoxVMs\newman-ingester\enron\pst\enrontest\swerzbin-m.pst (note: the path may be slightly different on your machine depending on what you called the case and label). I then used the GUI to ingest the data. This added 341 emails to the Newman application.

So, it's not the file or the VM. That leaves memory/space issues (which, from your post above, don't seem to be the problem) or a Vagrantfile corrupted by multiple attempts. You can delete or rename the existing Vagrantfile and .vagrant folder and regenerate them (i.e. vagrant init newman-vm-v2.1.3 newman-vm-v2.1.3.box, vagrant up, vagrant ssh, tangelo restart). Let me know if that works.

abh3hu commented 7 years ago

I was able to ingest the graph. Deleting the VM, Vagrantfile, and .vagrant folder helped.

A small bug that happened during ingestion is that the VM would pause several times. I would need to unpause the VM for the ingestion to continue.

I clicked on a topic to see the list of Emails, but I do not see a graph.

newman no graph

smahoney58 commented 7 years ago

I'm seeing the same thing on my system. It looks like when this data was scrubbed to remove PII and add the EDRM messages, some important email formatting was deleted. Many of the From (sender) fields are shown as mike.swerzbin@enron.commike.swerzbin@enron.com. The semicolon separator is missing. I'll need to look deeper into the data to see whether it's actually missing or our ingest process dropped the separator. With the From field incorrect on most emails, the graph doesn't get built. If you want to see a graph, select the dataset, then Accounts ranked, and then the account mike.swerzbin@enron.com.
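For anyone who wants to check their own data for this symptom, here is a minimal Python sketch that flags a From value consisting of one address repeated back-to-back with no separator. The function name `split_doubled_address` is hypothetical, and this is not how Newman's ingester parses senders; it only detects the exact pattern described above and cannot tell whether the separator was missing in the source data or dropped during ingest.

```python
def split_doubled_address(sender):
    """If `sender` is one address repeated back-to-back, return it once; else None."""
    sender = sender.strip()
    half, remainder = divmod(len(sender), 2)
    # A doubled string must have even, non-zero length.
    if remainder or half == 0:
        return None
    first, second = sender[:half], sender[half:]
    # Both halves must be identical and look like a single address.
    if first == second and first.count("@") == 1:
        return first
    return None
```

Calling it on the value from this thread, `split_doubled_address("mike.swerzbin@enron.commike.swerzbin@enron.com")`, returns the single address, while a normal From value returns `None`.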


The ENRON dataset psts have been the most buggy for us to ingest. Having said that, I've ingested dozens of them with better results than this particular pst file. If you create your own or use more recent pst files, you will see a lot better results. We also handle email formats for mbox and emls.

I have never seen my VM just pause. I googled 'why does virtualbox vm pause' and most of the articles hint that it's a memory issue (either too much memory allocated to the VM or other GUI-type applications causing a resource conflict).

abh3hu commented 7 years ago

Do you have a dataset that I can download and test with, such as the Schiavo that was preloaded with newman-vm2.1.1? I would like to present an interesting dataset to other developers to show the value of the Newman Project

smahoney58 commented 7 years ago

Schiavo isn't all that good either. Some people got real excited on the "Right to Life" issue and most of the networks were real shallow. I like using the Jeb Bush dataset. There is both an mbox file and a set of emls. There's a lot of attachments, some with exif/geolocation information. Some of the better search terms include education, hurricane, and money. There are even a couple of Spanish emails where you can show translation capability. I've attached a zip file on my Dropbox. https://www.dropbox.com/s/folphh5172tmf54/jeb%40jeb.org_modified.zip?dl=0

smahoney58 commented 6 years ago

Closing issue - test email set delivered.