Place PST files in pst-extract/pst/

bin/explode_psts.sh
- runs readpst to convert PST to mbox

bin/normalize_mbox.sh
- converts mbox files to JSON

bin/run_spark_tika.sh
- uses Tika to extract the text of attachments

bin/run_tika_content_join.sh
- joins attachment text with the email JSON

bin/run_spark_content_split.sh
- removes base64-encoded attachments from the email JSON and puts that JSON into a separate directory

bin/run_spark_emailaddr.sh
- email address extraction and community assignment

bin/run_spark_email_community_assign.sh
- assigns communities to the email JSON objects

bin/run_spark_topic_clustering.sh
- assigns topic clusters to the email JSON objects output by community assign

bin/run_spark_mitie.sh
- runs MITIE to generate entities for each email and adds them to the email JSON generated by topic clustering

bin/run_spark_es_ingest_emailaddr.sh
- ingests email addresses into the ES index

bin/run_spark_es_ingest_attachments.sh
- ingests attachments into the ES index

bin/run_spark_es_ingest_emails.sh
- ingests emails with entities into the ES index

Location Extraction
Locations extracted from text

bin/build_clavin_index.sh
- set up the location index (only needs to be run once)

bin/run_location_extract.sh
- extracts locations from the text body
- uses input from the bin/run_spark_content_split task

Locations extracted by IP

bin/setup_geo2ip.sh
- set up the geoip index

bin/run_spark_originating_location.sh
- extracts location from IP address
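
The core steps listed above (from bin/explode_psts.sh through bin/run_spark_es_ingest_emails.sh) can be chained in a small driver script. The following is only a sketch under assumptions not stated in this document: that each script takes no arguments, that it is run from the repository root, and that the listed order is the required run order. The names PIPELINE_STEPS, run_pipeline and DRY_RUN are hypothetical and do not exist in the repository. The location extraction scripts are left out because their position relative to the ingest steps is not specified here.

```shell
#!/bin/sh
# Hypothetical driver: runs the core pipeline scripts in the order they are
# listed in this document. Assumes no arguments and the repo root as cwd.

PIPELINE_STEPS="
bin/explode_psts.sh
bin/normalize_mbox.sh
bin/run_spark_tika.sh
bin/run_tika_content_join.sh
bin/run_spark_content_split.sh
bin/run_spark_emailaddr.sh
bin/run_spark_email_community_assign.sh
bin/run_spark_topic_clustering.sh
bin/run_spark_mitie.sh
bin/run_spark_es_ingest_emailaddr.sh
bin/run_spark_es_ingest_attachments.sh
bin/run_spark_es_ingest_emails.sh
"

run_pipeline() {
  # DRY_RUN (hypothetical switch) defaults to 1, so nothing is executed
  # unless the caller explicitly sets DRY_RUN=0.
  for step in $PIPELINE_STEPS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: $step"
    else
      echo "running: $step"
      # Stop at the first failing step so later steps never see partial input.
      "$step" || { echo "failed: $step" >&2; return 1; }
    fi
  done
}

run_pipeline
```

Saved under a name of your choice, the script only prints the planned order by default; setting DRY_RUN=0 in the environment executes the steps and stops at the first one that fails.
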
This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.