deryk96 / pirates-of-monterey

An Analysis on Trends of Piracy in Major Shipping Lanes


Contents

File/Folder Overview
Background
Data Sources
Data Curation and Computation
Analysis
Lessons Learned
Future Work

Background

The resurgence of piracy in vital shipping lanes poses a significant threat to maritime traffic, necessitating a proactive approach to mitigating risk and ensuring the safety of vessels, cargo, and crew. Our project's main goal was to discern patterns and correlations within piracy incidents to support proactive risk-mitigation measures. By examining a ship's profile, including its location, country of affiliation, and vessel type, we aimed to identify indicators that increase the probability that a vessel is targeted for piracy. Stakeholders such as coastal authorities, shipping companies, and naval operations stand to benefit from the actionable insights derived from this analysis, which facilitate the protection of maritime interests and safe navigation through high-risk regions.

Data Sources

Data Curation and Computation

The curation process had several steps. We first used Natural Language Processing (NLP) models from the Python package spaCy to extract information from the free-text strings in the dirty IMO data set. From these strings, we sought to determine whether a ship was boarded, whether it was hijacked, and the consequences to the crew (hostages or assaults). We began with spaCy's token-based matching to build a rule-based model, tested it on training data (a portion of the data that we manually labeled as boarded, hijacked, etc.), and refined the rules based on the results. Extracting "boarded" labels required 17 rules, while "hijacked" labels required only two. When rule-based matching was applied to the full dataset, however, it performed poorly: there were far too many false-positive "boarded" labels to accept it as the solution. In the end, the superior approach was to train a custom span categorization NLP model to find and label the applicable portions of each sentence in the data set. Training took 1 hour and 11 minutes to iterate through the training data and converge on a statistical model that identified the labels accurately; applied to the full dataset of 8,556 strings, the model took 4.5 minutes to process and label every incident.

Next, we merged in the IMO Vessel Codes and COCOM Countries datasets to derive a country (ship flag) for each ship. Finally, we converted latitude and longitude into a usable format using regular expressions (regex) and a function that converts degrees/minutes/seconds to decimal degrees. We dropped any incidents with null or missing latitude or longitude, as we considered these crucial parameters for the analysis.
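The token-based matching step can be sketched with spaCy's `Matcher`. The patterns below are simplified illustrative stand-ins, not the project's actual 17 "boarded" rules, and the incident sentence is invented:

```python
# Illustrative sketch of spaCy token-based matching; these patterns are
# simplified stand-ins for the project's actual rules.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline needed
matcher = Matcher(nlp.vocab)

# Hypothetical patterns: match surface forms rather than lemmas, so the
# blank pipeline (which has no lemmatizer) suffices.
matcher.add("BOARDED", [
    [{"LOWER": {"IN": ["board", "boarded", "boarding"]}}],
    [{"LOWER": "gained"}, {"LOWER": "access"}],
])
matcher.add("HIJACKED", [
    [{"LOWER": {"REGEX": "^hijack(ed|ing)?$"}}],
])

doc = nlp("Pirates boarded the tanker and took the crew hostage.")
labels = {nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)}
print(labels)  # → {'BOARDED'}
```

A rule set like this is easy to audit but, as noted above, surface patterns overgenerate on a large corpus, which is what motivated the switch to a trained span categorizer.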
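The coordinate-conversion step might look like the following sketch. The input format (e.g. `03° 12' 30" N`) is an assumption about the raw strings, and `dms_to_decimal` is a hypothetical helper name:

```python
# Hypothetical reconstruction of the regex-based coordinate cleaning:
# parse a degrees/minutes/seconds string into signed decimal degrees.
import re

DMS_RE = re.compile(
    r"""(?P<deg>\d{1,3})\s*°\s*
        (?P<min>\d{1,2})\s*['′]\s*
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*)?   # seconds are optional
        (?P<hemi>[NSEW])""",
    re.VERBOSE,
)

def dms_to_decimal(text):
    """Return decimal degrees (negative for S/W), or None if unparseable."""
    m = DMS_RE.search(text)
    if m is None:
        return None  # mirrors the decision to drop rows with unusable coordinates
    value = (float(m["deg"])
             + float(m["min"]) / 60
             + float(m["sec"] or 0.0) / 3600)
    return -value if m["hemi"] in "SW" else value

print(dms_to_decimal("03° 12' 30\" N"))  # ≈ 3.2083
print(dms_to_decimal("045° 30' W"))      # → -45.5
```

Returning `None` for unparseable strings lets the cleaning pipeline filter those incidents out in one pass, consistent with dropping rows that lack usable latitude/longitude.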

Analysis

Our initial step was to visually inspect the data from each data frame (dirty and clean), using tools such as Folium maps and Matplotlib, Bokeh, Streamlit, and Seaborn plots. These visualizations guided our decisions about which factors to investigate more deeply. Observing a high concentration of incidents in specific locations, we focused our analysis on the three regions with the highest incident density: the Strait of Malacca, the Gulf of Aden, and the Gulf of Guinea. To split the data into these three groups, we defined latitude and longitude boundaries for each region of interest and applied them to both data frames, partitioning the data by geographic location. We then investigated whether the incidents in these areas exhibited higher severity, classifying each incident into one of four categories:
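The regional partitioning can be sketched as a set of bounding boxes. The coordinates below are rough illustrative boxes, not the project's actual boundaries, and `classify_region` is a hypothetical helper:

```python
# Illustrative bounding-box partitioning of incidents into the three
# high-density regions; the boxes are rough approximations, not the
# project's actual boundaries.
REGIONS = {
    #                     (lat_min, lat_max, lon_min, lon_max)
    "Strait of Malacca": (-1.0,  7.0,  95.0, 105.0),
    "Gulf of Aden":      (10.0, 16.0,  43.0,  52.0),
    "Gulf of Guinea":    (-5.0,  7.0,  -8.0,  10.0),
}

def classify_region(lat, lon):
    """Return the first region whose bounding box contains the point."""
    for name, (lat_min, lat_max, lon_min, lon_max) in REGIONS.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return "Other"

print(classify_region(3.0, 100.0))   # → Strait of Malacca
print(classify_region(40.0, -70.0))  # → Other
```

Applied with something like `df.apply(lambda r: classify_region(r["lat"], r["lon"]), axis=1)`, a function like this partitions both data frames by geography in a single pass.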

Lessons Learned

Future Work