IanMayo opened this issue 2 years ago
@IanMayo Here are some initial demos of a very simple notebook interface:
There are loads of problems with this interface, but it's just an idea of what is possible with just a few lines of code. I'll put up a PR shortly so you can see the actual notebook code, and then I'll move on to some of the other stuff we wanted to demo.
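To give a flavour of what "a few lines of code" means here, below is a minimal sketch of that kind of notebook interface, not the actual PR code: an ipywidgets dropdown for picking a platform that re-plots its points on selection. The DataFrame contents, column names, and platform names are all illustrative assumptions; in the real notebook the data would come from the Pepys database.

```python
import matplotlib.pyplot as plt
import pandas as pd
from ipywidgets import interact

# Hypothetical State data; in the real notebook this would be queried
# from the Pepys database rather than hard-coded.
states_df = pd.DataFrame({
    "platform": ["A", "A", "A", "B", "B", "B"],
    "longitude": [-1.00, -1.01, -1.02, -1.30, -1.31, -1.32],
    "latitude": [50.00, 50.01, 50.02, 50.20, 50.21, 50.22],
})

def plot_platform(platform):
    # Scatter-plot the selected platform's positions.
    subset = states_df[states_df["platform"] == platform]
    subset.plot.scatter(x="longitude", y="latitude", title=platform)
    plt.show()

# Dropdown of platform names; the plot refreshes on selection.
interact(plot_platform, platform=sorted(states_df["platform"].unique()))
```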
See #1096 for a PR including this notebook code. I've also included some static and interactive plots of other variables.
Notably, at the moment we have to work around pandas' incompatibility with SQLAlchemy 2.0. The SQLAlchemy engine that we create in the Pepys DataStore won't work with pandas, because we create it with future=True (to use the new SQLAlchemy 1.4 features and deprecations, making the code ready for 2.0). There is a pandas issue for adding SQLAlchemy 2.0 support (see https://github.com/pandas-dev/pandas/issues/40460), which seems to be stalled for lack of volunteers with the relevant experience; that might be something we could contribute to if you were interested.
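One possible workaround, sketched below purely as an illustration rather than the actual Pepys code: create a second, plain 1.x-style engine (without future=True) just for pandas queries. The connection string, table name, and column layout are assumptions for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; in practice this would come from the
# Pepys configuration rather than being hard-coded.
db_url = "postgresql://user:password@localhost:5432/pepys"

# Plain 1.x-style engine (no future=True), which pandas can still work with.
legacy_engine = create_engine(db_url)

# Hypothetical table name, purely for illustration.
states_df = pd.read_sql('SELECT * FROM "States" LIMIT 1000', legacy_engine)
print(states_df.head())
```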
Ah yes, one more thing:
Do you have any really good, realistic (ideally actually real, but not sensitive) data that I could use for playing around with developing analysis capabilities in Jupyter? Part of the reason I built the UI for selecting a platform and plotting the points was so that I could see whether I could find a realistic-looking track - a lot of the data on TracStor is obviously test data. The best I found was this HIPP platform, but it hasn't got a massive amount of data (only ~350 data points). If I were to start running scikit-learn models on the data, I'd ideally like something fairly realistic and reasonably large. Any ideas?
Aah, @robintw - from the depths of my memory I remembered where I'd seen a sample dataset; it's in the CSV files here: https://www.gov.uk/government/news/dstl-shares-new-open-source-framework-initiative
Some tracks appeared to have up to 3k points.
Obvs you'll either have to produce a parser to get the data into Pepys, or do some Excel column fiddling to make it look like an existing format which we parse. The "unknown platform" handling will be great for this data :-D
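For the "column fiddling" route, something like the pandas sketch below might be enough to get started. Both the source column names and the target column names are assumptions here; they would need to match the real CSV files and whichever existing Pepys format is being mimicked.

```python
import pandas as pd

# Hypothetical input file and source column names.
raw = pd.read_csv("sample_tracks.csv")

# Rename and reorder columns so the file resembles a format Pepys already
# parses (the target column names are also assumptions).
reshaped = raw.rename(columns={
    "DateTime": "time",
    "Vessel": "platform",
    "Lat": "latitude",
    "Lon": "longitude",
})[["time", "platform", "latitude", "longitude"]]

reshaped.to_csv("tracks_for_pepys.csv", index=False)
```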
Here's another source of AIS data @Robin - it's a huge dataset, hopefully they're long tracks rather than just lots of small ones. https://marinecadastre.gov/ais/
@robintw - the analysts have come up with a useful analysis task (above) to "drive" the technical demonstrator. I'm happy to either expand the terms or rephrase the description as necessary for you to understand/implement it.
Thanks @IanMayo. That's an interesting task, and slightly different to what I was expecting. I'll have a ponder and do some experimentation and get back to you.
Overview
Produce a proof-of-concept for viewing Pepys data in a Jupyter notebook.
This will de-risk the future use of Jupyter notebooks both in Pepys and in general usage by analysts, offering lessons learned in data connectivity, data processing, and visualisation.
Time-permitting, to include:
State data for a period of time from one or more platforms

Feature
This represents an alternate solution for #859
Acceptance criteria
Machine Learning
scikit-learn provides capable clustering algorithms, but we need to think of an application of these methods to Pepys data.
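As a feel for what this could look like, here is a minimal, hedged sketch of clustering positions with scikit-learn's DBSCAN. The coordinates are synthetic stand-ins; in practice they would come from Pepys State records, and a real analysis would use projected or haversine distances rather than raw degrees.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for State positions (latitude, longitude in degrees);
# in practice these would come from the Pepys States table.
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal(loc=(50.00, -1.00), scale=0.005, size=(200, 2)),  # area A
    rng.normal(loc=(50.20, -1.30), scale=0.005, size=(200, 2)),  # area B
])

# eps is in degrees purely for illustration; a real analysis would project
# to metres or use haversine distances on radians.
labels = DBSCAN(eps=0.01, min_samples=5).fit_predict(coords)

# -1 marks noise points; other labels are cluster ids.
print(np.unique(labels, return_counts=True))
```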
Offline mapping
Pepys will frequently be used without an Internet connection, so an OpenStreetMap backdrop will not be available. It would be useful to consider how a similar mapping capability could be provided offline, covering these areas in descending order of importance:
I guess some options are:
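One illustrative option, sketched under the assumption that OpenStreetMap tiles have been cached locally and are served by a local tile server (the URL template and centre point below are hypothetical), is to point folium at that local tile source instead of the default online one:

```python
import folium

# Hypothetical URL template for tiles served locally (e.g. from an mbtiles
# file via a local tile server); swap in whatever offline tile source is used.
local_tiles = "http://localhost:8080/tiles/{z}/{x}/{y}.png"

m = folium.Map(
    location=[50.0, -1.0],   # illustrative centre point
    zoom_start=8,
    tiles=local_tiles,
    attr="Locally cached OpenStreetMap tiles",
)
m.save("offline_map.html")
```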
Sample analysis task #
Extended analysis task, considering bulk data #
Prioritised subsequent tasks #