IanMayo opened this issue 2 years ago
@IanMayo Here are some initial demos of a very simple notebook interface:
There are loads of problems with this interface, but it's just an idea of what is possible with just a few lines of code. I'll put up a PR shortly so you can see the actual notebook code, and then I'll move on to some of the other stuff we wanted to demo.
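To give a flavour of what "a few lines of code" means here, below is a minimal sketch of that kind of notebook interface, not the actual PR code: an ipywidgets dropdown for picking a platform that re-plots its points on selection. The DataFrame contents, column names, and platform names are all illustrative assumptions; in the real notebook the data would come from the Pepys database.

```python
import matplotlib.pyplot as plt
import pandas as pd
from ipywidgets import interact

# Hypothetical State data; in the real notebook this would be queried
# from the Pepys database rather than hard-coded.
states_df = pd.DataFrame({
    "platform": ["A", "A", "A", "B", "B", "B"],
    "longitude": [-1.00, -1.01, -1.02, -1.30, -1.31, -1.32],
    "latitude": [50.00, 50.01, 50.02, 50.20, 50.21, 50.22],
})

def plot_platform(platform):
    # Scatter-plot the selected platform's positions.
    subset = states_df[states_df["platform"] == platform]
    subset.plot.scatter(x="longitude", y="latitude", title=platform)
    plt.show()

# Dropdown of platform names; the plot refreshes on selection.
interact(plot_platform, platform=sorted(states_df["platform"].unique()))
```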
See #1096 for a PR including this notebook code. I've also included some static and interactive plots of other variables.
Notably, at the moment we have to work around pandas' incompatibility with SQLAlchemy 2.0. The SQLAlchemy engine that we create in the Pepys DataStore won't work with pandas, because we create it with future=True (to use the new SQLAlchemy 1.4 features and deprecations, making the code ready for 2.0). There is a pandas issue for adding SQLAlchemy 2.0 support (see https://github.com/pandas-dev/pandas/issues/40460), which seems to be stalled for lack of volunteers with the relevant experience; that might be something we could contribute to if you were interested.
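One possible workaround, sketched below purely as an illustration rather than the actual Pepys code: create a second, plain 1.x-style engine (without future=True) just for pandas queries. The connection string, table name, and column layout are assumptions for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; in practice this would come from the
# Pepys configuration rather than being hard-coded.
db_url = "postgresql://user:password@localhost:5432/pepys"

# Plain 1.x-style engine (no future=True), which pandas can still work with.
legacy_engine = create_engine(db_url)

# Hypothetical table name, purely for illustration.
states_df = pd.read_sql('SELECT * FROM "States" LIMIT 1000', legacy_engine)
print(states_df.head())
```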
Ah yes, one more thing:
Do you have any really good, realistic (ideally actually real, but not sensitive) data that I could use for playing around with developing analysis capabilities in Jupyter? Part of the reason I built the UI for selecting a platform and plotting the points was so that I could see whether I could find a realistic-looking track - a lot of the data on TracStor is obviously test data. The best I found was this HIPP platform, but it hasn't got a massive amount of data (only ~350 data points). If I were to start running scikit-learn models on the data, I'd ideally like something fairly realistic and reasonably large. Any ideas?
Aah, @robintw - from the depths of my memory I remembered where I'd seen a sample dataset; it's in the CSV files here: https://www.gov.uk/government/news/dstl-shares-new-open-source-framework-initiative
Some tracks appeared to have up to 3k points.
Obvs you'll either have to produce a parser to get the data into Pepys, or do some Excel column fiddling to make it look like an existing format which we parse. The "unknown platform" handling will be great for this data :-D
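For the "column fiddling" route, something like the pandas sketch below might be enough to get started. Both the source column names and the target column names are assumptions here; they would need to match the real CSV files and whichever existing Pepys format is being mimicked.

```python
import pandas as pd

# Hypothetical input file and source column names.
raw = pd.read_csv("sample_tracks.csv")

# Rename and reorder columns so the file resembles a format Pepys already
# parses (the target column names are also assumptions).
reshaped = raw.rename(columns={
    "DateTime": "time",
    "Vessel": "platform",
    "Lat": "latitude",
    "Lon": "longitude",
})[["time", "platform", "latitude", "longitude"]]

reshaped.to_csv("tracks_for_pepys.csv", index=False)
```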
Here's another source of AIS data @Robin - it's a huge dataset, hopefully they're long tracks rather than just lots of small ones. https://marinecadastre.gov/ais/
@robintw - the analysts have come up with a useful analysis task (above) to "drive" the technical demonstrator. I'm happy to either expand the terms or rephrase the description as necessary for you to understand/implement it.
Thanks @IanMayo. That's an interesting task, and slightly different to what I was expecting. I'll have a ponder and do some experimentation and get back to you.
Overview
Produce a proof-of-concept for viewing Pepys data in a Jupyter notebook.
This will de-risk the future use of Jupyter notebooks both in Pepys and in general usage by analysts, offering lessons learned in data connectivity, data processing, and visualisation.
Time-permitting, to include:
State data for a period of time from one or more platforms

Feature
This represents an alternate solution for #859
Acceptance criteria
Machine Learning
scikit-learn provides capable clustering algorithms, but we need to think of an application of these methods to Pepys data.
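As a feel for what this could look like, here is a minimal, hedged sketch of clustering positions with scikit-learn's DBSCAN. The coordinates are synthetic stand-ins; in practice they would come from Pepys State records, and a real analysis would use projected or haversine distances rather than raw degrees.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for State positions (latitude, longitude in degrees);
# in practice these would come from the Pepys States table.
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal(loc=(50.00, -1.00), scale=0.005, size=(200, 2)),  # area A
    rng.normal(loc=(50.20, -1.30), scale=0.005, size=(200, 2)),  # area B
])

# eps is in degrees purely for illustration; a real analysis would project
# to metres or use haversine distances on radians.
labels = DBSCAN(eps=0.01, min_samples=5).fit_predict(coords)

# -1 marks noise points; other labels are cluster ids.
print(np.unique(labels, return_counts=True))
```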
Offline mapping
Pepys will frequently be used without an Internet connection, so an OpenStreetMap backdrop will not be available. It would be useful to consider how a similar mapping capability could be provided offline, covering these areas in descending order of importance:
I guess some options are:
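One illustrative option, sketched under the assumption that OpenStreetMap tiles have been cached locally and are served by a local tile server (the URL template and centre point below are hypothetical), is to point folium at that local tile source instead of the default online one:

```python
import folium

# Hypothetical URL template for tiles served locally (e.g. from an mbtiles
# file via a local tile server); swap in whatever offline tile source is used.
local_tiles = "http://localhost:8080/tiles/{z}/{x}/{y}.png"

m = folium.Map(
    location=[50.0, -1.0],   # illustrative centre point
    zoom_start=8,
    tiles=local_tiles,
    attr="Locally cached OpenStreetMap tiles",
)
m.save("offline_map.html")
```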
Sample analysis task #
Extended analysis task, considering bulk data #
Prioritised subsequent tasks #