debrief / pepys-import

Support library for Pepys maritime data analysis environment
https://pepys-import.readthedocs.io/
Apache License 2.0

Investigate performance issues with large volume datasets #922

Open IanMayo opened 3 years ago

IanMayo commented 3 years ago

Analysts are curious about what kinds of datasets can be supported in Pepys.

In particular, they want to know what volume of data can be loaded into Pepys without an impact on INSERT or QUERY performance.

We could investigate this by pushing high volumes of data into Pepys.

I've produced fictional GPS data that we've read in using Pepys-Import. There were around 1.5M rows of data. There were processing bottlenecks in producing the data, parsing the data-files, and uploading the data to the AcuGis hosting.

An alternative strategy would be to generate the data in PostGIS stored procedures, since the data will already be on the AcuGis server. The data won't have to look like GPS data; it would just have to use representative columns, with data values within the normal ranges.
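As a rough illustration of what "representative columns within normal ranges" could mean, here is a minimal Python sketch. The column names and value ranges are assumptions for illustration, not the actual Pepys schema:

```python
import random
from datetime import datetime, timedelta

# Sketch: generate representative (not realistic) State-like rows.
# Column names and value ranges are assumptions, not the Pepys schema.
def generate_rows(n, start=datetime(2017, 11, 30)):
    rows = []
    for i in range(n):
        rows.append({
            "time": start + timedelta(seconds=i),      # one fix per second
            "latitude": random.uniform(50.0, 51.0),    # assumed test area
            "longitude": random.uniform(-1.0, 0.0),
            "course": random.uniform(0.0, 360.0),      # degrees
            "speed": random.uniform(0.0, 20.0),        # knots
        })
    return rows

sample = generate_rows(5)
```

The same per-row logic would translate naturally into a set-returning stored procedure on the server, avoiding the upload bottleneck entirely.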

Scale? Maybe something like this: [image: proposed data-volume scale]

Run a series of performance tests, with the database containing data for:

rnllv commented 3 years ago

This SQL should help in populating any amount of data. Please let me know if you need any more customisations.

Input configuration is as below.

{
    "frequencyPerTrack": 1,
    "trackLengthMinutes": 10,
    "tracksPerDay": 2,
    "trackDays": 3,
    "trackStartDate": "2017-11-30",
    "trackGapMinutes": 2,
    "sensorIds": ["a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a01", "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11"],
    "sourceIds": ["a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a14", "a0eebc99-9c0b-4ef8-bb6d-6bb4bd380a11"],
    "comments": "For data load test - WARGM3 2021 Jan"
}
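For reference, assuming frequencyPerTrack means one sample per second per sensor (an interpretation, not something confirmed by the SQL), the expected row count from this configuration can be estimated as:

```python
import json

# Subset of the configuration above, enough to estimate row counts
config = json.loads("""
{
    "frequencyPerTrack": 1,
    "trackLengthMinutes": 10,
    "tracksPerDay": 2,
    "trackDays": 3,
    "sensorIds": ["a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a01",
                  "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11"]
}
""")

# Rows per track = samples per second * seconds per track (assumption)
rows_per_track = config["frequencyPerTrack"] * config["trackLengthMinutes"] * 60

# Scale up by tracks per day, number of days, and number of sensors
total_rows = (rows_per_track * config["tracksPerDay"]
              * config["trackDays"] * len(config["sensorIds"]))
# 600 rows per track * 2 tracks * 3 days * 2 sensors = 7200 rows
```

Scaling to the 1.5M-row regime mentioned above is then just a matter of increasing tracksPerDay and trackDays.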
IanMayo commented 3 years ago

Thanks @arunlal-v

While the parameter says "trackGapMinutes", the interval on the track appears to be in seconds. I guess it's best to rename the parameter to seconds.

Aah, one other thing. We need more fields in the results - so we can query them as if they are real data. So, we need these State fields:

There is some code here that generates random tracks. I guess the config file should contain a bounds array giving two locations (top-left and bottom-right) that define the rectangle within which track start locations are placed.
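A minimal sketch of sampling a start location from such a bounds array; the [(top, left), (bottom, right)] layout and the coordinate values here are assumptions for illustration:

```python
import random

# Hypothetical bounds: top-left and bottom-right corners as (lat, lon) pairs
bounds = [(51.0, -1.5), (50.0, -0.5)]

def random_start(bounds):
    (top, left), (bottom, right) = bounds
    lat = random.uniform(bottom, top)   # latitude runs bottom -> top
    lon = random.uniform(left, right)   # longitude runs left -> right
    return lat, lon

lat, lon = random_start(bounds)
```

The equivalent in a PostGIS stored procedure would use random() scaled into the same two ranges.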

Aah, I don't necessarily think we need more Platforms in this trial data, or even Sensors. But we will need lots of Sources (datafiles), since in real life this high volume of data will have come from lots of datafiles (probably one per track).

rnllv commented 3 years ago

While the parameter says "trackGapMinutes", the interval on the track appears to be in seconds. I guess it's best to rename the parameter to seconds.

This is the gap between tracks. If there are 10 tracks in a day, their start times will be that many seconds apart. The interval between sensor data within a given track is defined by frequencyPerTrack. I see the naming is a little confusing here; it can be changed to whatever we want.

Aah, one other thing. We need more fields in the results - so we can query them as if they are real data. So, we need these State fields:

location
course
speed

Yep, this is where I wanted to know whether it should be static or dynamic data.

So, we'll need the following additional capabilities:

1. Bounds per track
2. Sources per track
3. Dynamic location, course, and speed based on bounds
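For the dynamic-location capability, position can be derived from course and speed with a simple dead-reckoning step. This is a flat-earth sketch under assumed units (course in degrees, speed in knots); it is not the SQL implementation:

```python
import math

def step(lat, lon, course_deg, speed_knots, dt_seconds):
    # Flat-earth dead reckoning; fine for short synthetic tracks
    dist_nm = speed_knots * dt_seconds / 3600.0
    dist_deg = dist_nm / 60.0  # 1 degree of latitude ~= 60 nm
    rad = math.radians(course_deg)
    new_lat = lat + dist_deg * math.cos(rad)
    # Longitude degrees shrink with latitude, so correct by cos(lat)
    new_lon = lon + dist_deg * math.sin(rad) / math.cos(math.radians(lat))
    return new_lat, new_lon

# 60 knots due north for one hour = 60 nm, i.e. one degree of latitude
new_lat, new_lon = step(50.0, 0.0, 0.0, 60.0, 3600)
```

Iterating this step per sample, with course and speed perturbed between steps, gives a plausible-looking track.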

I'll start on them.

Thanks for the random generation snippet.

IanMayo commented 3 years ago

@arunlal-v has produced a demonstrator in SQLFiddle.

I suggest the demonstrator "grows" with these steps:

rnllv commented 3 years ago

Updated with SQLFiddle links

vessel on steady course/speed for fixed length of time
random course changes [Can be implemented with the same logic as for random speed, but this has to be integrated along with the bounding box logic]
random speed changes
vary length of track
vary start locations [Can be implemented with the same logic as for random speed, but this has to be integrated along with the bounding box logic]
create tracks on multiple sensors per track per day
create data for multiple platforms per day
create data for multiple days
create data to reach specified total number of rows

rnllv commented 3 years ago

Updated with SQLFiddle links

vessel origin is a random point within a bounding box

rnllv commented 3 years ago

Vary speed using probabilistic methods

rnllv commented 3 years ago

Vary course using probabilistic methods

rnllv commented 3 years ago

Vary course and speed using probabilistic step factor and change factor
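One way to read "probabilistic step factor and change factor": on each tick, with some probability, nudge the value by a fixed step and clamp it to a plausible range. A hedged Python sketch of that idea (the parameter names are mine, not taken from the SQL):

```python
import random

def vary(value, change_prob, step, lo, hi):
    # With probability change_prob, nudge value by +/- step,
    # then clamp to the plausible range [lo, hi]
    if random.random() < change_prob:
        value += random.choice((-step, step))
    return max(lo, min(hi, value))

# Example: a speed random walk over 1000 samples, staying in 0..20 knots
speed = 10.0
for _ in range(1000):
    speed = vary(speed, change_prob=0.2, step=0.5, lo=0.0, hi=20.0)
```

The same function applied to course (with lo/hi replaced by a modulo-360 wrap) covers the course-variation case.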

rnllv commented 3 years ago

Variable length of track

rnllv commented 3 years ago

Multiple sensors of same platform over different tracks over days (constrained track length)

IanMayo commented 2 years ago

Aah, I've seen a high volume dataset here: https://marinecadastre.gov/ais/

We could probably retrieve and load the data directly from the original hosting site.

Here is pandas working with some of that data, thanks to Ed. [image: screenshot of pandas loading the AIS data]
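For anyone trying this, a sketch of how pandas might ingest that AIS data. The column names (MMSI, BaseDateTime, LAT, LON, SOG, COG) follow the marinecadastre.gov CSV export, demonstrated here on an inline sample rather than the full download:

```python
import io
import pandas as pd

# Inline stand-in for a marinecadastre.gov AIS CSV extract
csv = io.StringIO(
    "MMSI,BaseDateTime,LAT,LON,SOG,COG\n"
    "367000001,2017-01-01T00:00:00,29.9,-90.1,5.2,180.0\n"
    "367000002,2017-01-01T00:00:10,30.1,-89.9,0.0,0.0\n"
)

df = pd.read_csv(csv, parse_dates=["BaseDateTime"])

# Example pre-filter before loading: drop stationary vessels
fast = df[df["SOG"] > 1.0]
```

Pre-filtering like this before import would keep the INSERT volume down to the rows that are actually useful for the performance tests.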