aisstream / issues

Data with error and missing data #37

Closed Zia- closed 11 months ago

Zia- commented 11 months ago

Hello,

To experiment with the API, I ran the WebSocket client from Monday, July 24, 2023 9:08:46 AM (GMT) to Monday, July 24, 2023 9:33:46 AM (GMT), fetching everything for the entire globe. Per the AIS Stream coverage screenshot below, we are supposed to get data along many Asian coastlines.

image

However, as seen in the screenshot below (showing the paths of the MMSIs), a lot of data is missing. There are also a few corrupt tracks, visible as long lines intersecting landmasses.

image

Could you help me understand this missing data behaviour and the source of that data error? Thanks a lot.

Note: I'm not doing any post-processing on the fly and am confident that all data was handled gracefully without hitting any I/O rate issue (i.e. not being able to consume more than 300 messages per second). Also, I'm happy to share the code if that helps.

aisstream commented 11 months ago

The above coverage image is a rough approximation of the areas we have data coverage for. There is no guarantee that we receive regular and consistent data for all areas on the map. Please be reminded that aisstream.io is a free service that makes no guarantees of the accuracy, coverage or delivery of the data provided by the service.

As you have done, the best method for determining whether the area/vessels you wish to receive data for are covered is to test the API over a reasonable period of time. For East Asian coastlines it is recommended to test for a longer period, as data is sparser and less frequent compared to North America/Western Europe.

We will review south-east Asia over the coming days to see if there are unexpected gaps in our service. There is a possibility of a bug in our bounding box implementation.

We did review a few areas we expected to see data for that were not marked in your map. When we checked these areas we received messages as expected.

If you could provide your code we would gladly review it at some point to compare our results to yours.

Zia- commented 11 months ago

Thanks a lot. Makes sense.

The reason I didn't run my script longer than 25 mins is to avoid the websocket connection lost issue https://github.com/aisstream/issues/issues/35. It makes sense that I should have run it for at least a day to see what we get. The code I'm using is the same one mentioned here https://github.com/aisstream/issues/issues/35.

Is there any sort of data pre-processing (cleaning, generalisation etc.) being done at your end before making it public? It would help us to know what cleaning is needed on our end.
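For reference, a minimal sketch of the kind of client-side check that would remove the long lines crossing landmasses. It assumes each vessel's track is a time-sorted list of `(unix_seconds, lat, lon)` tuples. The function names and the 60-knot threshold are hypothetical choices for illustration, not anything provided by aisstream.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_NAUTICAL_MILE = 1.852


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def drop_implausible_fixes(track, max_speed_knots=60.0):
    """Keep only fixes that are in range and reachable from the last kept fix.

    track: list of (unix_seconds, lat, lon) for ONE MMSI, sorted by time.
    max_speed_knots: assumed upper bound on vessel speed (hypothetical value).
    """
    kept = []
    for t, lat, lon in track:
        # Out-of-range values also cover the AIS "not available" defaults
        # (latitude 91, longitude 181).
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue
        if kept:
            t0, lat0, lon0 = kept[-1]
            hours = (t - t0) / 3600.0
            if hours <= 0:
                continue  # duplicate or out-of-order timestamp
            knots = haversine_km(lat0, lon0, lat, lon) / KM_PER_NAUTICAL_MILE / hours
            if knots > max_speed_knots:
                continue  # implied jump is faster than any plausible vessel
        kept.append((t, lat, lon))
    return kept
```

One caveat with this approach: it trusts the first in-range fix of each track, so a corrupt first point would cause the following good points to be rejected.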

aisstream commented 11 months ago

As a quick sanity check we tested a bounding box over south-east Asia and we definitely do receive data. We primarily receive updates for the Singapore/Hong Kong region, but this is to be expected IMO. This disagrees with your chart above, which shows no data being received.

Zia- commented 11 months ago

Could it be because I'm using the entire globe as a bounding box? I was assuming that as long as I'm consuming >300 messages/second, I'm good to go with any bounding box size.

Zia- commented 11 months ago

Regarding the error in the data, is it something you guys are receiving from upstream? I'm just wondering if any sort of filtering is being done at your end.

aisstream commented 11 months ago

There is no filtering or removal of data, apart from the bounding box, MMSI and/or message type filters, if provided by the subscription.

Zia- commented 11 months ago

Thanks a lot. I'll check again using multiple non-overlapping bounding boxes covering the entire globe to see if a different behaviour happens.
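A minimal sketch of how such a set of tiles could be generated, assuming a simple regular grid. The `Tile` type and `tile_globe` function are hypothetical names. The fields are named rather than positional so that the coordinate order expected by the subscription can be applied separately.

```python
import math
from typing import List, NamedTuple


class Tile(NamedTuple):
    """One rectangular bounding box, with explicitly named edges."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


def tile_globe(lat_step: float, lon_step: float) -> List[Tile]:
    """Split the globe into a grid of tiles that share edges but not area."""
    if lat_step <= 0 or lon_step <= 0:
        raise ValueError("steps must be positive")
    n_lat = math.ceil(180.0 / lat_step)
    n_lon = math.ceil(360.0 / lon_step)
    tiles = []
    for i in range(n_lat):
        for j in range(n_lon):
            min_lat = -90.0 + i * lat_step
            min_lon = -180.0 + j * lon_step
            # Clamp the last row/column so tiles never extend past the poles
            # or the antimeridian.
            max_lat = min(min_lat + lat_step, 90.0)
            max_lon = min(min_lon + lon_step, 180.0)
            tiles.append(Tile(min_lat, min_lon, max_lat, max_lon))
    return tiles
```

For example, `tile_globe(90, 180)` yields four tiles, one per quadrant. Vessels sitting exactly on a shared edge could be reported by two neighbouring tiles, so downstream code may need to de-duplicate.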

Zia- commented 11 months ago

Quick Q: If I go down the road of using multiple non-overlapping bounding boxes to cover the entire globe (which I still need to test), how shall I manage the API key?

  1. Shall I use the same API key for all parallel running scripts?
  2. Shall I generate multiple API keys for each script from the same GitHub account?
  3. Shall I generate multiple API keys from multiple GitHub accounts for these scripts?

Is it going to have any impact on the messages I would receive or their I/O rates?

aisstream commented 11 months ago

In theory you should not have to use multiple non-overlapping bounding boxes. If you find there is an improvement please let us know, as that means there is something wrong with our implementation.

To answer your question, we currently do not rate limit and, in the event we do, it will be well publicized. That being said, the safest would be multiple accounts with multiple API keys.

Zia- commented 11 months ago

I will notify you of any relevant results/observations regarding using multiple non-overlapping bboxes.

Regarding APIs, I would prefer using the safest route. Thanks a lot.

Zia- commented 11 months ago

Hello again,

Recently, I used 0aa8017c291bbebc2fbc4523760cf787ff6fa6bb API-key to grab [[[-179,-89], [179,89]]] data, and 99d5021e72acb482868549a89fdf79a202370a61 to grab [[[80,-20], [179,79]]]. The first one roughly covers the entire globe; however, the second one is focused on far-east Asia.

Below is what I got after running the global script for a couple of hours. As for the one focused on far-east Asia, I didn't get a single message in spite of running it for hours. So the assumption that using multiple non-overlapping bboxes would help doesn't seem to hold.

Screenshot 2023-08-03 at 16 02 12

I would appreciate it if you kindly look into it. Thanks a lot.

Zia- commented 11 months ago

You mentioned that a small bbox, around far-east Asia, worked for you previously and you did receive some data. What bbox did you use? We have a feeling that somewhere the lat-long is getting wrongly matched to x-y (instead of y-x).

Zia- commented 11 months ago

The assumption was right!

Using the [-90, -180, 90, 180] bbox for the globe, instead of [-180, -90, 180, 90], we received full coverage after running the script for just a minute.

Screenshot 2023-08-03 at 16 29 22

So, shall we start using this flipped version or can you guys do a hot-fix in your production code to achieve a more intuitive bbox syntax?

Note: we are using https://wiki.openstreetmap.org/wiki/Bounding_Box bbox syntax.

Zia- commented 11 months ago

We just realised that you are indeed using [-90, -180, 90, 180] in your docs at https://aisstream.io/documentation. However, per the usual geospatial convention, it should be the other way round.

At this point, it seems like if you flip it at your end, it will definitely break others' pipelines already using the AIS stream API. Kindly let me know if I should continue with this current implementation. Happy with either of the approaches.
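In the meantime, a minimal sketch of how we could keep the OSM-style order in our own code and convert at the last moment. It assumes the nested `[[[a, b], [c, d]]]` shape used earlier in this thread, with latitude first in each corner pair. The function name is hypothetical.

```python
from typing import List, Sequence


def osm_bbox_to_lat_lon_corners(bbox: Sequence[float]) -> List[List[float]]:
    """Convert an OSM-style bbox to two latitude-first corner pairs.

    bbox: [min_lon, min_lat, max_lon, max_lat] (left, bottom, right, top).
    Returns [[min_lat, min_lon], [max_lat, max_lon]].
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    # Range checks catch the most obvious swaps: a latitude outside +/-90
    # usually means the caller passed latitude-first values by mistake.
    if not (-180.0 <= min_lon <= max_lon <= 180.0):
        raise ValueError(f"longitude out of range or reversed: {min_lon}, {max_lon}")
    if not (-90.0 <= min_lat <= max_lat <= 90.0):
        raise ValueError(f"latitude out of range or reversed: {min_lat}, {max_lat}")
    return [[min_lat, min_lon], [max_lat, max_lon]]


# One entry per bounding box, in the nested list shape used above.
bounding_boxes = [osm_bbox_to_lat_lon_corners([-180, -90, 180, 90])]
```

The range check only detects a swap when a value exceeds the valid latitude range. A small box where all values lie within -90 to 90 would pass silently in either order.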

aisstream commented 11 months ago

Yes, please continue with the current implementation. We will likely not be changing it and swapping lat/long unless we release a new version of the API.

Zia- commented 11 months ago

Thanks a lot. All your help, support and clarification is highly appreciated. Looking forward to using the AIS stream in the long term.