DP-3T / bt-measurements


Statistical treatment of the data #4

Closed pdehaye closed 4 years ago

pdehaye commented 4 years ago

I had a closer look at the data, specifically the office and train data. Thanks to @s___m__ on Twitter for getting started on some of the necessary work.

I was very surprised at first to see the following jumps between the 1.5 m and 3 m thresholds:

train

[Figures: confusion matrices for the 1.5 m and 3 m thresholds (train_confusion_15, train_confusion_3)]

office

[Figures: confusion matrices for the 1.5 m and 3 m thresholds (office_confusion15, office_confusion_3)]

In both cases things seem to go much better with the 3 m threshold than with the 1.5 m one, which is counter-intuitive. I then checked the histograms (train, then office):

[Figures: distance histograms for the train and office datasets (train_histogram, office_histogram)]

Of course it now makes sense: you can't misclassify many signals at range >3m if you don't take many such measurements.
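The effect above can be reproduced with a small synthetic sketch (hypothetical numbers, not the project's data): if only a small fraction of the measurements are taken beyond 3 m, the 3 m confusion matrix has few actual negatives and therefore few opportunities for false positives, so it looks cleaner regardless of how good the classifier really is. The distance distribution and noise model below are assumptions made purely for illustration.

```python
# Illustration only: a skewed sampling distribution makes the 3 m
# threshold look better simply because few measurements exist beyond 3 m.
import random

random.seed(0)

# Assumed skew: 900 measurements below 3 m, only 100 beyond it.
distances = [random.uniform(0.5, 3.0) for _ in range(900)] + \
            [random.uniform(3.0, 6.0) for _ in range(100)]

def noisy_estimate(d, sigma=0.8):
    """Crude stand-in for an RSSI-derived distance estimate (assumption)."""
    return max(0.1, d + random.gauss(0, sigma))

def confusion(threshold):
    """Confusion counts for classifying 'within threshold' vs not."""
    tp = fp = tn = fn = 0
    for d in distances:
        actual_close = d <= threshold
        predicted_close = noisy_estimate(d) <= threshold
        if actual_close and predicted_close:
            tp += 1
        elif not actual_close and predicted_close:
            fp += 1
        elif not actual_close:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

for t in (1.5, 3.0):
    tp, fp, tn, fn = confusion(t)
    print(f"threshold {t} m: TP={tp} FP={fp} TN={tn} FN={fn}")
```

At the 3 m threshold the false-positive count is capped by the mere 100 measurements that exist beyond 3 m, while at 1.5 m many more measurements sit near the decision boundary and can be misclassified.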

On the other hand, it is kind of alarming, for two different reasons:

In the end, it looks like you just pooled the sampled data from all scenarios into one big dataset that was then used to pick thresholds.

Have you checked what you were doing with a statistician?

gannimo commented 4 years ago

We aggregated the measurements from the different scenarios in experiment 34 for the PR curves. We looked into the effects of different distributions of distances, and of packet loss at those distances, on the precision/recall curves. We also looked into other sources of uncertainty in the distance calculation, such as phone calibration, the phone's position on the body, and different materials in the environment. All of these considerations, together with feedback from the system and information from other researchers, led to the chosen thresholds.
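For readers unfamiliar with the procedure being described, the pooling-and-threshold-sweep step can be sketched as follows. This is a minimal illustration under assumed synthetic data, not the project's actual pipeline; the scenario parameters, the noise model, and the 2 m "relevant contact" cutoff are all hypothetical.

```python
# Minimal sketch: pool measurements from several scenarios, then sweep
# candidate thresholds on the estimated distance to trace precision/recall.
import random

random.seed(1)

def make_scenario(n, lo, hi, sigma):
    """(true_distance, estimated_distance) pairs for one toy scenario."""
    pairs = []
    for _ in range(n):
        d = random.uniform(lo, hi)
        pairs.append((d, max(0.1, d + random.gauss(0, sigma))))
    return pairs

# Two toy scenarios (think "train" and "office") with different ranges
# and noise levels, pooled into one dataset as in the reply above.
pooled = make_scenario(500, 0.5, 4.0, 0.6) + make_scenario(500, 0.5, 8.0, 1.0)

TARGET = 2.0  # assumed distance defining a relevant contact

def precision_recall(threshold):
    """Precision/recall of flagging contacts via estimated distance."""
    tp = fp = fn = 0
    for true_d, est_d in pooled:
        relevant = true_d <= TARGET
        flagged = est_d <= threshold
        if flagged and relevant:
            tp += 1
        elif flagged:
            fp += 1
        elif relevant:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (1.5, 2.0, 3.0):
    p, r = precision_recall(t)
    print(f"threshold {t} m: precision={p:.2f} recall={r:.2f}")
```

Note that the curve this sweep produces depends directly on the pooled distance distribution, which is exactly the concern raised in the issue: pooling scenarios with very different distance histograms changes where the precision/recall trade-off appears to sit.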