Open coderbyheart opened 4 years ago
If you ever tackle this I would be interested in helping. I also have some (non-working) code for a custom CDK resource.
Some ideas about the topic:
AWS Athena is probably not a good choice: text file sources are billed in full unless partitioned, and we typically need exactly one entry.
AWS Data Pipeline does not get any love from what I saw looking around (bad availability, no new features, no CDK support, etc.; the data import job is done by some JAR).
AWS Glue seems to be a good choice; the actual job-managing code needs to be Python or Scala. You only need a job, everything else can be handled inside the Python code since the data schema will not change.
Data access could be done through an internal API or just by calling a query
I would have used DynamoDB, but I guess almost anything could work. The partition key could be tricky: maybe <mcc>-<mnc> or <mcc>-<mnc>-<lac>, with <lac>-<cell> or just <cell> as the sort key (see the sketch below).
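To make these variants concrete, here is a minimal sketch of such a key builder in TypeScript. The column names follow the OpenCelliD CSV export (mcc, net, area, cell); the chosen partition/sort split is just one of the options above, not a decision:

```typescript
// Hypothetical key builder for the proposed DynamoDB schema.
// Column names follow the OpenCelliD CSV export (mcc, net, area, cell).
type CellRecord = {
  mcc: number // mobile country code
  net: number // mobile network code (mnc)
  area: number // location area code (lac)
  cell: number // cell id
}

// One of the variants above: <mcc>-<mnc> as partition key, <lac>-<cell> as sort key
const cellKey = ({ mcc, net, area, cell }: CellRecord) => ({
  partitionKey: `${mcc}-${net}`, // e.g. "262-1"; alternative: `${mcc}-${net}-${area}`
  sortKey: `${area}-${cell}`, // alternative: just `${cell}`
})
```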
Thank you for the input.
Can you give some context on why you prefer this approach over using UnwiredLabs?
Regarding the implementation: right now we have a DynamoDB table with the cellId as key, which is used for all lookups.
All data (from devices, but also from the UnwiredLabs API) is stored with that key, so I imagine using the download data as a seed. That should be pretty straightforward to implement in a lambda: the initial import can be done from a local machine, and the daily updates can then be done in a lambda (they are around 1 MB per day).
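A rough sketch of what that seed import could look like, assuming a table keyed by a composite cellId string and the column order of the OpenCelliD CSV export; the table name and key format are placeholders, and unprocessed batch items are ignored for brevity:

```typescript
import { DynamoDB } from 'aws-sdk'
import { createReadStream } from 'fs'
import { createGunzip } from 'zlib'
import { createInterface } from 'readline'

const db = new DynamoDB.DocumentClient()
const TableName = 'cellGeolocation' // assumption: the actual table name differs

// Streams the downloaded cell_towers.csv.gz into DynamoDB in batches of 25
// (the BatchWriteItem limit). Column order follows the OpenCelliD export:
// radio,mcc,net,area,cell,unit,lon,lat,...
const seed = async (file: string): Promise<void> => {
  const lines = createInterface({
    input: createReadStream(file).pipe(createGunzip()),
  })
  let batch: DynamoDB.DocumentClient.WriteRequest[] = []
  for await (const line of lines) {
    const [radio, mcc, net, area, cell, , lon, lat] = line.split(',')
    if (radio === 'radio') continue // skip the CSV header
    batch.push({
      PutRequest: {
        Item: {
          cellId: `${mcc}-${net}-${area}-${cell}`, // assumed key format
          lat: parseFloat(lat),
          lng: parseFloat(lon),
        },
      },
    })
    if (batch.length === 25) {
      // NB: UnprocessedItems are ignored here for brevity
      await db.batchWrite({ RequestItems: { [TableName]: batch } }).promise()
      batch = []
    }
  }
  if (batch.length > 0)
    await db.batchWrite({ RequestItems: { [TableName]: batch } }).promise()
}

seed('./cell_towers.csv.gz').catch(console.error)
```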
I would prefer this approach over a commercial offering since costs will come into play: per-device pricing, as they offer it, will possibly be prohibitive, and a limited request quota is, well, limited. Handling the legal aspects is also easier if you do not involve too many third parties. Also, Google's solution costs USD 0.005 per location resolution.
What I do fear is the quality of the data: as you mentioned there are different data sets, and a quick check by my colleagues produced mixed results, sometimes resolving correctly, sometimes leading to wrong positions, etc.
Importing from local makes sense to me, but I prefer stacks which can handle everything automatically, for better CI. This is also not very hard to do, since the aws-cdk framework for custom resources works quite well (see the sketch below). The lambda costs will be around €0.02 for a 15 min run, and the full import can be done in ~10 min, since the download speed is around 2 MB/s (measured from a lambda in eu-central-1). The Glue costs are quite high though, possibly ~€4 for the full import (just a guess from my last runs). So doing it from local would make more sense if we also do the DynamoDB part.
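For illustration, a minimal sketch of how the one-time import could be wired up as a CDK custom resource, so a fresh stack deployment seeds itself; construct names, runtime, and asset path are assumptions, not the actual Bifravst stack:

```typescript
import * as cdk from '@aws-cdk/core'
import * as lambda from '@aws-cdk/aws-lambda'
import { Provider } from '@aws-cdk/custom-resources'

// Runs the seed import once on stack creation via a CloudFormation
// custom resource. The 15 min timeout is the lambda maximum; the full
// import fits within it (~10 min, see above).
export class CellSeedResource extends cdk.Construct {
  constructor(scope: cdk.Construct, id: string) {
    super(scope, id)
    const onEventHandler = new lambda.Function(this, 'seedFn', {
      runtime: lambda.Runtime.NODEJS_12_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('./seed-lambda'), // assumed: contains the import code
      timeout: cdk.Duration.minutes(15),
      memorySize: 1792,
    })
    new cdk.CustomResource(this, 'seed', {
      serviceToken: new Provider(this, 'provider', { onEventHandler })
        .serviceToken,
    })
  }
}
```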
Thanks for the link, yes that should work as an index since we only query single results.
Side note: the UnwiredLabs file routes return a 200 HTTP status code even if the quota is exceeded (it should be 429 IMHO; I wrote them an email but got no response yet), which makes the response hard to process.
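One possible workaround, since the status code cannot be trusted: verify that the payload actually is gzip by checking the magic bytes before processing it. This is a sketch assuming node-fetch; the error handling is just illustrative:

```typescript
import fetch from 'node-fetch'

// The download route may answer 200 even when the quota is exceeded, so
// instead of trusting the status code, check for the gzip magic bytes
// (0x1f 0x8b) before processing the payload.
const download = async (url: string): Promise<Buffer> => {
  const res = await fetch(url)
  const body = await res.buffer()
  if (body.length < 2 || body[0] !== 0x1f || body[1] !== 0x8b) {
    // Most likely an error / quota message served with HTTP 200
    throw new Error(`Not a gzip file: ${body.slice(0, 100).toString('utf-8')}`)
  }
  return body
}
```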
Here is the thing, though: in Bifravst there is already a cache for UnwiredLabs responses, so there is only one charge per cell; every subsequent query to get a device location based on the same cell will not result in an additional API query to a third party. This drives down costs significantly, and especially in deployments with devices in the same cells, they can all share this data. In addition, the cat trackers contribute to this database, so over time the devices themselves provide location data. This will only give you a rough location estimate (kilometer resolution), because right now the nRF9160 modem only provides the id of the cell tower it is connected to. In the future we might get the list of cells it "sees" together with the signal strength, which would allow triangulating the position; however, again, not very accurately, since many factors affect signal strength.
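For illustration, the cache-first lookup described above boils down to something like this sketch (table name and key format are assumptions; resolve stands in for the UnwiredLabs client):

```typescript
import { DynamoDB } from 'aws-sdk'

type Location = { lat: number; lng: number; accuracy: number }

const db = new DynamoDB.DocumentClient()
const TableName = 'cellGeolocationCache' // assumption

// Every cell is only ever charged once against the third-party API;
// all later queries for the same cell are served from DynamoDB.
const locateCell = async (
  cellId: string,
  resolve: (cellId: string) => Promise<Location>, // e.g. the UnwiredLabs client
): Promise<Location> => {
  const { Item } = await db.get({ TableName, Key: { cellId } }).promise()
  if (Item !== undefined) return Item as Location // cache hit: no API charge
  const location = await resolve(cellId) // cache miss: one-time charge
  await db.put({ TableName, Item: { cellId, ...location } }).promise()
  return location
}
```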
Looking at the worst-case scenario: you operate a worldwide IoT deployment and need to know the location of each of your devices immediately (e.g. because you are tracking the delivery of a COVID-19 vaccine worldwide). The current cell_towers.csv.gz
has 42,951,312 LTE cells listed. Let's assume you need to resolve 10% of those cells (which, by the way, also means your fleet has over 4 million devices); you are then looking at costs of $21,475 using Google. I'd say these costs are negligible given that you will also need to invest around $100M in the hardware (assuming a cost per tracker of $25).
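The back-of-the-envelope arithmetic behind these numbers, spelled out:

```typescript
const lteCells = 42_951_312 // cells listed in cell_towers.csv.gz
const resolvedShare = 0.1 // assumption: 10% of all cells need to be resolved
const pricePerLookup = 0.005 // USD, Google's per-resolution price cited above

const lookupCost = lteCells * resolvedShare * pricePerLookup // ≈ $21,475
const hardwareCost = lteCells * resolvedShare * 25 // ≈ $107M at $25 per tracker (~$100M above)
```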
So my point here is that from a cost perspective using a third party service is not expensive, and will offer higher quality data.
Thanks for pointing out the caching, I actually did not see this (but then you also have to respect a certain TTL on the data, since things change from time to time).
For the cost perspective: you are looking at this from an owner = operator perspective, where operating costs are low compared to the investment. If you contract for operations only, or run a SaaS business, these $20k here and $20k there add up, minimizing your profit or even prohibiting the business model. Your calculation is a good point, but still, USD 20k is a lot of money.
https://bifravst.github.io/cell-geolocation-helpers/ does not work (live demo referenced from the docs)
If you are providing production services for 4 million IoT devices and you are struggling with costs of $20k, even per month, I'd say your business model is wrong. Thanks anyway for providing this input, it helps to understand why this feature might be relevant. I am really open to a contribution, and I don't see this being a big effort: it should be integrated as an option (like the UnwiredLabs feature) into the cell geolocation resolution state machine (Step Functions), with a cron lambda that fetches the daily updates (sketched below). Writing the tests will probably be the most interesting challenge, because we can't download the full data set for every test run, but need to fake it (e.g. seed with a known data subset).
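The cron part could be as small as this sketch (CDK v1 style; rule name and schedule are assumptions, and the handler is the hypothetical delta-import lambda):

```typescript
import * as cdk from '@aws-cdk/core'
import * as events from '@aws-cdk/aws-events'
import * as targets from '@aws-cdk/aws-events-targets'
import * as lambda from '@aws-cdk/aws-lambda'

// Triggers the daily delta import (the updates are around 1 MB per day).
const wireDailyUpdate = (scope: cdk.Construct, fn: lambda.IFunction): void => {
  new events.Rule(scope, 'dailyCellUpdate', {
    schedule: events.Schedule.cron({ minute: '0', hour: '3' }), // daily, 03:00 UTC
    targets: [new targets.LambdaFunction(fn)],
  })
}
```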
Thank you for reporting this, it's fixed now.
Instead of querying the UnwiredLabs geolocation API (which is only free for a limited number of requests), an API serving the free OpenCelliD data could be implemented.
It would be seeded with the large dataset once, and then download the deltas.
This will reduce the costs for user scenarios where they need to resolve a lot of different cell locations. It also removes the inherent transmission of metadata about the devices in a solution (in practice UnwiredLabs will know where devices from this solution are located globally).
The downside is that it will increase the cost for running Bifravst.
Please use the reaction feature (:+1:) to mark if this would be valuable for you.