adsblol / globe_history_2024

✈️🗄 2024 Historical data for all aircrafts traces known to adsb.lol. Openly licensed.
https://adsb.lol
Open Data Commons Open Database License v1.0

Analysis-ready Parquet download? #2

Open marklit opened 6 months ago

marklit commented 6 months ago

I built an ETL script that turns the current download into a Parquet file. Every field is named, the columnar format makes it much quicker to query, and it is compressed with ZStandard, so a day's worth of data is still around 1.2 GB. There are also H3 indices, which help filter specific geographies quickly.

https://tech.marksblogg.com/global-flight-tracking-adsb.html

Is there any chance the above ETL script could work its way into your infrastructure and produce a daily Parquet file in addition to the current daily download tar file?

iakat commented 6 months ago

Hi Mark, thank you for making this issue.

While I am in principle not opposed to having other formats of the data, before considering something like this, I need the files to have their ‘gaps’ accounted for.

As you know, when readsb restarts for any reason (configuration change being the most common), one readsb instance (let’s say -0) will go down while the other keeps running. Then, once -0 is back, -1 will go down and restart. This results in a few minutes of unique data in each file, which is why they are both there.

So basically, I need to solve this problem first with the globe_history format before moving forward.

Make sense?
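One way to picture "accounting for the gaps" is a union of the points recorded by the two instances. The sketch below is only an illustration, not how adsb.lol does it: `merge_traces` is a hypothetical helper, the sample points are made up, and it assumes the first element of each point has already been converted to an absolute timestamp that is comparable across both files.

```python
def merge_traces(trace_a, trace_b):
    """Union two lists of trace points, keyed on the timestamp at index 0."""
    merged = {}
    for point in list(trace_a) + list(trace_b):
        # Keep the first point seen for a timestamp; later duplicates are dropped.
        merged.setdefault(point[0], point)
    # Return the points in chronological order.
    return [merged[ts] for ts in sorted(merged)]


# Hypothetical points: [absolute timestamp, lat, lon, altitude]
from_instance_0 = [[0.0, 51.0, 4.0, 1000], [10.0, 51.1, 4.1, 1100]]
from_instance_1 = [[10.0, 51.1, 4.1, 1100], [20.0, 51.2, 4.2, 1200]]

combined = merge_traces(from_instance_0, from_instance_1)
```

Points that only one instance saw (here, timestamps 0.0 and 20.0) survive, and the overlapping point is kept once.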

On Tue, 5 Mar 2024 at 09:47, Mark Litwintschik wrote:

I built an ETL script that turns the current download into a parquet file. It has names for every field, is columnar-formatted so it is much quicker to query and it is compressed with ZStandard so a day's worth of data is still around 1.2 GB.

https://tech.marksblogg.com/global-flight-tracking-adsb.html

Is there any chance the above ETL script could work its way into your infrastructure and produce a daily Parquet file in addition to the current daily download tar file?

wiedehopf commented 6 months ago

Hey nice blog post! :)

If you're gonna make such a nice new format you should include info if the airplane is on the ground.

            'altitude':
                trace[3]
                if str(trace[3]).strip().lower() != 'ground'
                else None,

I didn't see that saved anywhere. Possibly just a bool in your schema?
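A minimal sketch of what that could look like, reusing the same `'ground'` comparison as the snippet above. `split_altitude` and the `on_ground` name are hypothetical and are not taken from the ETL script.

```python
def split_altitude(raw_altitude):
    """Return (altitude, on_ground) for one raw trace altitude value."""
    # Same string test as the original snippet, kept as a boolean instead of discarded.
    on_ground = str(raw_altitude).strip().lower() == 'ground'
    altitude = None if on_ground else raw_altitude
    return altitude, on_ground


print(split_altitude('ground'))
print(split_altitude(35000))
```

A missing altitude (`None`) comes back as `(None, False)` here; a schema that needs to distinguish "unknown" from "airborne" would want a nullable bool instead.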

You probably already referenced it while using the data, but here is some explanation of the format: https://github.com/wiedehopf/readsb/blob/dev/README-json.md#trace-jsons The aircraft object is only present for every 4th point, but I assume you didn't need much data from there / your DB schema handles that somehow.

Also sorry for the format, it's a bit of a mess.
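If the aircraft details are wanted on every row, one option is to carry the last seen aircraft object forward. This is a rough sketch only: `forward_fill_aircraft` is a hypothetical helper, the sample points are made up, and index 8 for the aircraft object is an assumption that should be checked against the README linked above.

```python
def forward_fill_aircraft(trace, aircraft_index=8):
    """Pair each trace point with the most recent aircraft object seen so far."""
    last_seen = None
    filled = []
    for point in trace:
        # aircraft_index=8 is an assumed position; short points have no aircraft object.
        aircraft = point[aircraft_index] if len(point) > aircraft_index else None
        if aircraft is not None:
            last_seen = aircraft
        filled.append((point, last_seen))
    return filled


# Hypothetical points; only the first one carries an aircraft object.
points = [
    [0, 1.0, 2.0, 'ground', None, None, 0, None, {'flight': 'ABC123'}],
    [5, 1.1, 2.1, 100, None, None, 0, None, None],
    [10, 1.2, 2.2, 200],
]
filled = forward_fill_aircraft(points)
```

Whether slightly stale details are acceptable on the in-between points depends on what the data is used for.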

iakat commented 6 months ago

@marklit of course nothing is preventing you from tackling this project yourself and making the parquet-ready data available similar to this repo. :)

alexey-milovidov commented 5 months ago

@marklit, I've created a ClickHouse database with the data and also added ADSB-E: https://github.com/ClickHouse/adsb.exposed/ Contact me if you have further ideas.