Quansight / pycon2020-200-billion-gps-points

Analyzing 200 billion GPS Points with Python on the Cheap
BSD 2-Clause "Simplified" License

Target Dataset? #2

Open dharhas opened 4 years ago

dharhas commented 4 years ago

Use the OpenStreetMap 3-billion-point dataset (https://examples.pyviz.org/osm/osm-3billion.html) or an anonymized version of an internal dataset.
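
If we go with the OSM option, loading it with Dask could look something like the sketch below, assuming the points have already been converted to Parquet (the file name is a placeholder, not the actual layout of the pyviz example data):

import dask.dataframe as dd

# placeholder path; point this at wherever the converted dataset lives
df = dd.read_parquet('osm-3billion.parq')
print(df.head())        # peek at the coordinate columns
print(df.npartitions)   # how the data is chunked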

tylerpotts commented 4 years ago

@dharhas I downloaded and decompressed the file below to qgpu2:

http://planet.nchc.org.tw/Planet.osm/gps/gpx-planet-2013-04-09.tar.xz

After decompression, that came to 284 GB of XML files. To avoid parsing all of that XML just to get it into Parquet, I'm downloading a slightly older dump that's in CSV format:

http://planet.nchc.org.tw/Planet.osm/gps/simple-gps-points-120604.csv.xz

This one is 15 GB compressed vs. 24 GB for the compressed XML, but the two might end up containing similar amounts of actual data, since a lot of the decompressed XML is just filler markup.

Let me know what you think

dharhas commented 4 years ago

So these are GPX files. Instead of processing the XML directly, it would be simpler to use https://pypi.org/project/gpxpy/

The link below shows how to get the data into a pandas DataFrame; we should be able to wrap/modify that to use Dask instead:

https://ocefpaf.github.io/python4oceanographers/blog/2014/08/18/gpx/

If it becomes too painful, we can fall back to the CSV.

dharhas commented 4 years ago

We should be able to do this:

import glob

import dask
import dask.dataframe as dd
import gpxpy
import pandas as pd

@dask.delayed
def parse_gpx(file):
    # parse one GPX file into a pandas DataFrame using gpxpy
    with open(file) as f:
        gpx = gpxpy.parse(f)
    points = [(p.time, p.latitude, p.longitude, p.elevation)
              for track in gpx.tracks
              for segment in track.segments
              for p in segment.points]
    return pd.DataFrame(points, columns=['time', 'lat', 'lon', 'elevation'])

files = glob.glob('*.gpx')
df = dd.from_delayed([parse_gpx(f) for f in files])
df.to_parquet('unsorted_data.parq')

tylerpotts commented 4 years ago

@dharhas I ended up processing the CSV data into Parquet before I read your comment. When I checked the size, the CSV gave us ~2.9 billion rows of data.

I'll try the GPX data as well. It won't hurt to have two datasets, and I think the GPX one should be a lot bigger.
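
For reference, a minimal sketch of the CSV-to-Parquet conversion and row-count check with Dask; the file name and column names below are assumptions about the decompressed dump, not its actual layout:

import dask.dataframe as dd

# assumed layout: one latitude/longitude pair per line, no header row
df = dd.read_csv('simple-gps-points-120604.csv', names=['lat', 'lon'])
df.to_parquet('gps_points.parq')

# count rows in the resulting Parquet dataset
points = dd.read_parquet('gps_points.parq')
print(len(points))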