Watts-Lab / daphme

Data Access Platform for Human Mobility in Epidemiology

Pipeline step 0: Read partitioned files from S3 in .csv or .parquet #15

Open GolanTrev opened 3 weeks ago

GolanTrev commented 3 weeks ago

Like in these:

We want to pass tests for functions like those in daphme/io.py.
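
A minimal sketch of what such a function in daphme/io.py could look like, assuming pandas (plus s3fs for S3 globbing); the name `read_partitioned` and its signature are placeholders, not the actual API:

```python
# Hypothetical sketch for daphme/io.py -- names are assumptions, not the real API.
from pathlib import Path

import pandas as pd


def read_partitioned(path: str, fmt: str = "csv") -> pd.DataFrame:
    """Read a folder of partitioned .csv or .parquet files into one DataFrame.

    `path` may be a local directory or an s3:// URI (pandas delegates
    S3 access to s3fs when it is installed).
    """
    if fmt == "parquet":
        # pandas/pyarrow can read a partitioned parquet dataset directly.
        return pd.read_parquet(path)
    if fmt == "csv":
        if path.startswith("s3://"):
            import s3fs  # optional dependency, needed for S3 globbing

            fs = s3fs.S3FileSystem()
            files = [f"s3://{p}" for p in fs.glob(f"{path.rstrip('/')}/*.csv")]
        else:
            files = sorted(Path(path).glob("*.csv"))
        return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    raise ValueError(f"Unsupported format: {fmt}")
```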

thom-li commented 3 weeks ago

Sample data is located in arn:aws:s3:::synthetic-raw-data. There are 10-, 100-, and 1000-user options. Each contains dataframes for ground-truth trajectories, sparse sampled trajectories, diaries, and agent homes/workplaces. Each agent's ground-truth trajectory is two weeks long at 1-minute intervals. The sparse trajectories are sampled at either a low or a high frequency.
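
For reference, one way to browse the bucket with boto3; the bucket name comes from the ARN above, but the `100-user/` prefix is a guess at the key layout, not confirmed:

```python
# Bucket name from the ARN above; the key prefix is a hypothetical example.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="synthetic-raw-data", Prefix="100-user/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```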

GolanTrev commented 2 weeks ago

A Reader class, instantiated with a dictionary mapping file column names to internal column names (references?). We test reading a folder of partitioned data split across multiple .csv files (in the future, multiple parquet files across multiple folders). A rough sketch follows below.
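
A possible shape for that interface, assuming pandas; the class and method names here are placeholders, not the agreed design:

```python
# Hypothetical Reader sketch -- class and method names are assumptions.
import glob
import os

import pandas as pd


class Reader:
    """Loads partitioned files and renames raw columns to internal names."""

    def __init__(self, column_map: dict[str, str]):
        # e.g. {"latitude": "lat", "longitude": "lon", "unix_ts": "time"}
        self.column_map = column_map

    def read_csv_folder(self, folder: str) -> pd.DataFrame:
        # Concatenate every partition file in the folder, then rename
        # raw file columns to the internal schema.
        files = sorted(glob.glob(os.path.join(folder, "*.csv")))
        df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
        return df.rename(columns=self.column_map)
```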

Test: assert that the loaded object is a pandas DataFrame and has the right columns (lat or x, lon or y, time, ha, possibly more).
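
A pytest sketch of that test, assuming the hypothetical Reader above; the raw file column names are invented for illustration:

```python
# Hypothetical pytest sketch; raw column names are assumptions.
import pandas as pd

from daphme.io import Reader  # assuming the Reader sketch above lives here


def test_reader_returns_dataframe_with_expected_columns(tmp_path):
    # Write two small partition files into a temporary folder.
    cols = ["latitude", "longitude", "unix_ts", "horizontal_accuracy"]
    for i in range(2):
        part = pd.DataFrame([[40.0 + i, -75.0, 1700000000 + i, 10.0]], columns=cols)
        part.to_csv(tmp_path / f"part-{i}.csv", index=False)

    reader = Reader({"latitude": "lat", "longitude": "lon",
                     "unix_ts": "time", "horizontal_accuracy": "ha"})
    df = reader.read_csv_folder(str(tmp_path))

    assert isinstance(df, pd.DataFrame)
    assert {"lat", "lon", "time", "ha"}.issubset(df.columns)
    assert len(df) == 2  # one row per partition file
```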