AeroPython / flight-safety-analysis

Analysis of aircraft accidents data
MIT License
18 stars 16 forks source link

Database access #4

Open AlexS12 opened 6 years ago

AlexS12 commented 6 years ago

Currently some functions under flight_safety/queries are provided to access to the data base filter following a criteria and return a DataFrame with the suitable column types.

Issues:

Would something like sqlalchemy (https://www.sqlalchemy.org/) or pypika (https://github.com/kayak/pypika) help here?

astrojuanlu commented 6 years ago

Regarding the filtering: the more functions you provide, the more you approach to reimplement SQL... SQLAlchemy won't be of much help here, since it's intended to serve as an abstraction layer to several databases, and if I understand correctly, our data will always be in SQLite. For arbitrary queries, I think users will have to learn SQL anyway.

If pandas dtype inference is not working well, some manual work will be needed... This is unavoidable and someone has to do it. A series of transformations should be written to prepare the data and present it with a clean schema. Welcome to the "data wrangling" world :cowboy_hat_face:

AlexS12 commented 6 years ago

This is related to https://github.com/AeroPython/flight-safety-analysis/issues/2, but given that we have to transform the database from access, would it be a good approach to transform it a to an hdf5 file with the suitable dtypes? That would remove the dtype inference from filtering functions and we wouldn't use SQL anymore.

astrojuanlu commented 6 years ago

I don't have experience reading multi-table HDF5 files and not everyone loves the format, so I cannot comment here.

AlexS12 commented 6 years ago

Thanks for the link! Moving this kind of data from a SQL database to an hdf5 does not seem a sensible option.