STIRData / user-stories

A repository to hold the epics and user stories used in the project-board "Epics and User Stories"

[USER STORY] Continue analysis and visualisation of the data in my favorite tool #37

Open sskagemo opened 3 years ago

sskagemo commented 3 years ago

As a: Data Scientist

I wish to: easily import data from the STIRData platform into my favorite data science tool or toolchain

So that: I can benefit from my existing knowledge of the tool, as well as combining the data from the platform with other data I have available

Related to: #14 , #33

There is a growing use of tools for analysis and visualisation, such as Python Notebooks, Python Pandas, PowerBI, Tableau etc.

While these tools can typically import many types of data sources, for instance CSV, you will often need to do some manual work to get the full benefit. For instance, the Norwegian Business Register offers a dump as either XLSX or JSON. When loading either of these into Python Pandas, the best datatype for each column is not identified automatically, so it must be specified manually. Also, traditional file formats such as JSON and CSV are not optimised for things like compression of data. See for instance this article on the difference in size, speed (and cost) of using Apache Parquet vs CSV on Amazon: https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
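To illustrate the manual work described above, here is a minimal sketch of what a Pandas user typically ends up writing. The sample rows and column names (`org_number`, `postal_code`, and so on) are hypothetical and not the actual schema of the Norwegian Business Register dump, and CSV is used only to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical sample rows; the column names are illustrative only and
# do not reflect the real schema of any business register dump.
CSV_TEXT = """org_number,name,postal_code,registered,employees
912345678,Example AS,0150,2019-03-01,12
987654321,Sample ASA,5003,2007-11-15,
"""

# Default import: Pandas infers a type for each column on its own.
# The postal code is read as a number (losing its leading zero) and the
# employee count becomes a float because one value is missing.
inferred = pd.read_csv(io.StringIO(CSV_TEXT))

# Manual import: the user has to spell out the intended type per column.
typed = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype={
        "org_number": "string",   # identifier, not a quantity
        "name": "string",
        "postal_code": "string",  # keeps the leading zero
        "employees": "Int64",     # nullable integer, allows missing values
    },
    parse_dates=["registered"],   # parse as datetime instead of text
)

print(inferred.dtypes)
print(typed.dtypes)
```

A format that carries column types with the data, such as Apache Parquet (which Pandas can write with `DataFrame.to_parquet` when an engine like pyarrow or fastparquet is installed), would make most of the `dtype` mapping above unnecessary for the person importing the data.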