As a: Data Scientist
I wish to: easily import data from the STIRData platform into my favorite data science tool or toolchain
So that: I can build on my existing knowledge of the tool and combine the data from the platform with other data I have available
Related to: #14, #33
Tools for analysis and visualisation, such as Python notebooks, pandas, Power BI and Tableau, are increasingly widely used.
While these tools can typically import many kinds of data sources, for instance CSV, you often need to do some manual work to get the full benefit. For instance, the Norwegian Business Register offers a dump as either XLSX or JSON. With pandas, neither format leads to the best data type being identified for every column, so the types must be specified manually. Also, traditional file formats such as JSON and CSV are not optimised for, among other things, data compression. See for instance this article on the difference in size, speed (and cost) of using Apache Parquet vs CSV on Amazon: https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
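The manual step described above can be illustrated with a minimal pandas sketch. The data and the column names (`org_number`, `name`, `postal_code`, `registered`) are made up for illustration and do not reflect the actual layout of the Norwegian Business Register dump or of the STIRData platform. The example uses an in-memory CSV so that it runs without any extra files.

```python
import io

import pandas as pd

# Hypothetical extract: column names and values are invented for illustration.
CSV_TEXT = """org_number,name,postal_code,registered
912345678,Example AS,0150,2019-03-01
987654321,Sample ASA,5003,2020-11-15
"""

# Default import: pandas guesses the types from the text alone, so the
# identifier columns are read as integers and the date is not parsed.
guessed = pd.read_csv(io.StringIO(CSV_TEXT))

# Explicit import: identifiers are kept as strings (preserving leading
# zeros) and the date column is parsed into datetimes.
typed = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype={"org_number": "string", "postal_code": "string"},
    parse_dates=["registered"],
)

print(guessed.dtypes)
print(typed.dtypes)
```

In the default import the postal code `0150` loses its leading zero because the column is read as an integer. A format that stores the schema in the file itself, such as Parquet, would make this kind of per-column specification unnecessary for the person importing the data.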