datatogether / research

📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity

Pre-processing coverage data for Data Visualizations #6

Open flyingzumwalt opened 7 years ago

flyingzumwalt commented 7 years ago

@mhucka has been exploring ways to let people visually drill down into the coverage data (i.e., the public record of all the data held by participating orgs). Discussion of dataviz options is here: https://github.com/datatogether/research/tree/master/data_visualization

This will inevitably require pre-processing of the data, partly because you often end up with tens of thousands of items (i.e., URLs) at a given layer of the navigation tree. Beyond pre-processing based on simple analysis of the content, such as running files through FITS to extract content types, there is clearly a need for deeper machine analysis. At the very least you could use entity extraction to identify patterns/topics within a corpus.
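
To make that concrete, here is a minimal sketch of both passes in Python. It assumes the FITS CLI is installed and on the PATH as `fits.sh` (the XML namespace and attribute names are from memory and may need adjusting), and that spaCy with the `en_core_web_sm` model is available for the entity-extraction pass; the directory name and field names are illustrative only, not part of any existing datatogether tooling.

```python
import subprocess
from collections import Counter
from pathlib import Path
from xml.etree import ElementTree

import spacy  # assumes `pip install spacy` + `python -m spacy download en_core_web_sm`

# Namespace used by FITS XML output (verify against your FITS version).
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

def content_type(path: Path) -> str:
    """Run FITS on one file and pull the identified MIME type out of its XML report."""
    xml = subprocess.run(["fits.sh", "-i", str(path)],
                         capture_output=True, check=True, text=True).stdout
    root = ElementTree.fromstring(xml)
    identity = root.find(".//fits:identity", FITS_NS)
    return identity.get("mimetype", "unknown") if identity is not None else "unknown"

def entity_counts(texts, top_n=25):
    """Very rough topic signal: count named entities across a corpus of plain-text docs."""
    nlp = spacy.load("en_core_web_sm")
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(ent.text for ent in doc.ents)
    return counts.most_common(top_n)

if __name__ == "__main__":
    crawl_dir = Path("crawl_items")  # illustrative directory of harvested files
    types = Counter(content_type(p) for p in crawl_dir.iterdir() if p.is_file())
    print(types.most_common())
```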

@mhucka has already been working on some of this. Let's rope in a few more people. @chrpr and @mejackreed come to mind.

The ETL pattern seems pretty applicable, and it opens up opportunities for experimenting with incorporating distributed data and distributed tools into machine-analysis pipelines (see the sketch after this list):

  1. aggregate the essential info into a workable dataset (the tracking info currently lives in a SQL database; eventually it will be distributed)
  2. analyze that dataset
  3. write the analyzed/reformatted result somewhere content-addressable (e.g. to IPFS)
  4. pass around a reference to the updated/processed/extended dataset (e.g. an IPFS hash)
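
Here is a minimal sketch of those four steps in Python. It assumes the tracking info is reachable as a local SQLite file with a `urls` table holding `url` and `content_type` columns (the file, table, and column names are illustrative, not the actual datatogether schema), and that the IPFS CLI is installed as `ipfs`; `ipfs add -Q` prints only the resulting hash, which is what gets handed to downstream consumers.

```python
import json
import sqlite3
import subprocess
from collections import Counter

# 1. Extract: pull the essential info out of the (currently SQL) tracking database.
conn = sqlite3.connect("coverage.db")
rows = conn.execute("SELECT url, content_type FROM urls").fetchall()

# 2. Analyze: e.g., summarize how many URLs exist per content type.
summary = Counter(content_type for _url, content_type in rows)

# 3. Load: write the analyzed result somewhere content-addressable (IPFS here).
with open("coverage_summary.json", "w") as f:
    json.dump(summary, f, indent=2, sort_keys=True)

result = subprocess.run(["ipfs", "add", "-Q", "coverage_summary.json"],
                        capture_output=True, check=True, text=True)
ipfs_hash = result.stdout.strip()

# 4. Pass around a reference to the processed dataset: the hash is all a
#    downstream consumer needs to fetch exactly the same bytes.
print(f"analyzed dataset available at /ipfs/{ipfs_hash}")
```

The point of step 4 is that the reference, not the data, is what moves between tools: any later stage of the pipeline can resolve the hash and is guaranteed to see the same processed dataset.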