datatogether / research

📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity

Pre-processing coverage data for Data Visualizations #6

Open flyingzumwalt opened 7 years ago

flyingzumwalt commented 7 years ago

@mhucka has been exploring ways to let people visually drill down into the coverage data (i.e., the public record of all the data held by participating orgs). Discussion of dataviz options is here: https://github.com/datatogether/research/tree/master/data_visualization

This will inevitably require pre-processing of the data, partly because you often end up with tens of thousands of items (i.e., URLs) at a given layer of the navigation tree. Beyond pre-processing based on simple analysis of the content, such as running files through FITS to extract content types, there is clearly a need for deeper machine analysis. At the very least you could use entity extraction to identify patterns/topics within a corpus.
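
To make that concrete, here is a minimal sketch of both passes in Python. It assumes the FITS CLI is installed and on the PATH as `fits.sh` (the XML namespace and attribute names are from memory and may need adjusting), and that spaCy with the `en_core_web_sm` model is available for the entity-extraction pass; the directory name and field names are illustrative only, not part of any existing datatogether tooling.

```python
import subprocess
from collections import Counter
from pathlib import Path
from xml.etree import ElementTree

import spacy  # assumes `pip install spacy` + `python -m spacy download en_core_web_sm`

# Namespace used by FITS XML output (verify against your FITS version).
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

def content_type(path: Path) -> str:
    """Run FITS on one file and pull the identified MIME type out of its XML report."""
    xml = subprocess.run(["fits.sh", "-i", str(path)],
                         capture_output=True, check=True, text=True).stdout
    root = ElementTree.fromstring(xml)
    identity = root.find(".//fits:identity", FITS_NS)
    return identity.get("mimetype", "unknown") if identity is not None else "unknown"

def entity_counts(texts, top_n=25):
    """Very rough topic signal: count named entities across a corpus of plain-text docs."""
    nlp = spacy.load("en_core_web_sm")
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(ent.text for ent in doc.ents)
    return counts.most_common(top_n)

if __name__ == "__main__":
    crawl_dir = Path("crawl_items")  # illustrative directory of harvested files
    types = Counter(content_type(p) for p in crawl_dir.iterdir() if p.is_file())
    print(types.most_common())
```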

@mhucka has already been working on some of this. Let's rope in a few more people. @chrpr and @mejackreed come to mind.

The ETL pattern seems pretty applicable, and it opens up opportunities for experimenting with incorporating distributed data and distributed tools into machine-analysis pipelines (see the sketch after this list):

  1. aggregate the essential info into a workable dataset (the tracking info currently lives in a SQL database; eventually it will be distributed)
  2. analyze that dataset
  3. write the analyzed/reformatted result somewhere content-addressable (e.g. to IPFS)
  4. pass around a reference to the updated/processed/extended dataset (e.g. an IPFS hash)
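
Here is a minimal sketch of those four steps in Python. It assumes the tracking info is reachable as a local SQLite file with a `urls` table holding `url` and `content_type` columns (the file, table, and column names are illustrative, not the actual datatogether schema), and that the IPFS CLI is installed as `ipfs`; `ipfs add -Q` prints only the resulting hash, which is what gets handed to downstream consumers.

```python
import json
import sqlite3
import subprocess
from collections import Counter

# 1. Extract: pull the essential info out of the (currently SQL) tracking database.
conn = sqlite3.connect("coverage.db")
rows = conn.execute("SELECT url, content_type FROM urls").fetchall()

# 2. Analyze: e.g., summarize how many URLs exist per content type.
summary = Counter(content_type for _url, content_type in rows)

# 3. Load: write the analyzed result somewhere content-addressable (IPFS here).
with open("coverage_summary.json", "w") as f:
    json.dump(summary, f, indent=2, sort_keys=True)

result = subprocess.run(["ipfs", "add", "-Q", "coverage_summary.json"],
                        capture_output=True, check=True, text=True)
ipfs_hash = result.stdout.strip()

# 4. Pass around a reference to the processed dataset: the hash is all a
#    downstream consumer needs to fetch exactly the same bytes.
print(f"analyzed dataset available at /ipfs/{ipfs_hash}")
```

The point of step 4 is that the reference, not the data, is what moves between tools: any later stage of the pipeline can resolve the hash and is guaranteed to see the same processed dataset.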