archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

DataFrames support #13

Closed: lintool closed this issue 6 years ago

lintool commented 7 years ago

We need DataFrames support. Not sure if the best course of action is to resolve #12 and build PySpark DataFrames directly, or to build DataFrame support in Scala first, hash things out there, and then make sure everything works in PySpark.

ianmilligan1 commented 7 years ago

I'm the least experienced here, but after seeing the various implementations of DataFrames in Scala and Python, it probably is more economical to resolve #12 and build PySpark DataFrames directly?

This is going to be a great feature, btw.

greebie commented 7 years ago

What is the use case for DataFrames support via Spark? Filtering & sorting by categorical variables?

The DataFrame documentation for Scala seems pretty straightforward. I guess it just depends on how much of a priority it is for the base.

Another consideration: if you want people to migrate to aut from warcbase, DataFrame support could be a nice selling feature.

lintool commented 7 years ago

What is the use case for DataFrames support via Spark? Filtering & sorting by categorical variables?

Correct. In our proposed FAAV cycle [1], DataFrames would make the filtering step simpler and easier.

[1] http://dl.acm.org/citation.cfm?id=3097570

greebie commented 7 years ago

So the status quo (no DataFrames) would require users to map & filter into a derivative, output it as delimited text, JSON, or some flavor of XML, and then pump that into pandas or R. I take it the main problem with this is that pandas could end up with useless garbage -- or worse, users could find that necessary info was filtered out.
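
For illustration, a minimal sketch of the pandas end of that status-quo hand-off, assuming a hypothetical delimited derivative pages.csv with made-up domain and text columns:

import pandas as pd

# Load a delimited derivative produced by an earlier Spark map & filter job.
# The file name and column names here are hypothetical.
pages = pd.read_csv("pages.csv")

# Any upstream filtering mistake is baked in by this point; pandas can only
# narrow down whatever rows survived the derivative step.
foo_pages = pages[pages["domain"] == "foo.com"]
print(len(foo_pages))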

I assume supporting DataFrames in Warcbase would catch problems at an earlier stage and then output a clean data table that could be imported into pandas, R, or some other useful format (plain old JSON would work for me).

I think having an example output using the DataFrame would be nice. I am ambivalent as to when, although if I really wanted to see it in action, I could try to just implement it in Spark myself. I think waiting for PySpark is probably the best option.

lintool commented 7 years ago

PySpark and DataFrames are two orthogonal issues.

PySpark makes it easy to integrate AUT with other Python-based tools. DataFrames make the filtering process more intuitive. Together, they are a powerful combo!

With DataFrames, we can do things like:

pages.filter(pages['url'].startswith('foo.com') & pages['contents'].contains('bar')).count()

Translation: count all pages that are in the foo.com domain that contain the keyword bar.

More details: https://spark.apache.org/docs/latest/sql-programming-guide.html
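
For concreteness, a self-contained PySpark sketch of that pattern, using a toy in-memory DataFrame in place of a real AUT-derived one (the column names, URLs, and contents are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-filter-example").getOrCreate()

# Toy stand-in for the pages DataFrame; in practice it would come from
# loading a web archive with the toolkit. Values are illustrative only.
pages = spark.createDataFrame(
    [("http://foo.com/index.html", "some text mentioning bar"),
     ("http://foo.com/about.html", "nothing relevant"),
     ("http://baz.org/page.html", "bar appears here too")],
    ["url", "contents"],
)

# Column expressions are combined with & rather than Python's `and`,
# and count() already returns a plain number.
n = pages.filter(
    pages["url"].startswith("http://foo.com") & pages["contents"].contains("bar")
).count()
print(n)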

greebie commented 7 years ago

Ian and I were talking about applying some network algorithms to network data on extraction, to make visualisation easier (and reduce the load on client-based scripts). This could be a good way to provide a use case and test bed for DataFrames in Spark. I would want to calculate some value (e.g. eigenvector centrality, or something based on degree, indegree, or outdegree), apply it as a column, and then use that column to filter out websites that are not influential.

The benefit would be twofold. A) I could use eigenvector centrality (and other calculations) as a node attribute, enriching visualisations. B) We could provide the user base with pre-filtered results based on our expertise in network analysis.
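
A rough sketch of that idea with plain DataFrames, using indegree as a stand-in for the centrality score (true eigenvector centrality would need a graph library or an iterative job); the edge list and threshold are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("link-scoring-example").getOrCreate()

# Toy link graph standing in for an extracted site-link derivative;
# the domains here are invented.
links = spark.createDataFrame(
    [("a.com", "b.com"), ("c.com", "b.com"), ("a.com", "c.com")],
    ["src", "dest"],
)

# Attach a simple influence score (indegree) as a column, then filter out
# sites that fall below a threshold.
indegree = links.groupBy("dest").agg(F.count("*").alias("indegree"))
influential = indegree.filter(F.col("indegree") >= 2)
influential.show()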

ianmilligan1 commented 7 years ago

Assigned to @MapleOx – after #12 is completed. Aiming for November.

ianmilligan1 commented 6 years ago

Done and documented.