What is the use case for DataFrames support via Spark? Filtering & sorting by categorical variables?
The DataFrame documentation in Scala seems pretty straightforward. I guess it just depends on how much of a priority it is for the user base.
Another consideration is that if you want people to migrate from warcbase to aut, DataFrame support could be a nice selling feature.
> What is the use case for DataFrames support via Spark? Filtering & sorting by categorical variables?
Correct. In our proposed FAAV cycle [1], DataFrames would support simpler, easier filtering.
So the status quo (no DataFrames) would require users to map & filter into a derivative, output it as delimited text, JSON, or some flavor of XML, and then pump that into pandas or R. I take it the main problem with this is that pandas could end up with useless garbage -- or worse, you'd find that necessary info was filtered out.
I assume supporting DataFrames in Warcbase would catch problems at an earlier stage and then output a nice data table that could be imported into pandas, R, or some other useful format (plain old JSON would work for me).
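For concreteness, here's a minimal sketch of that status-quo pipeline. The `pages` RDD and its `url`/`content` fields are hypothetical stand-ins; the actual loaders and record schema may differ:

```python
# Hypothetical status-quo pipeline: derive, dump to delimited text, reload in pandas.
# Assumes `pages` is an RDD of records with `url` and `content` fields.
derivative = (pages
    .filter(lambda r: "bar" in r.content)           # keep pages mentioning the keyword
    .map(lambda r: "\t".join([r.url, r.content])))  # flatten to tab-delimited lines
derivative.saveAsTextFile("derivative-out")

# Later, in a separate analysis step:
import pandas as pd
df = pd.read_csv("derivative-out/part-00000", sep="\t", names=["url", "content"])
```

Any schema mismatch or over-aggressive filtering only surfaces at this last step, which is exactly the problem described above.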
I think having an example output using the DataFrame would be nice. I am ambivalent as to when, although if I really wanted to see it in action, I could try to implement it in Spark myself. I think waiting for PySpark is probably the best option.
PySpark and DataFrames are two orthogonal issues.
PySpark makes it easy to integrate AUT with other Python-based tools. DataFrames make the filtering process more intuitive. Together they're a powerful combo!
With DataFrames, we can do things like:
pages.filter(pages['url'].startswith('foo.com') & pages['contents'].contains('bar')).count()
Translation: count all pages that are in the foo.com domain and contain the keyword bar.
More details: https://spark.apache.org/docs/latest/sql-programming-guide.html
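For a self-contained version of the one-liner above (the `pages` DataFrame and its columns are hypothetical stand-ins for whatever schema the toolkit ends up exposing):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-filter").getOrCreate()

# Hypothetical pages DataFrame; in practice this would come from loading an ARC/WARC.
pages = spark.createDataFrame(
    [("foo.com/a", "some bar text"), ("foo.com/b", "no match"), ("baz.org", "bar")],
    ["url", "contents"],
)

count = pages.filter(
    pages["url"].startswith("foo.com") & pages["contents"].contains("bar")
).count()
print(count)  # 1: only foo.com/a matches both conditions
```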
Ian and I were talking about applying some network algorithms to the network data at extraction time to make visualisation easier (and to reduce the load on client-side scripts). This could be a good way to provide a use case and test bed for DataFrames in Spark. I would want to calculate some value (e.g., eigenvector centrality, based on degree, in-degree, or out-degree), apply it as a column, and then use that column to filter out websites that are not influential.
The benefit to this would be two-fold. A) I could use eigenvector centrality (and other calculations) as a node attribute, enriching visualisations. B) We could provide pre-filtered results to the user base, drawing on our expertise in network analysis.
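A minimal sketch of that idea, assuming a hypothetical `links` DataFrame of (src, dst) edges and using in-degree as a simple stand-in for a fancier centrality measure:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("centrality-filter").getOrCreate()

# Hypothetical link-graph derivative: one row per (source, destination) edge.
links = spark.createDataFrame(
    [("a.com", "b.com"), ("a.com", "c.com"), ("b.com", "c.com")],
    ["src", "dst"],
)

# Score each node by in-degree (stand-in for eigenvector centrality).
scores = links.groupBy("dst").agg(F.count("*").alias("indegree"))

# The score is now just a column, so filtering out non-influential sites is one line.
influential = scores.filter(F.col("indegree") >= 2)
influential.show()
```

The same pattern would work for any per-node score: compute it, join or aggregate it in as a column, then filter or export with the score attached as a node attribute.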
Assigned to @MapleOx – after #12 is completed. Aiming for November.
Done and documented.
We need DataFrames support. Not sure if the best course of action is to resolve #12 and build PySpark DataFrames directly, or to build DataFrame support and hash it out in Scala first, before making sure everything works in PySpark.