ericleasemorgan / reader

Distant Reader, a tool for using & understanding a corpus
GNU General Public License v2.0

Visualize relationships between any number of extracted features #100

Open ericleasemorgan opened 4 years ago

ericleasemorgan commented 4 years ago

The Reader excels at: 1) feature extraction, and 2) listing those features. Features include named entities, parts-of-speech, email addresses, URLs, latent themes/topics, keywords, ngrams, etc. The listing of features is usually interesting and easy to interpret, but the student, researcher, or scholar often wants to discover the relationships between features and answer questions such as, "What features are central to a corpus?" or "What features are highly connected to other features?" The way to address these sorts of questions is through (interactive) network diagrams.
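To make "central" and "highly connected" concrete, here is a minimal sketch of degree centrality, one common way to rank nodes in a network. The edge list is made-up illustration data, not Reader output, and `degree_centrality` is a hypothetical helper name. The sketch assumes an undirected graph with no duplicate edges or self-loops.

```python
from collections import Counter

# Hypothetical co-occurrence edges between extracted features (illustration only)
edges = [
    ("whale", "ship"),
    ("whale", "sea"),
    ("ship", "sea"),
    ("whale", "captain"),
]


def degree_centrality(edges):
    """Return each node's degree divided by (n - 1), where n is the node count."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    n = len(degree)
    if n < 2:
        return {}
    return {node: count / (n - 1) for node, count in degree.items()}


centrality = degree_centrality(edges)
# Rank features from most to least connected
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

A feature connected to every other feature scores 1.0. Other measures (betweenness, eigenvector centrality) answer slightly different questions and may suit a given corpus better.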

Some of this work has previously been done by Team JAMS -- a few computer science students who took part in a PEARC hack-a-thon. (Team JAMS won first prize for their good work.) I took their efforts as a starting point, abstracted them, and made them a part of a different repository called "reader-workbook". See:

The first script (carrel2diagram.sh) is merely a front-end to everything else. The second script (carrel2json.py) does the hard work. It creates a stream of JSON and saves the result to the file system. The third script (template2html-diagram.sh) merely reads the template (template-diagram.htm), does a substitution, and sends the result to STDOUT as a stream of HTML. Finally, when the resulting HTML is loaded, a cool JavaScript library (D3.js) reads the JSON and outputs a network diagram. The process works pretty well, and the resulting diagrams are very interesting, but the process is not scalable, and it functions against only a tiny handful of our extracted features (namely, different types of nouns).
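For anyone unfamiliar with the shape of this pipeline, the following is a rough sketch of the JSON-then-template idea, not a copy of what carrel2json.py or template2html-diagram.sh actually do. It assumes the node-link layout ("nodes" and "links") commonly used in D3 force-directed examples. The function names and the `##JSON##` placeholder are hypothetical.

```python
import json
from collections import Counter
from itertools import combinations


def cooccurrences_to_graph(documents):
    """Build a node-link dictionary from lists of features, one list per document."""
    weights = Counter()
    for features in documents:
        # Each unordered pair of distinct features in a document is one co-occurrence
        for pair in combinations(sorted(set(features)), 2):
            weights[pair] += 1
    names = sorted({name for pair in weights for name in pair})
    return {
        "nodes": [{"id": name} for name in names],
        "links": [
            {"source": source, "target": target, "value": weight}
            for (source, target), weight in sorted(weights.items())
        ],
    }


def fill_template(template, graph, placeholder="##JSON##"):
    """Substitute the serialized graph for a placeholder in an HTML template."""
    return template.replace(placeholder, json.dumps(graph))
```

Usage might look like `fill_template(open("template-diagram.htm").read(), graph)`, with the result printed to STDOUT, assuming the template contains the placeholder.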

Your mission, if you choose to accept it, is to incorporate this into our repository, and increase scalability by parallelizing this whole process, probably by editing carrel2json.py. Remember, you will have at least 24 cores at your disposal. In the end, the output will include at least one network diagram of nouns saved in a study carrel at ./htm/network-diagram.htm.
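One possible way to parallelize, sketched under assumptions: split the documents into chunks, count co-occurring pairs in each chunk in a separate worker, and merge the partial counts. This is a generic map-and-merge pattern built on Python's `concurrent.futures`, not the existing logic of carrel2json.py. Every function name, the chunk size, and the input format (one list of features per document) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations


def count_pairs(documents):
    """Count co-occurring feature pairs in one chunk of documents."""
    counts = Counter()
    for features in documents:
        counts.update(combinations(sorted(set(features)), 2))
    return counts


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_pairs_parallel(documents, workers=24, chunk_size=100,
                         executor_class=ProcessPoolExecutor):
    """Map count_pairs over chunks in a worker pool, then merge the partial counts."""
    total = Counter()
    with executor_class(max_workers=workers) as pool:
        for partial in pool.map(count_pairs, chunked(documents, chunk_size)):
            total.update(partial)  # Counter.update adds counts rather than replacing
    return total
```

When using processes from a script, call `count_pairs_parallel` from under an `if __name__ == "__main__":` guard so worker start-up does not re-run the script body. The `executor_class` parameter makes it easy to swap in `ThreadPoolExecutor` for quick testing. Whether this actually speeds things up depends on where the time goes. If the bottleneck is reading files or querying a database rather than counting, the chunking would need to move to that step.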

For extra credit, create four different network diagrams, one for each different type of noun found in carrel2json.py.

Once we get this far, we will explore the creation of even more network diagrams illustrating the relationships between any number of things such as: 1) authors and keywords, 2) types of entities and DOIs, or 3) dates and places.
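Relating two different kinds of features, such as authors and keywords, amounts to building a bipartite graph: links only run between the two groups, never within one. The sketch below assumes each record is a dictionary with a list of values per feature type. That record layout, the function name, and the "group" attribute are all hypothetical, not the study carrel's actual schema.

```python
from collections import Counter
from itertools import product


def bipartite_graph(records, left="author", right="keyword"):
    """Link every `left` value to every `right` value found in the same record."""
    weights = Counter()
    for record in records:
        pairs = product(set(record.get(left, [])), set(record.get(right, [])))
        for pair in pairs:
            weights[pair] += 1
    left_names = sorted({l for l, _ in weights})
    right_names = sorted({r for _, r in weights})
    # Prefix ids with the feature type so an author and a keyword
    # with the same spelling remain distinct nodes
    nodes = [{"id": f"{left}:{name}", "group": left} for name in left_names]
    nodes += [{"id": f"{right}:{name}", "group": right} for name in right_names]
    links = [
        {"source": f"{left}:{l}", "target": f"{right}:{r}", "value": weight}
        for (l, r), weight in sorted(weights.items())
    ]
    return {"nodes": nodes, "links": links}
```

The same function would cover the other pairings (entity types and DOIs, dates and places) by changing the `left` and `right` arguments, provided the records carry those fields.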

In my mind, this is the most difficult hack to write, but the results will be one of the most well-respected features of a Distant Reader study carrel.

ericleasemorgan commented 4 years ago

How goes the work on this task?