DARIAH-DE / TopicsExplorer

Explore your own text collection with a topic model – without prior knowledge.
https://dariah-de.github.io/TopicsExplorer
Apache License 2.0

Visualizations not displaying #58

Closed juliaflanders closed 6 years ago

juliaflanders commented 6 years ago

I ingested a corpus of about 2100 short documents (UTF-8, no XML markup) and the progress bar showed successful completion of all the processing steps. (I used a stop word list of my own; I chose 100 topics and 100 iterations.) However, when the results opened up in the TopicsExplorer, the visualizations in sections 2.2 and 2.3 were invisible. (Screen shot attached.) The heat map controls are visible but not the data. The bar chart in section 2.4 is visible.

I exported the files using the "Export Graphics and Tables" feature, and opened the resulting HTML files (i.e. topics_barchart.html and heatmap.html) in a variety of browsers (Safari, Firefox, Chrome), but no heat maps or other visualizations were visible although the controls were visible. I also tried various things like selecting all content, clicking, reloading, and also just waiting to see whether perhaps there was a delay in case this was a large amount of data. The corpus_statistics.html file did display successfully.

I am running MacOS 10.13.6 on a MacBook Pro. The dariah_topic_modeling_interface image is of the TopicsExplorer interface. The dariah_topic_modeling_html image is of the exported HTML of the topics barchart (topics_barchart.html).

[Screenshots: dariah_topic_modeling_interface, dariah_topic_modeling_html]

This tool seems incredibly useful. Please let me know if there's any other information you need or other troubleshooting I need to do.

Thanks! Julia

severinsimmler commented 6 years ago

Hi Julia,

first of all, many thanks for the detailed bug report. 2100 documents is definitely a reasonable corpus for topic modeling, but the larger the corpus, the more difficult it becomes to visualize the entire result at once. In your case the heat map would have 2100 documents x 100 topics = 210000 cells. The visualizations in the Topics Explorer are interactive (and based on JavaScript), so I assume that the program reaches its technical limits with such complex graphics. We will discuss in our development team whether and how to solve this. Just keep an eye on this thread; you should get an email anyway if something is posted here.

What can you do until then? In the ZIP archive you saved there is a file document_topics.csv. It contains the calculated probability distribution of topics over the documents. All visualizations in the Topics Explorer are based on this matrix. For example, you could try to visualize the results with Microsoft Excel, or, depending on your programming skills, you could have a look at the Python library Topics. There is also a Jupyter notebook that explains the whole topic modeling workflow step by step. There you are a bit more flexible with the visualizations and could, for example, create a static heat map, which is probably technically feasible and not so complex, or select only the documents that you want to visualize in a heat map.
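As a rough illustration of the static heat map idea, here is a minimal sketch that uses only the Python standard library. It is not how the Topics Explorer or the Topics library draws its graphics. It assumes that document_topics.csv has a header row of column labels and a label in the first cell of every following row, with numeric values everywhere else; please check your file first, since the actual layout may differ. The function names (`read_matrix`, `heatmap_svg`, `export_subset`) and the output name heatmap.svg are made up for this example.

```python
import csv
from pathlib import Path


def read_matrix(path):
    """Read a labelled matrix (assumed layout: header row, then label + numbers)."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    col_labels = rows[0][1:]
    row_labels = [row[0] for row in rows[1:]]
    values = [[float(cell) for cell in row[1:]] for row in rows[1:]]
    return row_labels, col_labels, values


def heatmap_svg(values, cell=6):
    """Render a matrix as a static SVG string; a darker blue means a higher value."""
    # Scale colours by the largest value; fall back to 1.0 for an all-zero matrix.
    peak = max((v for row in values for v in row), default=0.0) or 1.0
    rects = []
    for y, row in enumerate(values):
        for x, value in enumerate(row):
            shade = round(255 * (1 - value / peak))
            rects.append(
                f'<rect x="{x * cell}" y="{y * cell}" '
                f'width="{cell}" height="{cell}" '
                f'fill="rgb({shade},{shade},255)"/>'
            )
    width = cell * (len(values[0]) if values else 0)
    height = cell * len(values)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">' + "".join(rects) + "</svg>"
    )


def export_subset(csv_path, svg_path, max_rows=100, max_cols=100):
    """Write a heat map of the first max_rows x max_cols cells; return its shape."""
    _, _, values = read_matrix(csv_path)
    subset = [row[:max_cols] for row in values[:max_rows]]
    Path(svg_path).write_text(heatmap_svg(subset), encoding="utf-8")
    return len(subset), (len(subset[0]) if subset else 0)


# Example call (file names are placeholders):
# export_subset("document_topics.csv", "heatmap.svg", max_rows=100, max_cols=100)
```

The resulting SVG has no interactivity and no axis labels, so it should open in any browser even for a few thousand cells. Restricting `max_rows` and `max_cols` is the "make a selection" variant.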

severinsimmler commented 6 years ago

Have you tried selecting only, say, 100 documents with 100 topics, to see whether the visualizations are displayed then?
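One way to build such a test corpus is to copy a random sample of the files into a separate folder and ingest that folder instead. The sketch below assumes the corpus is a flat folder of plain-text files ending in .txt; the function name `sample_corpus` and the folder names in the example call are placeholders, not part of the Topics Explorer.

```python
import random
import shutil
from pathlib import Path


def sample_corpus(source_dir, target_dir, size=100, seed=0):
    """Copy a reproducible random sample of .txt files into target_dir."""
    files = sorted(Path(source_dir).glob("*.txt"))
    # A fixed seed gives the same sample on every run.
    chosen = random.Random(seed).sample(files, min(size, len(files)))
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy2(path, target / path.name)
    return [path.name for path in chosen]


# Example call (folder names are placeholders):
# sample_corpus("my_corpus", "my_corpus_sample", size=100)
```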

juliaflanders commented 6 years ago

Hi Severin--

Thank you so much--this is very helpful. I will experiment with smaller corpora. I appreciate your suggestion about other options (although for my purposes right away, the simplicity of the topic modeling tool is of primary importance, so the Topics Explorer would be ideal). Looking more closely at the text in the Topics Explorer, I see that you do warn users about problems with larger corpora, so my apologies for overlooking this! I am always telling my students to read the documentation...

From the user perspective, a few possible/desirable behaviors for the tool in cases where the limit of display has been reached:

I have no idea if these are helpful or feasible, and you have probably already thought of them as well if so.

severinsimmler commented 6 years ago

Thank you very much for these valuable suggestions, Julia. We will definitely try to incorporate them into our further development – hopefully already in the next release.

severinsimmler commented 6 years ago

I'm closing this since I split your input into smaller working packages (#61, #62, #63, #64), but feel free to open a new issue. Thanks again.