JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Compare 2 corpus over time #69

Closed MastafaF closed 3 years ago

MastafaF commented 3 years ago

Hi,

Suppose we have Democratic and Republican speeches spanning January and February.

I would like to compare how topics evolve over time for Democrats and Republicans.

Specifically, I would like to show the evolution of topics across January and February on the same chart we currently have (x-axis Democrats, y-axis Republicans). In other words, (Topic 1, January) would appear as one point in the 2D chart and (Topic 1, February) as a second point. Similarly, we would have (Topic 2, January) and (Topic 2, February).

Also, clicking on (Topic 1, January) should show only the documents from January. The side-by-side text view would therefore need to filter by the chosen month in addition to the topic keywords.

What would you recommend as the best way to achieve this? 😄

JasonKessler commented 3 years ago

Hi Mastafa,

That sounds like an interesting project.

I'd recommend looking at Scattertext's FourSquare/FourSquareAxis visualizations, documented in https://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Explore-Headlines.ipynb.

What this lets you do is pick two sets of contrasting categories (Democrat/Republican and Jan/Feb) and use them to position terms on a single chart. The x-axis would be the Dem/Repub term-association, and the y-axis the term's month association.
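The idea of placing each term on two contrast axes can be sketched without Scattertext at all. The following is a minimal illustration, not Scattertext's scoring method: it assumes documents arrive as `(party, month, text)` tuples, uses a smoothed log frequency ratio as a stand-in association score, and treats positive x as Democratic and positive y as January. The names `log_ratio` and `term_positions` and the labels `dem`, `rep`, `jan`, `feb` are all hypothetical.

```python
import math
from collections import Counter


def log_ratio(term, counts_a, counts_b, vocab_size, smoothing=1.0):
    """Smoothed log ratio of a term's relative frequency in group A vs. group B."""
    total_a = sum(counts_a.values()) + smoothing * vocab_size
    total_b = sum(counts_b.values()) + smoothing * vocab_size
    p_a = (counts_a[term] + smoothing) / total_a
    p_b = (counts_b[term] + smoothing) / total_b
    return math.log(p_a / p_b)


def term_positions(docs):
    """Map each term to (x, y): x = Dem-vs-Rep score, y = Jan-vs-Feb score.

    `docs` is an iterable of (party, month, text) tuples, where party is
    "dem" or "rep" and month is "jan" or "feb" (assumed labels).
    """
    party_counts = {"dem": Counter(), "rep": Counter()}
    month_counts = {"jan": Counter(), "feb": Counter()}
    for party, month, text in docs:
        tokens = text.lower().split()  # naive whitespace tokenization
        party_counts[party].update(tokens)
        month_counts[month].update(tokens)

    vocab = set(party_counts["dem"]) | set(party_counts["rep"])
    size = len(vocab)
    return {
        term: (
            log_ratio(term, party_counts["dem"], party_counts["rep"], size),
            log_ratio(term, month_counts["jan"], month_counts["feb"], size),
        )
        for term in vocab
    }
```

A term used equally by both parties and in both months lands near the origin, while a term used mostly by Democrats in January lands in the upper-right quadrant.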

Alternatively, you could use a FourSquare, associate each corner of the chart with one of the four categories, and make each axis a pair of those categories (e.g., the x-axis would be Dem-Jan vs Dem-Feb and the y-axis would be Repub-Jan vs Repub-Feb).

These are all versions of semiotic squares, which were the inspirations for the charts. Please see http://www.signosemio.com/greimas/semiotic-square.asp for an introduction to semiotic squares.

Jason

MastafaF commented 3 years ago

Hi @JasonKessler, very interesting reading, but I feel it is not entirely intuitive for a new user.

Why not keep a chart similar to the one we currently have and add color coding, so the user can see which month a given point/topic comes from?

That way, the user could easily see how the topic has moved over time. (An arrow showing how the topic moved from January to February in the 2D chart could probably do the job.)

JasonKessler commented 3 years ago

Will Hamilton has done something similar in studying diachronic shifts in word embeddings. You may want to look at some of his work.

MastafaF commented 3 years ago

Awesome, I just took a look at his histwords repo, and visually it is what I am asking for. However, we would need to adapt it, since I am interested in the frequency of a given topic rather than the general meaning of the word/topic, which is what Hamilton's work currently tracks. Any idea how we could adapt his work to the topic view we currently have in scattertext?

JasonKessler commented 3 years ago

Unfortunately, I don't have the bandwidth to sketch out or think through such a proposal.

Why don't you code up a prototype, run it on some datasets, and post the code and an overview? Then we can talk about refining it and integrating it into scattertext.

MastafaF commented 3 years ago

Hi @JasonKessler,

Thanks for the suggestion. I don't have much experience with d3.js, so I will do my best to come up with something simple as a v1. If you have ideas on where to start, that would be much appreciated.

I feel that tagging each point with its month and its topic would be a great start. I could then iterate over the data points, order them by month, and build an arrow from Month i to Month i+1.
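The tagging-and-arrows step can be sketched in plain Python before any d3.js work. This is only an illustration under stated assumptions: each point is a `(topic, month, x, y)` tuple, where x and y are the topic's positions on the Democrat and Republican axes, and the month order is supplied explicitly. The names `topic_arrows` and `MONTH_ORDER` are hypothetical and not part of scattertext.

```python
from collections import defaultdict

# Assumed chronological order of the month tags.
MONTH_ORDER = ["January", "February"]


def topic_arrows(points, month_order=MONTH_ORDER):
    """Build one arrow per topic between each pair of consecutive months.

    `points` is an iterable of (topic, month, x, y) tuples.
    Returns a list of (topic, from_month, to_month, (x0, y0), (x1, y1)).
    """
    # Group coordinates by topic, keyed by month tag.
    by_topic = defaultdict(dict)
    for topic, month, x, y in points:
        by_topic[topic][month] = (x, y)

    arrows = []
    for topic, coords in sorted(by_topic.items()):
        # Keep only the months this topic actually has, in chronological order.
        months = [m for m in month_order if m in coords]
        # Pair Month i with Month i+1.
        for start, end in zip(months, months[1:]):
            arrows.append((topic, start, end, coords[start], coords[end]))
    return arrows
```

A topic that appears in only one month produces no arrow, so it would show up as a single point. Each returned tuple carries enough information for a front end to draw a line with an arrowhead and to color the endpoints by month.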

The second step would be to filter the text data by month, so that clicking on Topic 1/January and Topic 1/February shows different views.
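The month filter for the text view could start from something like the sketch below. It assumes each document is a dict with `month` and `text` keys and that a topic is represented by a list of keywords; the function name `filter_documents` and the dict layout are hypothetical, not how scattertext stores its documents.

```python
import re


def filter_documents(docs, month, keywords):
    """Return the documents from `month` that mention at least one topic keyword.

    `docs` is a list of dicts with (assumed) keys "party", "month" and "text".
    Matching is case-insensitive and on whole words.
    """
    wanted = {keyword.lower() for keyword in keywords}
    selected = []
    for doc in docs:
        if doc["month"] != month:
            continue
        # Split the text into lowercase word tokens, dropping punctuation.
        tokens = set(re.findall(r"\w+", doc["text"].lower()))
        if wanted & tokens:
            selected.append(doc)
    return selected
```

Clicking on Topic 1/January would then call this with `month="January"` and the keywords of Topic 1, and the result could be split by `party` to fill the two side-by-side columns.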