Paper: questions on Query section

gesistsa / sweater

👚 Speedy Word Embedding Association Test & Extras using R

GNU General Public License v3.0


Paper: questions on Query section #13

Closed: cmaimone closed this issue 2 years ago

cmaimone commented 2 years ago

For this:

sweater uses the concept of query [@badilla2020wefe] to study the biases in $w$. A query contains two or more sets of seed words with at least one set of target words and one set of attribute words. sweater uses the $\mathcal{S}\mathcal{T}\mathcal{A}\mathcal{B}$ notation from @brunet2019understanding to form a query.

Need: concept of a query (missing a)
Why is STAB in mathematical notation?
Are target words and attribute words types of seed words? I think so, but that could be clearer

I would also find a little more info on target and attribute sets helpful. When you're supposed to supply two different sets, what is each supposed to be? What should be in S and what in T? I appreciate the references, and realize this may be complicated. Some type of brief summary here would help though. For example, for A and B, it seems each should be a set of words relevant to a group? Or the endpoints of a scale?

When you say target words shouldn't have bias, does that mean they are the words you're testing for bias?

cmaimone commented 2 years ago

This is re: https://github.com/openjournals/joss-reviews/issues/4036

chainsawriot commented 2 years ago

@cmaimone Thanks for the comments. I've addressed most of your points in the paper branch by explaining the seed words, why some words can be used as target words, etc.

c4fa5ceddf42ade4d691640b436f86b545b84861

Regarding the STAB notation, I believe @brunet2019understanding selected the four characters out of convenience (similar to the choices by Caliskan et al., XYAB; or V_m, V_1, and V_2 by Garg et al.), and there's no explanation of why those four characters were chosen.

The section in @brunet2019understanding about the notion

I could use more descriptive names (such as target_words_1, target_words_2, attribute_words_1, attribute_words_2), but again, that would break legacy code. And I think the documentation, the README, and now the paper have done a decent job of explaining what those 4 word sets are.

cmaimone commented 2 years ago

I wasn't suggesting renaming them - just maybe making it clearer what each is, since they are abstract letters. I figured it out eventually, but I got lost on my first read through the paper/example code.

For someone who hasn't read the original papers (I have intentionally not done so at this point since I think others will come to your package without having done so), the query section was hard to work through. It was clearer once I had an example. Maybe a brief example before the usage section would help -- not necessarily in code. It would help to know what we're doing before knowing the functions/code to do it. Then you could map the STAB letters back to the example.

This would, I think, also help people who have read some of these original papers, yes? Because the STAB notation is not consistent across them?
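
To make the letters concrete, here is a minimal sketch of what a four-set query could look like. It is written in plain Python and does not use sweater's R functions or reproduce its implementation. The toy 2-D vectors, the word lists, and the names `toy_vectors`, `query`, `association`, and `effect_size` are all assumptions made up for illustration. The effect size follows the WEAT-style formula from Caliskan et al. with a sample standard deviation, which may differ in detail from what sweater computes.

```python
import math
from statistics import mean, stdev

# Toy 2-D "embeddings", invented for illustration only (not real word vectors).
toy_vectors = {
    "math": (0.9, 0.1), "algebra": (0.8, 0.3),
    "poetry": (0.1, 0.9), "dance": (0.2, 0.8),
    "he": (1.0, 0.0), "man": (0.9, 0.2),
    "she": (0.0, 1.0), "woman": (0.2, 0.9),
}

# A query in STAB notation: S and T are target sets (the words being
# tested for bias); A and B are attribute sets (words standing for two groups).
query = {
    "S": ["math", "algebra"],   # target set 1 (like target_words_1)
    "T": ["poetry", "dance"],   # target set 2 (like target_words_2)
    "A": ["he", "man"],         # attribute set 1 (like attribute_words_1)
    "B": ["she", "woman"],      # attribute set 2 (like attribute_words_2)
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def association(word, A, B, vectors):
    """Mean similarity of `word` to set A minus its mean similarity to set B."""
    to_a = [cosine(vectors[word], vectors[a]) for a in A]
    to_b = [cosine(vectors[word], vectors[b]) for b in B]
    return mean(to_a) - mean(to_b)

def effect_size(q, vectors):
    """WEAT-style effect size: positive when S leans toward A more than T does."""
    s_assoc = [association(w, q["A"], q["B"], vectors) for w in q["S"]]
    t_assoc = [association(w, q["A"], q["B"], vectors) for w in q["T"]]
    # Sample standard deviation over all target words (an assumption here).
    return (mean(s_assoc) - mean(t_assoc)) / stdev(s_assoc + t_assoc)

print(effect_size(query, toy_vectors))
```

With these toy vectors the result is positive, meaning the S words sit closer to the A words than the T words do. Swapping A and B flips the sign, which is one way to see that the target sets carry the words under test while the attribute sets define the two ends of the comparison.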