HedvigS / gramfinder-typology-of-terminology-and-searching-OCRed-grammars


distributional semantics #4

Open HedvigS opened 8 years ago

HedvigS commented 8 years ago

@d97hah suggested that we could also use Latent Semantic Analysis or Random Indexing to compare similarity between texts.
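For concreteness, here is a minimal, hypothetical sketch of Random Indexing (toy code, not project code; the dimensionality, sparsity, and window size are arbitrary choices): each word gets a fixed sparse random "index vector", and a word's "context vector" is the sum of the index vectors of its neighbours. Words occurring in similar contexts end up with similar context vectors.

```python
import math
import random
from collections import defaultdict

DIM = 300      # dimensionality of the random index vectors (arbitrary choice)
NONZERO = 10   # number of +1/-1 entries per index vector
WINDOW = 2     # co-occurrence window, in tokens, on each side

def index_vector(rng):
    """A sparse ternary random vector: mostly zeros, a few +1/-1 entries."""
    v = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        v[pos] = rng.choice((1.0, -1.0))
    return v

def train(tokens, seed=0):
    """Build a context vector per word by summing the index vectors of its neighbours."""
    rng = random.Random(seed)
    index = defaultdict(lambda: index_vector(rng))     # fixed random vector per word
    context = defaultdict(lambda: [0.0] * DIM)         # accumulated context vectors
    for i, w in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if i != j:
                neighbour = index[tokens[j]]
                ctx = context[w]
                for d in range(DIM):
                    ctx[d] += neighbour[d]
    return context

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

On a toy text, two words that share contexts (e.g. "noun" and "verb" after "the" and before "takes") come out more similar to each other than to a word used in different contexts.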

skalyan91 commented 8 years ago

Or rather, to measure the semantic similarity between a search term and the chunks of a text (as opposed to directly searching for that term or its equivalents). This is a "bottom-up" approach that could save us the effort of coming up with exhaustive regexes.

HedvigS commented 8 years ago

Yes, right. Sorry, I wasn't clear there.

We'd still have to define a number of search terms though, and separate them into different in_languages, right?

skalyan91 commented 8 years ago

True; probably the "most common" way of labelling each concept (however we determine that).

HedvigS commented 8 years ago

Can someone clarify for me what the exact advantages would be of using a distributional semantics approach for multiple languages, rather than searching for certain distributions of known keywords? Ping @d97hah

d97hah commented 8 years ago

Isn't that obvious? You catch synonyms and related words.

skalyan91 commented 8 years ago

What do you mean by "using a distributional semantics approach for multiple languages"? We wouldn't be able to search in French and German using just the English term, if that's what you meant. We would still have to search separately in English, French and German. But within each language, we would just need one search term (or maybe a couple), and the program would automatically find synonyms and related terms.
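Concretely, once we have per-word vectors for one language (from LSA, Random Indexing, or similar), "automatically finding synonyms" is just a nearest-neighbour lookup in that space. A hypothetical sketch (`vectors` is assumed to map each word to an equal-length list of floats):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def related_terms(term, vectors, n=5):
    """Rank all other words in the (monolingual) space by similarity to `term`."""
    target = vectors[term]
    scored = sorted(
        ((cosine(target, v), w) for w, v in vectors.items() if w != term),
        reverse=True,
    )
    return [w for _, w in scored[:n]]
```

So one seed term per concept per language could be expanded into a ranked list of candidate synonyms and related terms.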

HedvigS commented 8 years ago

By multiple I meant separately, sorry for being unclear.

It is not entirely clear to me that many of the texts will be sufficiently long and similar to each other for this to be beneficial enough; it seems difficult to monitor and adjust the parameters so that the method works across many different types of language descriptions. I take it this is usually done on much larger bodies of text, no?

I only ask because I don't know of any study applying it to this type of material. Is that a weird question?

HedvigS commented 8 years ago

To illustrate: during the pilot coding phase for the inter-coder reliability study, these were some of the grammars of the languages randomly assigned to me:

loeweke-may_kaukombaran1982.pdf
lorimer_burushaski1935.pdf

They are very different. A simple count of certain terms, normalised as relative frequency, makes more sense to me than distributional semantics for these two, because I mainly know distributional semantics from large-scale monolingual or parallel corpora, where more things can be controlled for. Now, it might be that these two are at the extreme ends, and that more texts in our sample are comparable than I think. It might also be that the technique can be modified in ways I'm unaware of that make this more sensible. Hence my question: what are the exact advantages in this context compared to frequency of terms?
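By "a simple count of certain terms with relative frequency" I mean something like the following (hypothetical sketch; the tokenisation and the term list are made up for illustration):

```python
import re
from collections import Counter

def relative_frequency(text, terms):
    """Count each term in `terms` and normalise by the document's total token count."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude alphabetic tokeniser
    total = len(tokens)
    counts = Counter(tokens)
    return {t: counts[t] / total for t in terms} if total else {}
```

This baseline works identically on a 40-page sketch and a 600-page reference grammar, which is part of its appeal for such a heterogeneous sample.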

I just thought it would be beneficial to talk about this, so we can move forward soon? If we're going for distributional semantics, I'll update the features differently than if we're going for regexes.

d97hah commented 8 years ago

@HedvigS do you know what distributional semantics is (as in e.g. [1])? Your comment on parallel corpora seems to suggest otherwise. As far as I am aware, the sets of OCRed descriptions in English, French, etc. each constitute exactly a "large-scale monolingual corpus", larger than the first sets of texts to which distributional semantics was applied.

[1] Magnus Sahlgren (2006). *The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces*. PhD thesis, Stockholm University, Stockholm.

HedvigS commented 8 years ago

@d97hah Yes, I am familiar with this, but I don't use it myself, hence my asking how it would work in this context. (I don't see why the work that Wälchli et al. do on parallel texts does not broadly fall within distributional semantics (cf. Verkerk's PhD thesis, Sinha & Kuteva 1995, etc.), but that's rather beside the point here, so let's not dwell on it.)

Either way, you are saying that you believe the material is large enough, and similar enough across documents, for this to work and be advantageous? And furthermore, we are to regard all texts in one language as one set, rather than treating each document separately, or grouping documents from a certain period, etc. Is this correct? There are many assumptions involved here, and it is best if we make them explicit. Language descriptions are a special type of text; I'm just trying to get my head around what that means for this method. I want to understand the factors involved in this case and move forward.

I know many things might seem obvious or trivial to you, and if you prefer, you (pl.) can do this bit on your own instead. Just let me know what the next step is where the others and I can be useful.

skalyan91 commented 8 years ago

@HedvigS Yes, we would consider all language descriptions in (say) English as one set. (Actually, we could experiment with a more refined division, e.g. grammars of Australian languages in English, grammars of West African languages in French, etc., to capture local differences in terminology, but that's for later.)

The idea is that we would chunk every description into paragraphs, and combine all paragraphs across documents, for the purpose of extracting meaning from each paragraph. Then we'd take a search term, and compute its similarity to every paragraph in each document.
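A skeleton of that pipeline, as I understand it (hypothetical sketch: plain term-frequency cosine stands in for the LSA/Random Indexing similarity, and splitting paragraphs on blank lines is an assumption about the OCRed input):

```python
import math
import re
from collections import Counter

def paragraphs(doc):
    """Split a document into paragraph chunks on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]

def tf_vector(text):
    """Bag-of-words term-frequency vector as a Counter."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse Counter vectors."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_paragraphs(doc, query):
    """Score every paragraph against the query; highest-scoring first."""
    q = tf_vector(query)
    return sorted(
        ((cosine(q, tf_vector(p)), p) for p in paragraphs(doc)),
        reverse=True,
    )
```

In the real version, the paragraph and query vectors would live in the reduced semantic space learned from all documents of one language combined, so that a query like "verb agreement" could also surface paragraphs using related wording.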

@d97hah correct me if I've misunderstood.

HedvigS commented 8 years ago

@skalyan91 Ok, that makes sense.

I'm still not clear on how that tackles the fact that this is a very different kind of material from what, for example, Sahlgren used, but never mind. I'll leave this for now; I'm not getting anywhere with it.

Would you like to first proceed with the 19 features we worked with before until the Grambank questionnaire is finalised? I.e. these: https://docs.google.com/spreadsheets/d/1k_6BuQbOYOTURIfcS5WGk4YeppyjzPbfHXtrXqZ-O5k/edit?usp=sharing