Open jamesallenevans opened 4 years ago
Jurafsky, Daniel & James H. Martin. 2017
This is a very simple question, but I don't think I understand what this description means: "Gazetteer and name features are typically implemented as a binary feature for each name list". Does the binary feature represent whether a certain name appears in the input sentence? Or is this how we store the whole name list (like a one-hot encoding)?
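One plausible reading of that sentence, sketched here rather than taken from the textbook: each name list contributes one 0/1 feature per token, set to 1 when the token falls inside a phrase found on that list. The lists, the feature names, and the `gazetteer_features` helper are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical name lists; real gazetteers can hold many thousands of entries.
GAZETTEERS: Dict[str, Set[Tuple[str, ...]]] = {
    "in_location_list": {("chicago",), ("new", "york")},
    "in_airline_list": {("united", "airlines"), ("american", "airlines")},
}

def gazetteer_features(tokens: List[str]) -> List[Dict[str, int]]:
    """Return one feature dict per token, with one binary feature per name list."""
    lowered = [t.lower() for t in tokens]
    feats = [{name: 0 for name in GAZETTEERS} for _ in tokens]
    for name, entries in GAZETTEERS.items():
        for entry in entries:
            n = len(entry)
            # Slide a window of the entry's length over the sentence.
            for i in range(len(lowered) - n + 1):
                if tuple(lowered[i:i + n]) == entry:
                    # Mark every token covered by the matched phrase.
                    for j in range(i, i + n):
                        feats[j][name] = 1
    return feats

sentence = "United Airlines raised fares in Chicago".split()
for token, f in zip(sentence, gazetteer_features(sentence)):
    print(token, f)
```

Under this reading the list itself is not encoded at all. Only list membership is exposed to the classifier, alongside the other word-level features.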
In NER evaluation: "using entities as the unit of response but words as the unit of training means that there is a mismatch between the training and test conditions." Is this a problem we need to be concerned about, or can we ignore this mismatch?
In general, methods for IE take word-level features as inputs (along with punctuation, prefixes, suffixes, etc.) and classify the outputs at the word level too. This works well for English. For languages that cannot be easily tokenized and have isolating morphology, how would we proceed to conduct IE? Do you suggest designing the tasks at the character level instead of at the word level? Would a character-level model be able to capture patterns at the word level?
For Jurafsky, Daniel & Martin (2017)
I am curious if there is a way to group the events that were extracted. There are specific events that are bound to a specific date and time, and there are more general terms that denote a wider span of time (e.g., Lehman Brothers filing for bankruptcy in 2008 - The recession in 2008; The battle of Midway - The Pacific War - World War II). I wonder if there is a way to extract this kind of relationship and organize the events hierarchically.
Similar to attaching temporal information to events, I wonder if there is a way to attach spatial or geographical information to events, too. For example, "The graduation ceremony of UChicago" (which sadly is cancelled this year) could be attached to Chicago, IL.
+) I think @wanitchayap's third question is highly interesting, too. Many languages, including Chinese-influenced languages (Mandarin, Japanese), do not use spacing in their written form, and some languages that adopted spacing later in their development (e.g., Korean) have very irregular spacing rules that even native speakers can't follow perfectly. I am very curious about how tokenization can be done in languages like these.
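As a rough illustration of one classic baseline for unspaced text, here is a sketch of greedy forward maximum matching against a word list. It is only a baseline under stated assumptions: the tiny `LEXICON` and the `forward_max_match` helper are hypothetical, and practical segmenters usually rely on statistical or neural models instead.

```python
from typing import List, Set

def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is the fallback.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += n
                break
    return tokens

# Hypothetical toy lexicon for illustration only.
LEXICON = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}
print(forward_max_match("我们喜欢自然语言处理", LEXICON))
```

The greedy choice is what makes this approach brittle: whenever the longest match is the wrong one, every later token can shift, which is one reason character-level tagging is often preferred.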
Jurafsky & Martin 2017
In the section on NER evaluation, it is mentioned that a named entity is likely to span more than one word, which may cause the same error to be counted repeatedly when we evaluate it. Is there any way to get rid of this kind of problem?
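To make the word-versus-entity mismatch concrete, here is a minimal sketch that scores the same prediction both ways. It assumes BIO tags and exact-match scoring of entity spans; the helper names and the example tags are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)

def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from BIO tags; stray I- tags with no open entity are ignored."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def entity_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 with exact span and type match."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose tag matches the gold tag."""
    return sum(a == b for a, b in zip(gold, pred)) / len(gold)

gold = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC"]
pred = ["B-ORG", "O", "O", "O", "O", "B-LOC"]  # one wrong tag truncates the ORG span
print(token_accuracy(gold, pred), entity_prf(gold, pred))
```

Here a single wrong tag leaves word-level accuracy high, yet at the entity level it counts as both a missed gold entity and a spurious predicted one.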
As an extension of ordering the events in time, could we further detect the relationships between the events and order them logically? For example, in the sample text on page 22, the fare increase of American Airlines may be a follow-up to the fare increase of United Airlines, and the latter might be caused by high fuel prices. Hence, we may want to construct a logical order of events.
For Chapter 3, Foundations of Statistical Natural Language Processing
It is quite an informative and useful chapter, introducing how to decompose sentences into various components in order to understand their meaning. Here I have two specific questions:
One big concern mentioned in this chapter is syntactic ambiguity, and the phenomenon of garden pathing is discussed. I agree with the point in the chapter that some reduced forms of clauses could confuse the whole parsing path (the example in the book is "The horse raced past the barn fell."). This problem could be more severe when clauses are nested inside clauses in long sentences. Since this book was written in 1999, I wonder whether this issue has been ameliorated by the development of newer analytical methods such as machine learning?
The parsing structure in this chapter is mainly based on English grammar, which motivates me to think about how to deal with languages that have different grammatical structures, e.g., Asian languages like Chinese and Japanese. Should we apply NLP case by case to languages with different patterns, or should we anticipate a more inclusive algorithm that can parse various languages?
For Chapter 5, Foundations of Statistical Natural Language Processing
I'm quite inspired by this chapter as it applies many statistical methods to collocation analysis. Here are my questions:
The chapter provides many methods to identify potential collocations, but I wonder whether there are clear comparisons among all these methods? In other words, when we try to apply these methods in our own projects, which method is preferred? A follow-up question is: what if we try all the methods but get inconsistent results about whether certain phrases are collocations? In such a case, what conclusion should we draw?
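To show how two of the chapter's measures can disagree on the same data, here is a small sketch comparing pointwise mutual information with the t-test score. The counts are invented for illustration, and the t-score uses the approximation that the sample variance is close to the sample mean for rare events; the function names are hypothetical.

```python
import math

def pmi(c1: int, c2: int, c12: int, n: int) -> float:
    """Pointwise mutual information (base 2) of a bigram from raw counts."""
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

def t_score(c1: int, c2: int, c12: int, n: int) -> float:
    """t-test score, approximating the sample variance by the observed bigram probability."""
    observed = c12 / n
    expected = (c1 / n) * (c2 / n)  # probability under independence
    return (observed - expected) / math.sqrt(observed / n)

N = 1_000_000  # made-up corpus size
# (count of word 1, count of word 2, count of the bigram); all counts are invented.
pairs = {
    "frequent pair": (10_000, 5_000, 200),
    "rare pair": (20, 20, 10),
}
for name, (c1, c2, c12) in pairs.items():
    print(name, round(pmi(c1, c2, c12, N), 2), round(t_score(c1, c2, c12, N), 2))
```

With these counts PMI ranks the rare pair higher while the t-score ranks the frequent pair higher, so inconsistent rankings partly reflect what each measure rewards rather than an error in either one.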
The chapter mainly talks about how to identify potential collocations, but how do we get at their non-compositional meaning after that? For example, after we confirm that "strong tea" is a collocation, how do we determine what it really means? This seems necessary for the next stage of text analysis, identifying the social relations/networks behind texts.
Jurafsky & Martin 2017
This was quite a detailed and technical text that went into quite a few interesting concepts. I am curious how the patterns are used to extract relations if they aren't easy to express programmatically. In the other papers we read, someone mentioned the problem with sarcasm and tone, which I think would also apply here, beyond just extracting specific 'entities'. Also, if we use something like this to analyze social media data, might we run into issues when extracting entities and relations based on grammar, given general internet colloquialisms?
Jurafsky, Daniel & James H. Martin. 2017
I selected this fundamental reading because it speaks directly to an issue I have been thinking about for an early stage research project. Does the efficacy of the various types of information extraction discussed in the article depend on document type or format? For example, I have collected several PDFs of historical biographical directories of prestigious fellowship recipients (several hundred pages each). While the directories vary in the amount of biographical detail included for each scholar, they generally follow the same pattern for each individual entry. Below is an example.
How would information and relation extraction techniques deal with similar sources structured in column format? Motivated by @nwrim's first question about grouping extracted text, I am also interested in exploring methods to identify similarities and trends in the scholars’ career history patterns. Based on these potential groupings, I am interested in determining the potential effects of either formal government service or proximity to government on the scholars’ subsequent career.
On Information Extraction: for neural networks in NER, what is an efficient method for labeling training data? I assume a tremendous amount of labeling would be needed for a bidirectional LSTM to recognize entities.
Chapter 3, Foundations of Statistical Natural Language Processing:
In the process of parsing language, is there a higher weight placed on sentences and phrases that follow "ideal" grammar rules? For example, we're often taught that sentences should be active ("Children eat sweet candy") rather than passive ("Candy is eaten by children"). While both are correct, the former is preferred. Given that, is there reason to think our tools are better at parsing the former, or should they be equally good at accurately parsing either, since both are grammatically correct?
For Manning and Schütze (1999)
This is a really informative paper on linguistics and grammar. I have two general basic questions:
This is a question for all of the readings simultaneously. I am currently working on Soviet and Russian newspapers, which are obviously written in Russian. Many of the concepts presented (NER, collocation; the linguistics not so much) are translatable to Russian simply because they are theoretical frameworks and do not necessarily rely on a specific language, so long as there is a model available for that language. For Russian, I know of only a few models. Stanford NLP has a model for Russian. spaCy does not, but it allows for the possibility of building one. I know of several other models developed in Russia itself. I have several questions,
Admittedly, Information Extraction is a well-written technical textbook chapter that took me a while to digest, and I am still not confident that I understand most of it.
My question relates to the recognition of named entities in the first section. In the textbook,
A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization.
And this rule may extend to
dates, times, and other kinds of temporal expressions, and even numerical expressions like prices.
Under this definition, in the sample text following it, the name of spokesman Tim Wagner has been identified. However, I wonder whether the algorithm will also mark pronouns or descriptions elsewhere in the passage that refer to Tim Wagner, such as he, him, or the spokesman? This concern arises because, in the example text, not all nouns are recognized as named entities; lower-cost carriers, for instance, is not.
Apart from the above question, I want to join the discussion on processing languages that have no spacing within sentences or are otherwise hard to tokenize. As far as I know, particular algorithms and special libraries (such as jieba) have been developed to parse Chinese text. However, not all studies use these implementations, and I am looking forward to hearing people's ideas about it.
Jurafsky, Daniel & James H. Martin. 2017. This seems to be an excellent toolkit for getting to know information extraction. It applies a name - identity - relation - event approach, which seems quite complex. I am wondering whether there is an easier alternative? Also, it seems that a failure in an earlier step would lead to failure in the later stages. Do we have any mechanisms to guarantee accuracy?
Jurafsky, Daniel & James H. Martin. 2017
From this introductory chapter, it seems that the current techniques for information extraction are already pretty good. What are the remaining challenges to be solved, especially for the applications in social science?
From the fundamental readings, I see lots of statistical techniques, and I wonder if Bayesian methods can be applied in content analysis.
Jurafsky, Daniel & James H. Martin. 2017. I was wondering whether relation extraction can be done across texts. For example, there could be two news reports of a crime, one before the suspect is identified and the other after. In the former, the news will read something like "in case A, ? attacks C," whereas it is revealed as "in case A, B attacks C," in the latter. Will the algorithm be able to link these two reports together with case identifier A and fill in the blank of "?" with B?
I think these readings are really amazing and really opened a new world for me! In particular, I found the NER technique really useful when I did my homework.
But I think this technology may be easier to use with phonetic scripts (such as English). With ideographic scripts (such as Chinese), it may have a higher error rate. For example, any two characters in Chinese can form a person's name. There are endless such combinations, so the problem cannot be solved simply by enlarging the corpus or name list. I want to ask how NER would solve this problem?
Post questions here for one or more of our fundamentals readings:
Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press: Chapter 3 (“Linguistic foundations”): 81-113.
Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press: selections from Chapter 5 (“Collocations”): 151-163, 172-176, 183-186.
Jurafsky, Daniel & James H. Martin. 2017 (3rd Edition). Speech and Language Processing. Singapore: Pearson Education, Inc.: Chapter 18 (“Information Extraction”): 739-778.