dnaaun / openFraming

Tools for automatic frame discovery and labeling based on topic modeling and deep learning, made widely accessible to researchers from non-computational backgrounds.
http://www.openframing.org

Add a FAQ page #128

Closed: monajalal closed this issue 4 years ago

monajalal commented 4 years ago

To add the FAQs Alyssa designed

https://docs.google.com/document/d/1dzqKNZjOk_bhR_S6KQ6P7Wdywk7YLOcCfDRbxE3m3go/edit?usp=sharing

Q&A

General

Q: I do not have a computational background, can I use your system?
A: Absolutely! If you follow the instructions, you should be all set. The system is specifically designed to be usable by researchers without computational backgrounds. That said, we recommend reading research papers about how the computational methods used here are applied in journalism and communication research. See the recommended readings below.

Q: Do you have any suggested research papers about the methods used in this project?
A: Akyürek, A. F., Guo, L., Elanwar, R., Ishwar, P., Betke, M., & Wijaya, D. (2020). Multi-label and multilingual news framing analysis. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8614–8624. We will compile a list of additional papers.

Q: You offer several options: LDA topic modeling, as well as two ways of using BERT. What is the difference between these methods? Which one should I choose?
A: There is some nuance here, but generally speaking: LDA is best suited for discovering topics in your corpus of documents. If you are trying to answer questions like “What are people talking about when they discuss this issue?” or “What patterns come up in news coverage of this issue?”, then LDA is probably a good fit.

If you are trying to find out which specific frames are used in a large set of documents, and you already have a good sense of what those frames look like, use our BERT-based classifier instead. If you have a subset of these documents labeled with the primary frame each one employs, you can train a classifier to help answer questions like “What fraction of my dataset involves this frame?” or “Which frames does this new dataset employ, relative to the frames in the dataset I have already labeled?”

And finally, if you are studying a topic that we have already built a classifier for (gun violence, same-sex marriage, or climate change), you can simply upload your unlabeled documents and get back a framing label for each document.

Q: What is the difference between unsupervised and supervised machine learning?
A: Supervised machine learning involves user input on the dataset. The user says to the computer, “I have a pattern that I want you to understand. Here are some examples of that pattern,” and the computer learns the pattern as best it can. Unsupervised machine learning involves no such input. The user wants to discover patterns in the dataset at a scale or level of complexity that may not be easy for humans to perceive, but does not tell the computer what those patterns are. Instead, the computer tells the user what patterns it has discovered.

Q: What do you mean by a “document”?
A: A document is a single, standalone unit of text. Examples of documents include news articles, tweets, and paragraphs.

Q: Is there a requirement for the data types? Can I use your system to analyze tweets?
A: Absolutely! Keep in mind that longer documents tend to be easier to analyze for computers (and, often, for humans), but analyzing tweets is quite doable for the models we have implemented here.

Q: Is there a requirement for the minimum and maximum length of each document?
A: Each document should be at least around the size of a tweet; single words or phrases will not yield a productive analysis. Longer documents (a corpus of books, for example) can be analyzed, but our system may take a very long time to handle the load.

Q: Your system is called OpenFraming. Can I use your system to do other types of analysis, e.g., sentiment detection?
A: Yes. The BERT-based classifier is fairly flexible and can learn many different classification patterns. As long as you provide coherent labels for classification training, it should be able to pick up on those patterns and do sentiment detection, for example.

Q: How long will it take to complete the analysis?
A: [TODO: get an answer from David]
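The supervised/unsupervised distinction above can be sketched with a toy, stdlib-only example. The documents, labels, and word-overlap similarity below are invented for illustration; this is not OpenFraming's actual code, which uses BERT and LDA rather than these crude stand-ins.

```python
from collections import Counter

# Toy corpus: each "document" is a short standalone text, per the FAQ's definition.
labeled_docs = [
    ("new gun law passed by the senate", "policy"),
    ("senate votes on gun legislation", "policy"),
    ("shooting victim remembered at vigil", "human_interest"),
    ("community mourns victim of shooting", "human_interest"),
]
unlabeled_doc = "senate debates gun law"

def word_overlap(a: str, b: str) -> int:
    """Crude similarity: the number of words two documents share."""
    return len(set(a.split()) & set(b.split()))

# --- Supervised: we supply examples of the pattern (the labels). ---
# Predict the label of the most similar labeled document (1-nearest neighbor).
_, best_label = max(labeled_docs, key=lambda d: word_overlap(d[0], unlabeled_doc))
print(best_label)  # -> policy

# --- Unsupervised: no labels; the computer reports the groups it finds. ---
# Greedy grouping: join an existing group if the document shares words with
# the group's first member, otherwise start a new group.
groups: list[list[str]] = []
for text, _label in labeled_docs:
    for group in groups:
        if word_overlap(group[0], text) > 0:
            group.append(text)
            break
    else:
        groups.append([text])
print(len(groups))  # -> 2 discovered groups, no labels needed
```

The supervised half needed us to say what the pattern is (via labels); the unsupervised half reported two groups it found on its own.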

LDA

Q: What is the minimum and maximum number of documents I should upload?
A: [TODO: get answer (pertaining to the backend) from David]

Q: What is an LDA “topic”? Is it equivalent to a “frame”?
A: An LDA topic is not necessarily equivalent to a frame. Topics don’t necessarily correspond to a way of talking about or understanding an issue; they might instead correspond to different aspects of the issue, or to different events that are covered as part of the issue.

BERT using your own labeled data

Q: How many labeled documents should I provide?
A: [TODO: get answer from David. ~500 seems to be where quality really starts to fall off.]

Q: How many unlabeled documents can your system predict?
A: [TODO: get answer from David. We want to set a limit that is reasonable and won’t overload the system.]

Q: What do you mean by a training and a testing set?
A: A training set is the dataset the model uses to learn the pattern it is supposed to understand. A testing set is the dataset used to measure how well the model has learned that pattern. The training set and the testing set should come from the same corpus, so they exhibit the same pattern, and the model should not see examples from the testing set during training. Otherwise it could simply memorize the test examples, which would prevent it from generalizing to other unseen data later on.

Q: What do I do if the model performance is not good?
A: You can try a few things. If you have a lot of label categories, it might help to combine related categories into a single one; this makes labeling an easier task for the model. For example, “unemployment” and “tax” can be combined into one category, “economy.” You can also add more labeled examples, especially if the pattern you are trying to teach the model is complex (e.g., the labeling task involves subtleties like sarcasm, or the documents are short).
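The train/test split and label-merging advice above can be sketched as follows. The corpus, the label names, and the 80/20 split ratio are assumptions made for illustration, not OpenFraming's internals.

```python
import random

random.seed(0)  # deterministic shuffle for the example

# A hypothetical labeled corpus: (document, label) pairs.
labels = ["unemployment", "tax", "health"]
labeled = [(f"document {i}", labels[i % 3]) for i in range(100)]

# Shuffle, then hold out a test set the model never sees during training:
# evaluating on held-out data measures generalization, not memorization.
random.shuffle(labeled)
split = int(0.8 * len(labeled))          # 80 training / 20 testing examples
train_set, test_set = labeled[:split], labeled[split:]
print(len(train_set), len(test_set))     # -> 80 20

# If performance is poor, one remedy from the FAQ is merging related
# label categories into a broader one before retraining.
merge = {"unemployment": "economy", "tax": "economy"}
train_merged = [(doc, merge.get(label, label)) for doc, label in train_set]
```

Merging leaves only "economy" and "health" as categories, a coarser but easier pattern for a classifier to learn.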

BERT using the existing models

Q: I have a dataset with news articles published during a specific date range. Can I use your model to generate reliable labels?
A: [TODO: can you explain what you mean by “reliable”? What dataset are you going to be generating labels from?]

Q: I used one of the models to predict labels for my data, and I find the labels assigned to some documents do not make sense. Why does that happen?
A: Computers make mistakes, and sometimes a label that makes a lot of sense to a computer may not make sense to us. Also, our training data may not be as recent as the data you are running inference on. If an issue has changed drastically in recent times, we may not have retrained our pretrained models to reflect that.

Q: I have a sample of labeled data that pertains to one of the topics. Should I use the existing models, or is it better to train a new model?
A: That depends on your purposes. Try the existing models first. Are the frames similar to the frames you are thinking about? Do the labels make sense for the work you are trying to do? If so, it makes more sense to use the existing model; if not, you may need to label your own dataset and train a new model.

monajalal commented 4 years ago
(Three screenshots of the FAQ page, taken 2020-07-03.)