UKPLab / argument-reasoning-comprehension-task

The Argument Reasoning Comprehension Task: Source codes & Datasets
Apache License 2.0

Room For Debate dataset Construction #2

Open rudra0713 opened 2 years ago

rudra0713 commented 2 years ago

Hi, I am planning to create a dataset from the NYT Room for Debate website that contains claims, articles, and the stance label of each article towards its claim. For example, given the following URL, I would like to extract the claim, the article and, if possible, the stance label.

URL: https://www.nytimes.com/roomfordebate/2017/01/09/can-india-put-an-end-to-identity-politics/the-indian-courts-ban-on-identity-politics-will-have-unintended-consequences

Claim: Can India Put an End to Identity Politics?

Article: The Indian Court’s Ban on Identity Politics Will Have Unintended Consequences. Last week’s 4-part, 113-page 4-3 ruling by India's Supreme Court banning appeals to identity in electoral politics is well-written and grounded in compassion. It is also grossly misguided. The opinion hinges on appeals “by a candidate or his agent or by any other person with the consent of a candidate on the ground of his religion, race, caste, community or language,” but the justices have not made clear what would constitute “consent” or how precisely to determine who is a candidate's agent. Will candidates now seek to weaken their opponents by hauling them to court on the basis of something one of their supporters says? Will media houses, already regularly accused of being biased toward one side or the other, now be considered candidate’s agents? What happens if a journalist then seeks to advocate for a particular community in need? .......... The net result, in short, will be four-fold. Politics in India, already very rough, will become blood sport, as candidates and their party machines seek to use the judgement to disqualify their challengers. Social justice advocates, already in a precarious position, will be further pushed onto the back foot. The ruling will likely be ultimately ignored, because it is so broad that it is unenforceable. And this will, in turn, have the unfortunate effect of undermining the legitimacy of the court itself, and erode the already weak faith in institutions further — something democratic societies throughout the world can ill afford right now.

Stance Label: Con/ Disagree

I have read your documentation but I am confused about a few things:

  1. In https://github.com/UKPLab/argument-reasoning-comprehension-task/tree/master/roomfordebate/src/main/resources, you have added the URLs for many debates. Have you crawled the claims and articles too? If so, where can I find them in this GitHub repo?

  2. Have you used crowdsourcing or any other method to determine the stance of the article towards the claim? If so, where can I find the labels in this GitHub repo?

Thanks in advance.

habernal commented 2 years ago

  1. As far as I remember, I compiled the list of all URLs by simply crawling the entire NYTimes Room for Debate subsection using Apache Nutch (externally, not part of this repo). Then I extracted these URLs using https://github.com/UKPLab/argument-reasoning-comprehension-task/blob/master/roomfordebate/src/main/java/de/tudarmstadt/ukp/experiments/roomfordebate/URLsFromWarcExtractor.java . If you already know your URLs, you can skip this step.

  2. Have a look at https://github.com/UKPLab/argument-reasoning-comprehension-task/blob/master/roomfordebate/src/main/java/de/tudarmstadt/ukp/experiments/roomfordebate/DebateFetcher.java . It takes a URL (such as the one in your example, https://www.nytimes.com/roomfordebate/2017/01/09/can-india-put-an-end-to-identity-politics/the-indian-courts-ban-on-identity-politics-will-have-unintended-consequences ) and downloads the full article, including all comments, as HTML. It relies on Selenium because NYTimes comments are loaded via JavaScript. Note: this worked five years ago, but the site might have changed drastically since then.

  3. I extracted the full content of each HTML page (article, stance, comments, etc.) using https://github.com/UKPLab/argument-reasoning-comprehension-task/blob/master/roomfordebate/src/main/java/de/tudarmstadt/ukp/experiments/roomfordebate/DebateHTMLParser.java . It outputs structured XML (I can't find an example now, sorry). This should answer your second question: the stance of the article is determined by its subtitle. There are always two opposing sides, and one has to figure out what the main claim is ("we should ban private schools") and then which side is pro and which is con. I did this manually; I think this was the file: https://github.com/UKPLab/argument-reasoning-comprehension-task/blob/master/roomfordebate/src/main/resources/rfd-controversies/rfd-manual-cleaning-controversies.tsv
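
If you already have a list of article URLs, a minimal sketch along these lines (Python, not part of this repo) can group them by debate before fetching. It assumes every article URL follows the layout of the example above, `roomfordebate/YYYY/MM/DD/<debate-slug>/<article-slug>`; the function names `split_rfd_url` and `group_by_debate` are hypothetical and exist only for this illustration.

```python
from datetime import date
from urllib.parse import urlparse


def split_rfd_url(url):
    """Split a Room for Debate article URL into date, debate slug and article slug.

    Assumes the layout roomfordebate/YYYY/MM/DD/<debate-slug>/<article-slug>,
    as in the example URL from this thread.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 6 or parts[0] != "roomfordebate":
        raise ValueError(f"unexpected URL layout: {url}")
    year, month, day = (int(p) for p in parts[1:4])
    return {
        "date": date(year, month, day),
        "debate_slug": parts[4],
        "article_slug": parts[5],
    }


def group_by_debate(urls):
    """Group article URLs under their debate slug; URLs with another layout are skipped."""
    debates = {}
    for url in urls:
        try:
            info = split_rfd_url(url)
        except ValueError:
            continue
        debates.setdefault(info["debate_slug"], []).append(url)
    return debates
```

The debate slug is only a lowercased, hyphenated form of the debate title, so the exact claim text (with punctuation) and the article subtitle still have to come from the downloaded page itself.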

We did plenty of crowdsourcing but on the comment level, not for articles. See the NAACL paper for details.

Hope it helps!

habernal commented 2 years ago

I had a second look, and in fact we didn't have stances annotated for each article; we were interested in the comments only.

I've uploaded the full Room for Debate dataset as XML; see https://github.com/habernal/argument-reasoning-comprehension-task/tree/master/roomfordebate/raw-data-room-for-debate-2010-2016-polar-questions
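
The element names used in these XML files are not described in this thread, so a first step could be to list which tags actually occur before writing an extractor. The following is a rough Python sketch under that assumption: it makes no claim about the schema, and the helper names `tag_counts` and `summarize_directory` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def tag_counts(xml_path):
    """Count how often each element tag occurs in one XML file."""
    counts = Counter()
    # iterparse reads the file incrementally and yields each element once it is closed
    for _event, elem in ET.iterparse(str(xml_path), events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop text and children that are no longer needed
    return counts


def summarize_directory(data_dir):
    """Aggregate tag counts over every *.xml file below data_dir."""
    total = Counter()
    for path in sorted(Path(data_dir).rglob("*.xml")):
        total.update(tag_counts(path))
    return total
```

Calling `summarize_directory` on a local checkout of the directory linked above returns a `Counter` of tag names, which should show where debate titles, articles, and comments are stored.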

rudra0713 commented 2 years ago

Thanks a lot for the detailed response. I have found the debate title and articles from the last link you shared.

habernal commented 2 years ago

You're welcome! Citing our NAACL paper is much appreciated; see the README. Closing this issue.