IIT-ML / interpretable-text-classification

Interpretable and cautious text classification

Identify additional datasets #5

Open m-bilgic opened 5 years ago

m-bilgic commented 5 years ago

The XAI submission had only sentiment analysis experiments. We need to find other text classification datasets besides sentiment analysis. Please write your suggestions below.

m-bilgic commented 5 years ago

Here are some ideas:

  1. News categorization.
  2. Content-based news recommendation: based on a user's liked and disliked news documents, recommend additional news documents. Interpretability is quite important for this domain.
  3. Scientific paper categorization (e.g., arXiv).

Please add your suggestions.

m-bilgic commented 5 years ago

Note that the original dataset you propose can be non-binary; that's no problem. Our experiments will be binary classification; all we need to do is "one class" versus "the rest". So do not limit yourself to binary classification datasets when proposing a domain/dataset.
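As a minimal sketch of that relabeling (the label values here are hypothetical):

```python
import numpy as np

def to_binary(labels, positive_class):
    """Collapse multi-class labels into positive_class (1) vs. the rest (0)."""
    labels = np.asarray(labels)
    return (labels == positive_class).astype(int)

# Hypothetical 7-class abstract labels reduced to "CS" vs. the rest:
y = np.array(["CS", "EE", "Math", "CS", "Bio"])
y_binary = to_binary(y, "CS")  # -> array([1, 0, 0, 1, 0])
```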

annekehdyt commented 5 years ago
  1. Scientific peer reviews: binary classes (accept/reject). Consists of 14K paper drafts with accept/reject decisions from top-tier venues including ACL, NIPS, and ICLR.

  2. Abstract dataset: 7 classes (CS, EE, etc.); consists of 45K abstracts.

  3. AGnews Corpus Dataset: this is the most popular one; it has about 1 million news articles. Category classification.

I will post other datasets when I find more.

annekehdyt commented 5 years ago

Another common dataset for movie recommendation is MovieLens. They released several versions (1M, 10M, 20M), named by their number of ratings. I can't find any news recommendation dataset.

annekehdyt commented 5 years ago

In the meantime, I'd like to try the Amazon dataset (a bigger pool) and the Yelp dataset. Since both datasets are big, I'm going to figure out how to process them in batches. Also, if we use a bigger dataset, I need to train with a larger batch size (instead of 1) and a different number of epochs.

Note that if we do this, we are no longer able to completely compare with the LR result.

Let me know what you think.
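A possible out-of-core setup is sketched below; the file and field names assume the standard line-delimited Yelp review dump, and the 4-stars-and-up binarization is only an illustration. One mitigating note: SGDClassifier with log loss is logistic regression fitted incrementally, so mini-batch training need not fully abandon the LR comparison.

```python
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so it can featurize one chunk at a time;
# SGDClassifier with log loss is logistic regression trained incrementally.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(loss="log_loss")

# Assumed file name and field names for the line-delimited Yelp dump.
reader = pd.read_json("yelp_academic_dataset_review.json",
                      lines=True, chunksize=10_000)
for chunk in reader:
    X = vectorizer.transform(chunk["text"])
    y = (chunk["stars"] >= 4).astype(int)  # assumed positive/negative split
    clf.partial_fit(X, y, classes=[0, 1])
```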

m-bilgic commented 5 years ago
  1. Scientific peer reviews: binary classes (accept/reject). Consists of 14K paper drafts with accept/reject decisions from top-tier venues including ACL, NIPS, and ICLR.

Sounds good. Let's first try a logistic regression and see what the top 100 keywords are before we do more testing.
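A minimal sketch of that check with scikit-learn (the data variables are hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def top_keywords(texts, labels, k=100):
    """Fit LR on tf-idf features; return the k terms with the most
    positive and the k terms with the most negative coefficients."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    terms = np.asarray(vectorizer.get_feature_names_out())
    order = np.argsort(clf.coef_[0])
    return terms[order[::-1][:k]], terms[order[:k]]

# Hypothetical usage on the peer-review data:
# accept_words, reject_words = top_keywords(drafts, decisions)
```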

  2. Abstract dataset: 7 classes (CS, EE, etc.); consists of 45K abstracts.

Sounds good. Let's pick a binary split. Options: 1) one class versus the remaining six classes, 2) one class versus another class. I like "one class" versus "another class" because the top features for "another class" would be more meaningful than the top features for the "remaining six classes", but I can be convinced otherwise. Suggestions?

  3. AGnews Corpus Dataset: this is the most popular one; it has about 1 million news articles. Category classification.

What are the categories?

m-bilgic commented 5 years ago

Another common dataset for movie recommendation is MovieLens. They released several versions (1M, 10M, 20M), named by their number of ratings. I can't find any news recommendation dataset.

I'm familiar with the MovieLens dataset. I believe it is not appropriate for this project; movie recommendations are based on many other aspects of the movie (actors and actresses, director, popularity, etc.) besides its plot.

m-bilgic commented 5 years ago

In the meantime, I'd like to try the Amazon dataset (a bigger pool) and the Yelp dataset. Since both datasets are big, I'm going to figure out how to process them in batches. Also, if we use a bigger dataset, I need to train with a larger batch size (instead of 1) and a different number of epochs.

Note that if we do this, we are no longer able to completely compare with the LR result.

Let me know what you think.

I'm not sure if this is necessary. I believe the following would be sufficient:

  1. IMDB movie reviews dataset we have now.
  2. A news classification domain.
  3. A paper classification domain.

I'd also like to have a high-impact domain if possible, such as legal documents and so on.

annekehdyt commented 5 years ago

In the meantime, I'd like to try the Amazon dataset (a bigger pool) and the Yelp dataset. Since both datasets are big, I'm going to figure out how to process them in batches. Also, if we use a bigger dataset, I need to train with a larger batch size (instead of 1) and a different number of epochs. Note that if we do this, we are no longer able to completely compare with the LR result. Let me know what you think.

I'm not sure if this is necessary. I believe the following would be sufficient:

  1. IMDB movie reviews dataset we have now.
  2. A news classification domain.
  3. A paper classification domain.

I'd also like to have a high-impact domain if possible, such as legal documents and so on.

I found the Legal case report dataset. I haven't looked into how many categories it has, but this is one possibility.

m-bilgic commented 5 years ago

In the meantime, I'd like to try the Amazon dataset (a bigger pool) and the Yelp dataset. Since both datasets are big, I'm going to figure out how to process them in batches. Also, if we use a bigger dataset, I need to train with a larger batch size (instead of 1) and a different number of epochs. Note that if we do this, we are no longer able to completely compare with the LR result. Let me know what you think.

I'm not sure if this is necessary. I believe the following would be sufficient:

  1. IMDB movie reviews dataset we have now.
  2. A news classification domain.
  3. A paper classification domain.

I'd also like to have a high-impact domain if possible, such as legal documents and so on.

I found the Legal case report dataset. I haven't looked into how many categories it has, but this is one possibility.

Thanks. How are you looking for datasets? I see three possibilities:

  1. Websites that host datasets
  2. Google search
  3. Google Scholar

If you haven't tried Google Scholar yet, try it. You can find papers on domains we are interested in, and you can find the dataset link/citation in the paper.

annekehdyt commented 5 years ago

How are you looking for datasets? I see three possibilities:

  1. Websites that host datasets
  2. Google search
  3. Google Scholar

If you haven't tried Google Scholar yet, try it. You can find papers on domains we are interested in, and you can find the dataset link/citation in the paper.

I use all three of those. Most of the big news classification datasets I got from papers, mostly NAACL papers that compare NLP tasks. For specific datasets, I use 2) Google search and 1) websites that host datasets.

Also, I found this very useful website, and I am following their Slack: https://paperswithcode.com/

For example, they keep track of the performance rankings on each dataset over time. Check the IMDB dataset here: state-of-the-art IMDB dataset

m-bilgic commented 5 years ago

How are you looking for datasets? I see three possibilities:

  1. Websites that host datasets
  2. Google search
  3. Google Scholar

If you haven't tried Google Scholar yet, try it. You can find papers on domains we are interested in, and you can find the dataset link/citation in the paper.

I use all three of those. Most of the big news classification datasets I got from papers, mostly NAACL papers that compare NLP tasks. For specific datasets, I use 2) Google search and 1) websites that host datasets.

Also, I found this very useful website, and I am following their Slack: https://paperswithcode.com/

For example, they keep track of the performance rankings on each dataset over time. Check the IMDB dataset here: state-of-the-art IMDB dataset

The Paperswithcode website looks pretty useful. Thanks.

Please also try a few phrases on Google Scholar. When I searched for legal document classification on Google Scholar, the top result was: https://onlinelibrary.wiley.com/doi/full/10.1002/asi.21233 It talks about information retrieval, but I'm guessing we might be able to repurpose it for classification.

Other queries on Google Scholar might reveal other papers with datasets.

annekehdyt commented 5 years ago
  1. Scientific peer reviews: binary classes (accept/reject). Consists of 14K paper drafts with accept/reject decisions from top-tier venues including ACL, NIPS, and ICLR.

Sounds good. Let's first try a logistic regression and see what the top 100 keywords are before we do more testing.

Please check:

  1. Data summary
  2. Logistic regression results with the top 100 keywords here

annekehdyt commented 5 years ago

I'll keep updating on the dataset and create notebooks for the other scientific datasets I mentioned above as well. Meanwhile, I'll put together a (cleaner) list of possible datasets on legal documents and other well-known text classification data.

mitchellzhen commented 5 years ago

For the news classification, I found these datasets:

  1. FakeNewsChallenge: news articles classified by stance
  2. Liar: political statements from PolitiFact, analyzed for truthfulness

For high impact, I thought the medical field would be a good domain, but it is hard to find text data.

annekehdyt commented 5 years ago
  1. FakeNewsChallenge: news articles classified by stance

I think we can inspect this data and first look at the top 100 words for each stance. However, this dataset is highly unbalanced; still, we can try :)

  2. Liar: political statements from PolitiFact, analyzed for truthfulness

I looked at the dataset a bit; the average length of each statement is 17 words. I am not sure we can use this, since the number of keywords would be low.

Let me know what you think.
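Both checks (class balance and average statement length) are quick to script when vetting a candidate dataset; a minimal sketch, with hypothetical inputs:

```python
from collections import Counter

def summarize(texts, labels):
    """Print the label distribution and mean token length of a corpus."""
    counts = Counter(labels)
    total = sum(counts.values())
    for label, n in counts.most_common():
        print(f"{label}: {n} ({n / total:.1%})")
    mean_len = sum(len(t.split()) for t in texts) / len(texts)
    print(f"mean length: {mean_len:.1f} tokens")
```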

m-bilgic commented 5 years ago
  1. Scientific peer reviews: binary classes (accept/reject). Consists of 14K paper drafts with accept/reject decisions from top-tier venues including ACL, NIPS, and ICLR.

Sounds good. Let's first try a logistic regression and see what the top 100 keywords are before we do more testing.

Please check:

  1. Data summary
  2. Logistic regression results with the top 100 keywords here

Thank you. As I suspected, it is difficult to figure out keywords for accept versus reject.

m-bilgic commented 5 years ago

For the news classification, I found these datasets:

  1. FakeNewsChallenge: news articles classified by stance
  2. Liar: political statements from PolitiFact, analyzed for truthfulness

Yes, these are definitely high impact, but I can't imagine what the keywords would be for "fake/not-fake" or for "lie/truth" classification.

m-bilgic commented 5 years ago

Moving forward, I think we need domains for which we can find meaningful keywords. Going back to my earlier comment:

  1. IMDB reviews - we already have this.
  2. News categorization - not super high impact, but still important because that's how news apps and websites sort news into categories, and that's how many people receive/read their news. Regarding keywords, I think we can easily find keywords for news categorization (e.g., politics vs. technology).
  3. Paper categorization - again, not super high impact, but easy to grasp. When is this used? Typically for information retrieval. Imagine a company/lab is interested in finding papers on a medical condition and would like to retrieve research about it: they use a search engine to retrieve several results and then apply automated categorization, etc. To make it easy for us to generate the keywords, we can choose a CS domain (e.g., security vs. databases). Where is the data? Well, arXiv has a lot of papers, the categories already exist, and the titles and abstracts are available. Someone has probably crawled it already, or we can crawl it ourselves.

annekehdyt commented 5 years ago
  2. News categorization - not super high impact, but still important because that's how news apps and websites sort news into categories, and that's how many people receive/read their news. Regarding keywords, I think we can easily find keywords for news categorization (e.g., politics vs. technology).

I made a summary of a couple of news categorization datasets, as follows.

  1. AGnews
  2. DBpedia <--- this is more like Wikipedia article classification

As you said, the keywords for news categories do make sense, and we can use one-vs-rest classification.

  3. Paper categorization - again, not super high impact, but easy to grasp. When is this used? Typically for information retrieval. Imagine a company/lab is interested in finding papers on a medical condition and would like to retrieve research about it: they use a search engine to retrieve several results and then apply automated categorization, etc. To make it easy for us to generate the keywords, we can choose a CS domain (e.g., security vs. databases). Where is the data? Well, arXiv has a lot of papers, the categories already exist, and the titles and abstracts are available. Someone has probably crawled it already, or we can crawl it ourselves.

I modified the paper data in PeerRead a bit. I only took the papers from the arXiv portion (three categories). Since all three categories lie in the CS area, some of the keywords seem to overlap. But this is better than the accept/reject data.

  1. Arxiv Category

Let me know what you think.

m-bilgic commented 5 years ago

A master's student worked with me last semester, and she collected data (titles and abstracts) herself from arXiv pretty easily. Can you check whether you can collect/crawl data from arXiv? Obviously, rather than getting ALL papers from arXiv, we should choose two categories (e.g., cs.AI vs cs.DB) and get a decent number of papers (I don't know how many) from each category. Check https://arxiv.org/help/bulk_data to see if it is useful.

annekehdyt commented 5 years ago

A master's student worked with me last semester, and she collected data (titles and abstracts) herself from arXiv pretty easily. Can you check whether you can collect/crawl data from arXiv? Obviously, rather than getting ALL papers from arXiv, we should choose two categories (e.g., cs.AI vs cs.DB) and get a decent number of papers (I don't know how many) from each category. Check https://arxiv.org/help/bulk_data to see if it is useful.

Yes. I tried some libraries, and we can fetch papers, even within a certain time range. :) BUT we can only fetch the metadata (title and abstract), not the full paper. If the abstract is sufficient, I can start collecting the papers.
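For reference, here is one plausible way to pull titles and abstracts from the public arXiv API with feedparser; the category, page size, and page count below are illustrative:

```python
import time
import urllib.request

import feedparser  # third-party: pip install feedparser

API = ("http://export.arxiv.org/api/query"
       "?search_query=cat:{cat}&start={start}&max_results={n}")

def fetch_page(category, start=0, n=100):
    """Fetch one page of (title, abstract) metadata for an arXiv category."""
    url = API.format(cat=category, start=start, n=n)
    feed = feedparser.parse(urllib.request.urlopen(url).read())
    return [(e.title, e.summary) for e in feed.entries]

papers = []
for start in range(0, 500, 100):  # number of pages is illustrative
    papers.extend(fetch_page("cs.AI", start=start))
    time.sleep(3)                 # be polite to the API
```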

annekehdyt commented 5 years ago

Additionally, for the news classification dataset, I am thinking that AGnews is better. It is also very commonly used as a classification baseline. What do you think?

m-bilgic commented 5 years ago

A master's student worked with me last semester, and she collected data (titles and abstracts) herself from arXiv pretty easily. Can you check whether you can collect/crawl data from arXiv? Obviously, rather than getting ALL papers from arXiv, we should choose two categories (e.g., cs.AI vs cs.DB) and get a decent number of papers (I don't know how many) from each category. Check https://arxiv.org/help/bulk_data to see if it is useful.

Yes. I tried some libraries, and we can fetch papers, even within a certain time range. :) BUT we can only fetch the metadata (title and abstract), not the full paper. If the abstract is sufficient, I can start collecting the papers.

Yes, the abstract is sufficient. Fetch the title (and maybe the author information) as well, but we'll work with the abstract.

m-bilgic commented 5 years ago

Additionally, for the news classification dataset, I am thinking that AGnews is better. It is also very commonly used as a classification baseline. What do you think?

Sounds fine; one caveat: when we do class "A" versus the "rest", it is easy to figure out what the keywords for class "A" are, but it is hard to figure out what the keywords for the "rest" are. Should we consider class "A" versus class "B", where we know keywords for both class A and class B? In that case, which classes would you choose for "A" and "B"?

annekehdyt commented 5 years ago

A master's student worked with me last semester, and she collected data (titles and abstracts) herself from arXiv pretty easily. Can you check whether you can collect/crawl data from arXiv? Obviously, rather than getting ALL papers from arXiv, we should choose two categories (e.g., cs.AI vs cs.DB) and get a decent number of papers (I don't know how many) from each category. Check https://arxiv.org/help/bulk_data to see if it is useful.

Yes. I tried some libraries, and we can fetch papers, even within a certain time range. :) BUT we can only fetch the metadata (title and abstract), not the full paper. If the abstract is sufficient, I can start collecting the papers.

Yes, the abstract is sufficient. Fetch the title (and maybe the author information) as well, but we'll work with the abstract.

OK, Professor. I will fetch data in the 2018 range. I will report the final data and continue with the experiment.

m-bilgic commented 5 years ago

A master's student worked with me last semester, and she collected data (titles and abstracts) herself from arXiv pretty easily. Can you check whether you can collect/crawl data from arXiv? Obviously, rather than getting ALL papers from arXiv, we should choose two categories (e.g., cs.AI vs cs.DB) and get a decent number of papers (I don't know how many) from each category. Check https://arxiv.org/help/bulk_data to see if it is useful.

Yes. I tried some libraries, and we can fetch papers, even within a certain time range. :) BUT we can only fetch the metadata (title and abstract), not the full paper. If the abstract is sufficient, I can start collecting the papers.

Yes, the abstract is sufficient. Fetch the title (and maybe the author information) as well, but we'll work with the abstract.

OK, Professor. I will fetch data in the 2018 range. I will report the final data and continue with the experiment.

OK, thanks. Which two classes will you fetch? Choose two for which you can recognize the keywords. Also, how many documents per class are you planning to fetch? I don't know if 2018 alone will be too few or too many.

annekehdyt commented 5 years ago

A master's student worked with me last semester, and she collected data (titles and abstracts) herself from arXiv pretty easily. Can you check whether you can collect/crawl data from arXiv? Obviously, rather than getting ALL papers from arXiv, we should choose two categories (e.g., cs.AI vs cs.DB) and get a decent number of papers (I don't know how many) from each category. Check https://arxiv.org/help/bulk_data to see if it is useful.

Yes. I tried some libraries, and we can fetch papers, even within a certain time range. :) BUT we can only fetch the metadata (title and abstract), not the full paper. If the abstract is sufficient, I can start collecting the papers.

Yes, the abstract is sufficient. Fetch the title (and maybe the author information) as well, but we'll work with the abstract.

OK, Professor. I will fetch data in the 2018 range. I will report the final data and continue with the experiment.

OK, thanks. Which two classes will you fetch? Choose two for which you can recognize the keywords. Also, how many documents per class are you planning to fetch? I don't know if 2018 alone will be too few or too many.

I am not sure either, since the physics class has around 5K papers in a span of three months. I will try to fetch 2018 and see the total number of documents.

For now, I would fetch the classes cs.AI and cs.CR (cryptography and security) and look at the keywords. But as a backup plan, I may want to fetch several classes under CS. (FYI, cs.AI has 80 papers in 2 months, so we can expect +/- 480 papers per subcategory in the 2018 time range.)

What do you think of fetching fields other than CS as well? (I am not quite sure about the keywords related to those fields.) Do you think that is worth trying?

m-bilgic commented 5 years ago

I am not sure either, since the physics class has around 5K papers in a span of three months. I will try to fetch 2018 and see the total number of documents.

Physics??

For now, I would fetch the classes cs.AI and cs.CR (cryptography and security) and look at the keywords. But as a backup plan, I may want to fetch several classes under CS.

cs.AI vs cs.CR sounds good.

(FYI, cs.AI has 80 papers in 2 months, so we can expect +/- 480 papers per subcategory in the 2018 time range.)

Do we have to choose only 2018? Can't we go back in time to fetch more papers?

What do you think of fetching fields other than CS as well? (I am not quite sure about the keywords related to those fields.) Do you think that is worth trying?

Per my earlier comment, I think we should choose "A" versus "B" where we can identify keywords for both "A" and "B" instead of "A" versus "others". In terms of A and B, I like cs.AI versus cs.CR.

We also need a dataset that is not super easy. I don't know what the accuracy on cs.AI versus cs.CR would be. If the accuracy is really high, we might choose topics that are a bit more similar, like cs.AI versus cs.DB.

annekehdyt commented 5 years ago

Do we have to choose only 2018? Can't we go back in time to fetch more papers?

No. I've collected all papers under the CS categories (each paper has its own subcategory, so we can easily extract cs.AI, cs.CR, cs.DB, etc.) for 2018. Total documents: 43,643. I am currently scraping 2015-2017; I will see how many documents we have (to make sure that each subcategory has a sufficient amount). We can choose papers from any time range; for now I am going year by year.

m-bilgic commented 5 years ago

Distribution of categories: for AGNews, sports vs. sci/tech is a viable choice; for arXiv, AI vs. crypto or ML vs. crypto sounds reasonable, but we need the numbers first.

annekehdyt commented 5 years ago
  1. AGnews dataset, training set: [class distribution image attached]

  2. arXiv papers: [class distribution image attached]

I am still working on figuring out the distribution of each category for arXiv; I will upload it today. Then I will update with the top words for class 1 vs. class 2.

annekehdyt commented 5 years ago

For the arXiv dataset, this is the complete list of categories: [image attached]

Based on the total number of papers, machine learning has the highest count, so I suggest we use Machine Learning vs. Information Theory. However, I am not sure until I print the top keywords. I will give you more updates today.

m-bilgic commented 5 years ago

Thank you. AI vs. Crypto doesn't sound bad either. Information Theory keywords can be hard to identify.