Open m-bilgic opened 5 years ago
Here are some ideas:
Please add your suggestions.
Note that the original dataset you propose can be non-binary; that's no problem. Our experiments will use binary classification; all we need to do is "one class" versus "the rest". So do not limit yourself to binary classification datasets when proposing a domain/dataset.
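For reference, collapsing a multiclass dataset into "one class" versus "the rest" is a one-line relabeling; a minimal sketch with made-up category labels:

```python
# Collapse multiclass labels into binary "one vs. rest" labels.
# The category names here are illustrative, not from any specific dataset.
labels = ["cs", "ee", "math", "cs", "physics", "cs"]

target = "cs"  # the "one class"; everything else becomes "the rest"
binary = [1 if y == target else 0 for y in labels]
print(binary)  # [1, 0, 0, 1, 0, 1]
```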
Scientific peer reviews: binary classes (accept/reject). Consists of 14K paper drafts with accept/reject decisions from top-tier venues including ACL, NIPS, and ICLR.
Abstract dataset: 7 classes (CS, EE, etc.); consists of 45K abstracts.
AGnews Corpus Dataset: one of the most popular datasets. Has 1 million news articles. Category classification.
I will post other datasets when I find more.
Another common dataset for movie recommendation is MovieLens. They have released several versions (1M, 10M, 20M) based on size. I can't find any news recommendation dataset.
In the meantime, I'd like to try the Amazon dataset (a bigger pool) and the Yelp dataset. Since both datasets are big, I'm going to figure out how to process them in batches. Also, if we use a bigger dataset, I will need to train with a batch size larger than 1 and a different number of epochs.
Note that if we do this, we will no longer be able to directly compare with the LR results.
Let me know what you think.
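Batch processing for the Amazon/Yelp dumps can be as simple as a generator over their JSON-lines files; a minimal sketch (the file path, record fields, and batch size below are placeholders):

```python
import json
import tempfile
from itertools import islice

def iter_batches(path, batch_size=1000):
    """Yield lists of parsed JSON-lines records without loading the whole file."""
    with open(path) as f:
        while True:
            batch = [json.loads(line) for line in islice(f, batch_size)]
            if not batch:
                return
            yield batch

# Tiny demo file standing in for a multi-gigabyte review dump.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for i in range(5):
        f.write(json.dumps({"review_id": i, "text": "..."}) + "\n")

sizes = [len(b) for b in iter_batches(f.name, batch_size=2)]
print(sizes)  # [2, 2, 1]
```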
- Scientific peer reviews: binary classes (accept/reject). Consists of 14K paper drafts with accept/reject decisions from top-tier venues including ACL, NIPS, and ICLR.
Sounds good. Let's first try a logistic regression and see what the top 100 keywords are before we do more testing.
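Inspecting the top keywords of a fitted logistic regression amounts to sorting its coefficient vector; a sketch with a made-up vocabulary and weights (in practice a bag-of-words vectorizer and sklearn's LogisticRegression would supply both):

```python
def top_keywords(vocab, coefs, k=3):
    """Return the k features with the largest positive weights (e.g. "accept")
    and the k most negative ones (e.g. "reject")."""
    order = sorted(range(len(coefs)), key=lambda i: coefs[i])  # ascending by weight
    positive = [vocab[i] for i in order[-k:][::-1]]
    negative = [vocab[i] for i in order[:k]]
    return positive, negative

# Illustrative vocabulary and per-feature weights, not real fitted values.
vocab = ["novel", "significant", "typo", "unclear", "rigorous", "incremental"]
coefs = [0.9, 0.7, -0.8, -1.1, 0.5, -0.4]
pos, neg = top_keywords(vocab, coefs, k=2)
print(pos, neg)  # ['novel', 'significant'] ['unclear', 'typo']
```

With a real model, `vocab` would come from the vectorizer's feature names and `coefs` from the model's fitted coefficients, with `k=100` for the top-100 inspection.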
- Abstract dataset: 7 classes (CS, EE, etc.); consists of 45K abstracts.
Sounds good. Let's pick a binary split. Options: 1) one class versus the remaining six classes, 2) one class versus another class. I like "one class" versus "another class" because the top features for "another class" would be more meaningful than the top features for the "remaining six classes", but I can be convinced otherwise. Suggestions?
- AGnews Corpus Dataset: one of the most popular datasets. Has 1 million news articles. Category classification.
What are the categories?
Another common dataset for movie recommendation is MovieLens. They have released several versions (1M, 10M, 20M) based on size. I can't find any news recommendation dataset.
I'm familiar with the MovieLens dataset. I believe it is not appropriate for this project; movie recommendations are based on many other aspects of the movie (actors and actresses, director, popularity, etc.) besides its plot.
In the meantime, I'd like to try the Amazon dataset (a bigger pool) and the Yelp dataset. Since both datasets are big, I'm going to figure out how to process them in batches. Also, if we use a bigger dataset, I will need to train with a batch size larger than 1 and a different number of epochs.
Note that if we do this, we will no longer be able to directly compare with the LR results.
Let me know what you think.
I'm not sure if this is necessary. I believe the following would be sufficient:
- IMDB movie reviews dataset we have now.
- A news classification domain.
- A paper classification domain.
I'd also like to have a high-impact domain if possible, such as legal documents and so on.
I found a Legal case report dataset. I haven't looked at how many categories they have, but this is one possibility.
Thanks. How are you looking for datasets? I see three possibilities:
- Websites that host datasets
- Google search
- Google Scholar
If you haven't tried Google Scholar yet, try it. You can find papers on domains we are interested in, and you can find the dataset link/citation in the paper.
I use all three of those. Most of the big news classification datasets I got from papers, mostly from NAACL, that compare NLP tasks. For specific datasets, I use 2) Google search and 1) websites that host datasets.
Also, I found this very useful website and I am following their Slack: https://paperswithcode.com/
For example, they keep track of the performance rankings on each dataset over time. Check the IMDB dataset here: state-of-the-art IMDB dataset
Paperswithcode website looks pretty useful. Thanks.
Please also try a few phrases on Google Scholar. When I searched for legal document classification
in Google Scholar, the following is the top result: https://onlinelibrary.wiley.com/doi/full/10.1002/asi.21233 It talks about information retrieval, but I'm guessing we might be able to repurpose it for classification.
Other queries on Google Scholar might reveal other papers with datasets.
- Scientific peer reviews: binary classes (accept/reject). Consists of 14K paper drafts with accept/reject decisions from top-tier venues including ACL, NIPS, and ICLR.
Sounds good. Let's first try a logistic regression and see what the top 100 keywords are before we do more testing.
Please check on :
I'll keep you updated on the dataset and create notebooks for the other scientific data I mentioned above as well. Meanwhile, I'll put together a (cleaner) list of possible datasets on legal documents and other well-known text classification data.
For the news classification, I found these datasets:
For high impact, I thought the medical field would be a good domain, but it is hard to find text data.
- FakeNewsChallenge: news articles classified by stance
I think we can first inspect this data and look at the top 100 words for each stance. However, this dataset is highly unbalanced; still, we can try :)
- Liar: political statements from PolitiFact analyzed by their truthfulness
I looked at the dataset a bit; the average length of each statement is 17 words. I am not sure we can use this, since the number of keywords would be low.
Let me know what you think.
Thank you. As I suspected, it is difficult to figure out keywords for accept versus reject.
For the news classification, I found these datasets:
- FakeNewsChallenge: news articles classified by stance
- Liar: political statements from PolitiFact analyzed by their truthfulness
Yes, these are definitely high impact, but I can't imagine what the keywords would be for "fake/not-fake" or for "lie/truth" classification.
Moving forward, I think we need domains for which we can find meaningful keywords. Going back to my earlier comment:
- News categorization - not super high impact, but still important, because that's how news apps and websites sort news into categories and that's how many people receive/read their news. Regarding keywords, I think we can easily find keywords for news categorization (e.g., politics vs. technology).
I made a summary of a couple of news categories as follows.
As you said, the keywords for news categories do make sense, and we can use 1-vs-rest classification.
- Paper categorization - again, not super high impact, but easy to grasp. When is this used? Typically for information retrieval. Imagine a company/lab is interested in finding papers regarding a medical condition and they'd like to retrieve research about that medical condition: they use a search engine to retrieve several results and then they use automated categorization, etc. To make it easy for us to generate the keywords, we can choose a CS domain (e.g., security vs databases). Where is the data? Well, arXiv has a lot of papers, where categories already exist, and the titles and abstracts are available. Someone has either crawled it already or we can crawl it.
I modified the paper data in PeerRead a bit: I only took papers from the arXiv portion (three categories). Since all three categories lie in the CS area, some of the keywords seem to overlap. But this is better than the accept/reject data.
Let me know what you think.
A masters student worked with me last semester and she collected data (title and abstracts) herself from arXiv pretty easily. Can you check if you can collect/crawl data from arXiv? Obviously, rather than getting ALL papers from arXiv, we should choose two categories (e.g., cs.AI vs cs.DB) and get a decent number (I don't know how many) papers from each category. Check https://arxiv.org/help/bulk_data to see if it is useful.
Yes. I tried some libraries and we can fetch the papers, even within a certain time range. :) BUT we can only fetch the metadata (title and abstract), not the full paper. If the abstract is sufficient, I can start collecting the papers.
Additionally, for the news classification dataset, I am thinking AGnews is better; it is also very commonly used as a classification baseline. What do you think?
Yes, the abstract is sufficient. Fetch the title (and maybe the author information) as well, but we'll work with the abstract.
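For the fetching itself, a minimal sketch of building a query URL for the public arXiv export API, which returns an Atom feed with titles, abstracts, and authors (pagination and feed parsing, e.g. with feedparser, are left out):

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(category, start=0, max_results=100):
    """Build an arXiv API query URL for one category. The API serves
    metadata only (title, abstract, authors), not the full paper."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

# Page through a category by stepping `start`; fetch each URL with
# urllib.request and parse the Atom response.
print(arxiv_query_url("cs.AI", start=0, max_results=200))
```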
Sounds fine; one caveat: when we do class "A" versus the "rest", it is easy to figure out what the keywords for class "A" are but it is hard to figure out what the keywords for the "rest" are. Should we consider class "A" versus class "B" where we know keywords for both class A and class B? In that case, what classes would you choose for "A" and "B"?
OK, Professor. I will fetch data in the 2018 range. I will report the final data and continue with the experiment.
OK, thanks. Which two classes will you fetch? Choose two for which you can recognize the keywords. Also, how many documents, per class, are you planning to fetch? I don't know if 2018 will be too few or too many.
I am not sure either, since the physics class has around 5K papers in a span of three months. I will try to fetch 2018 and see the total number of documents.
For now, I would fetch the classes cs.AI and cs.CR (cryptography and security) and look at the keywords. But as a backup plan, I may want to fetch several classes under CS. (FYI, cs.AI has 80 papers in 2 months, so we can have +/- 480 papers per subcategory in the 2018 time range.)
What do you think of also fetching fields other than CS? (I am not quite sure about the keywords for those fields.) Do you think it is worth trying?
I am not sure either, since the physics class has around 5K papers in a span of three months. I will try to fetch 2018 and see the total number of documents.
Physics??
For now, I would fetch the classes cs.AI and cs.CR (cryptography and security) and look at the keywords. But as a backup plan, I may want to fetch several classes under CS.
cs.AI vs cs.CR sounds good.
(FYI, cs.AI has 80 papers in 2 months, so we can have +/- 480 papers per subcategory in the 2018 time range.)
Do we have to choose only 2018? Can't we go back in time to fetch more papers?
What do you think of also fetching fields other than CS? (I am not quite sure about the keywords for those fields.) Do you think it is worth trying?
Per my earlier comment, I think we should choose "A" versus "B" where we can identify keywords for both "A" and "B" instead of "A" versus "others". In terms of A and B, I like cs.AI versus cs.CR.
We also need a dataset that is not super easy. I don't know what the accuracy on cs.AI versus cs.CR would be. If the accuracy is really high, we might choose topics that are a bit more similar, like cs.AI versus cs.DB.
Do we have to choose only 2018? Can't we go back in time to fetch more papers?
No. I've collected all papers under the cs category (each paper has its own subcategory, so we can easily extract cs.AI, cs.CR, cs.DB, etc.) for 2018. Total documents: 43,643.
I am currently scraping 2015-2017. I will see how many documents we have (to make sure that each subcategory has a sufficient amount).
We can choose papers from any time period. For now, I am going year by year.
Distribution of categories: for AGnews, sports vs. sci/tech is a viable choice; for arXiv, AI vs. crypto or ML vs. crypto sounds reasonable, but we need the numbers first.
I am still working on figuring out the distribution of each category for arXiv. I will upload it today. Then I will update with the top words for class 1 vs. class 2.
For the arXiv dataset, this is the complete list of categories:
Based on the total number of papers, machine learning has the highest count, so I suggest we use Machine Learning vs. Information Theory. However, I am not sure until I print the top keywords. I will give you more updates today.
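Getting the per-category counts is a one-liner once the metadata is in hand; a sketch with made-up category labels standing in for the scraped 2018 data:

```python
from collections import Counter

# Hypothetical per-paper primary categories; the real labels would come
# from the scraped arXiv metadata.
categories = ["cs.LG", "cs.AI", "cs.LG", "cs.CR", "cs.IT", "cs.LG", "cs.AI"]

counts = Counter(categories)
print(counts.most_common())  # [('cs.LG', 3), ('cs.AI', 2), ...]

# Keep only the two candidate classes for an A-vs-B setup.
a, b = "cs.LG", "cs.IT"
subset = [c for c in categories if c in (a, b)]
print(len(subset))  # 4
```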
Thank you. AI vs. Crypto doesn't sound bad, either. Information Theory keywords can be hard to identify.
The XAI submission had only sentiment analysis experiments. We need to find other text classification datasets besides sentiment analysis. Please write your suggestions below.