In this project, we explore classifying papers into categories using a paper's abstract as input. Originally, we sought to generate metadata, such as keyphrases, from the abstract of an article, but we found paper classification alone to be nontrivial. We therefore explore different architectures using various text representations and compare their performance on academic literature abstracts.
Our work is publicly available in our Google Drive workspace, linked below. One can access our notebooks and run them. Please note that our tests make use of GPUs and assume CUDA is installed in one's workspace.
Some of the notebooks contain preliminary or exploratory variations of the main experiments, and are thus largely irrelevant to outsiders. The major notebooks pertaining to experiments with LSTM/RNN architectures are
The main notebook pertaining to experiments with CNN architectures is finalCNNexperiment. This file shows how the model behaves as we reduce/increase the input size, from 50 characters all the way up to 270. The word2vec_lstm notebook used a pretrained word2vec model from Google in conjunction with an LSTM architecture to experiment with word embeddings, as opposed to the character encodings we used everywhere else.
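As a rough illustration of the character encodings mentioned above, the sketch below maps an abstract to a fixed-length vector of character indices. The alphabet, the zero-padding scheme, and the length cap are assumptions chosen for illustration, not the notebooks' exact setup.

```python
import numpy as np

# Illustrative alphabet; the notebooks' actual character set may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;:!?'-"
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 reserved for padding/unknown

def encode_abstract(text, max_len=270):
    """Map an abstract to a fixed-length vector of character indices,
    truncating to max_len and zero-padding shorter inputs."""
    indices = [CHAR_TO_IDX.get(c, 0) for c in text.lower()[:max_len]]
    indices += [0] * (max_len - len(indices))
    return np.array(indices, dtype=np.int64)

vec = encode_abstract("We classify papers by abstract.", max_len=50)
```

Varying `max_len` between 50 and 270 reproduces the kind of input-size sweep the finalCNNexperiment notebook describes.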
The .py and .json files were used to scrape arXiv for our dataset of more than 550,000 papers.
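The scraper scripts themselves are not reproduced here; as a hint of what they do, the sketch below parses the Atom XML that the arXiv API returns, pulling out the title, abstract, and primary category for each entry. The sample feed is made up, and the real scripts may request and extract different fields.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
ARXIV_NS = "{http://arxiv.org/schemas/atom}"

# Hand-written sample of an arXiv API Atom response (illustrative, not real data).
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <title>An Example Paper</title>
    <summary>This is the abstract text.</summary>
    <arxiv:primary_category term="cs.CL"/>
  </entry>
</feed>"""

def parse_feed(xml_text):
    """Yield (title, abstract, primary category) for each entry in the feed."""
    root = ET.fromstring(xml_text)
    for entry in root.iter(ATOM_NS + "entry"):
        title = entry.findtext(ATOM_NS + "title").strip()
        abstract = entry.findtext(ATOM_NS + "summary").strip()
        category = entry.find(ARXIV_NS + "primary_category").get("term")
        yield title, abstract, category

papers = list(parse_feed(SAMPLE_FEED))
```

In practice the scrapers would fetch pages of such feeds from the arXiv API and accumulate (abstract, category) pairs into the training set.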