BieniekAlexander / metadataMaker

Generating metadata from arXiv article contents

metadataMaker

Introduction

In this project, we explore classifying papers into categories, using a paper's abstract as input. Originally, we sought to generate metadata, such as keyphrases, from the abstract of an article, but we found paper classification alone to be nontrivial. We therefore explore different architectures with various text representations and compare their performance on abstracts from the academic literature.


Workspace

Our work is publicly available in our Google Drive workspace, linked below, where the notebooks can be accessed and run. Please note that our tests make use of GPUs and assume CUDA is installed in your environment.

Google Drive Workspace

Some of the notebooks contain exploratory trial runs of the experiments and are thus largely irrelevant to outsiders. The major notebooks pertaining to experiments with LSTM/RNN architectures are

The main notebooks pertaining to experiments with CNN architectures are

The word2vec_lstm notebook uses a pretrained word2vec model from Google in conjunction with an LSTM architecture to run an experiment with word embeddings, as opposed to the character encodings we use everywhere else.
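To make the difference between the two input representations concrete, here is a minimal sketch in plain Python. It is not taken from the notebooks: the function names (`build_char_vocab`, `encode_chars`, `tokenize_words`), the padding and unknown ids, and the whitespace tokenizer are all assumptions chosen for illustration.

```python
from typing import Dict, List

PAD_ID = 0  # assumed id used to pad short sequences
UNK_ID = 1  # assumed id for characters not seen when building the vocabulary


def build_char_vocab(texts: List[str]) -> Dict[str, int]:
    """Map every distinct character in `texts` to an integer id starting at 2."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_chars(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Character-level encoding: truncate to `max_len`, then pad with PAD_ID."""
    ids = [vocab.get(ch, UNK_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


def tokenize_words(text: str) -> List[str]:
    """Word-level view: lowercase and split on whitespace.

    Each token would then be looked up in a pretrained embedding table
    (such as word2vec) instead of being encoded character by character.
    """
    return text.lower().split()
```

A character-level model sees the fixed-length id sequence from `encode_chars`, while a word-level model sees one embedding vector per token from `tokenize_words`. The character vocabulary is tiny and has no out-of-vocabulary words, but sequences are much longer for the same abstract.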

The .py and .json files were used to scrape arXiv for our dataset of more than 550,000 papers.
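The scraping scripts themselves are not reproduced here. As a rough illustration of how paged requests for a dataset of this size could be built, the sketch below assumes the public arXiv API query endpoint and its `search_query`, `start`, and `max_results` parameters. The helper names, the example category, and the page size are hypothetical and may differ from what our scripts actually do.

```python
from typing import List
from urllib.parse import urlencode

# Public arXiv API endpoint (assumed here; our scripts may use another route).
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category: str, start: int, max_results: int) -> str:
    """Build one query URL for papers in `category`, starting at offset `start`."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def page_urls(category: str, total: int, page_size: int) -> List[str]:
    """Split a request for `total` papers into consecutive pages of `page_size`."""
    return [
        build_query_url(category, start, min(page_size, total - start))
        for start in range(0, total, page_size)
    ]
```

Each URL returns an Atom feed that can be fetched and parsed separately. When fetching many pages, pause between requests to stay within the API's usage guidelines.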