A Software Tool for Finding Related Work in Academia
Article Information Parser (AIP) is a tool that parses, unifies, and in some cases corrects article metadata. AIP creates a PostgreSQL database that makes it easy to find related work. AIP++ extends the original tool with improved article discovery and querying capabilities.
More information on how this tool works can be found in the project report.
:fire: Please read this section before continuing :fire:
To run AIP, make sure you have Docker Desktop installed.
You also need a recent version of the database, which can be obtained by contacting us directly. You can also generate your own database by following the instructions here.
Ensure you have a recent database dump called data.backup in the services/db folder of this repository.
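The prerequisite above can be checked before building. The following is a minimal sketch, not part of AIP itself: `check_dump` is a hypothetical helper name, and the only thing taken from this README is the path services/db/data.backup. It assumes you pass the repository root as the first argument (defaulting to the current directory).

```shell
# Hypothetical helper (not shipped with AIP): verify that the database dump
# is in place before building the containers.
# Usage: check_dump [repo_root]   (repo_root defaults to the current directory)
check_dump() {
  local repo_root="${1:-.}"
  local dump="$repo_root/services/db/data.backup"

  # -s is true only if the file exists and is not empty.
  if [ -s "$dump" ]; then
    echo "Found database dump: $dump"
  else
    echo "Missing or empty database dump: $dump" >&2
    return 1
  fi
}
```

For example, running `check_dump . && docker compose up --build -d` from the repository root would only start the build when the dump is present.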
Build and run the Docker containers to run this project.
$ docker compose up --build -d
Access the application through your web browser by going to http://localhost:8000
Use docker compose up --build to rebuild your project. This is useful if you want to update the application after pulling it from git.
If you want to update the database, you will have to remove the database volume:
1. Stop the containers: docker compose down
2. Find the name of the database volume: docker volume ls
   It is prefixed with the name of the root folder (usually aip) and ends with _db_data.
3. Remove the volume: docker volume rm <volume_name>
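The volume removal steps can also be scripted. This is a sketch under stated assumptions, not a script shipped with AIP: `match_db_volume` and `reset_db_volume` are hypothetical names, and the name pattern (root folder prefix, _db_data suffix) is taken from the description above. It refuses to delete anything unless exactly one volume matches.

```shell
# Hypothetical helpers (not shipped with AIP).

# Read volume names from stdin, one per line, and print those that start with
# the given prefix and end with _db_data.
match_db_volume() {
  local prefix="$1"
  local name
  while IFS= read -r name; do
    case "$name" in
      "$prefix"*_db_data) printf '%s\n' "$name" ;;
    esac
  done
}

# Stop the containers, locate the database volume, and remove it.
# Usage: reset_db_volume [root_folder_name]   (defaults to "aip")
reset_db_volume() {
  local prefix="${1:-aip}"
  local volumes count

  docker compose down || return 1

  # "docker volume ls -q" prints volume names only.
  volumes="$(docker volume ls -q | match_db_volume "$prefix")"
  if [ -z "$volumes" ]; then
    echo "No volume matching ${prefix}*_db_data found" >&2
    return 1
  fi

  # Refuse to guess when several volumes match.
  count="$(printf '%s\n' "$volumes" | grep -c '')"
  if [ "$count" -ne 1 ]; then
    echo "Expected one matching volume, found $count:" >&2
    printf '%s\n' "$volumes" >&2
    return 1
  fi

  docker volume rm "$volumes"
}
```

After sourcing these functions from the repository root, `reset_db_volume aip` performs the three steps in order. Pass a different prefix if your root folder is not named aip.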
Managing a database of article metadata is tricky, as highlighted by an excerpt from our article introducing AIP:
Current information sources do not cover the spectrum of the systems community entirely. For example, DBLP -- which specifically focuses on computer science articles -- lacks certain venues and does not record article abstracts. Other datasets such as Semantic Scholar and AMiner have similar and other limitations. Moreover, these datasets also overlap, yet contain important information the others do not offer; they are disjoint. Our approach is to parse each dataset and filter and unify the information provided.
AIP combines three data sources, namely DBLP, Semantic Scholar, and AMiner.
DBLP is a well-known European archive that focuses on computer science and features all the top-level venues (journals and conferences). Semantic Scholar is an American project created by the Allen Institute for AI. The project aims to analyze and extract important data from scientific publications. AMiner is an Asian project that aims to provide a knowledge graph for mining academic social networks. Both AMiner and Semantic Scholar have incorporated Microsoft's Academic Graph (MAG) in their datasets nowadays.
AIP tackles several non-trivial challenges in unifying these datasets: