Examine the data we have to determine whether the documents store their words as text or only as scanned images.
Create an environment to process the data, e.g. on LRZ, Google Colab, or a local computer.
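One rough way to make this check is to look at how much text a PDF-to-text tool returns for each page. The sketch below assumes the per-page text has already been extracted by whichever tool is chosen; the function names and the `min_chars` threshold are hypothetical and would need tuning on the real data.

```python
def classify_pages(page_texts, min_chars=50):
    """Label each page 'text' or 'scanned' from its extracted text.

    page_texts: list of strings, one per page, as produced by some
    PDF-to-text tool (assumed input, not tied to a specific library).
    min_chars: assumed threshold; a page with fewer non-whitespace
    characters is treated as image-only.
    """
    labels = []
    for text in page_texts:
        # Count only visible characters so blank pages do not look like text.
        n_chars = sum(1 for ch in text if not ch.isspace())
        labels.append("text" if n_chars >= min_chars else "scanned")
    return labels


def summarize_document(page_texts, min_chars=50):
    """Return 'text', 'scanned', 'mixed', or 'empty' for a whole document."""
    if not page_texts:
        return "empty"
    labels = classify_pages(page_texts, min_chars)
    scanned = labels.count("scanned")
    if scanned == 0:
        return "text"
    if scanned == len(labels):
        return "scanned"
    return "mixed"
```

Documents reported as "scanned" or "mixed" would probably need OCR before any further processing.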
Find a tool to convert PDF to text
Data processing techniques #16
Tokenization
Lemmatization
POS tagging
…
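As a starting point for the techniques listed above, tokenization can be sketched with the standard library alone. The regular expression and function names below are assumptions chosen for illustration; lemmatization and POS tagging are not shown, since they would normally rely on an NLP library such as spaCy or NLTK.

```python
import re
from collections import Counter

# Assumed token pattern: words (allowing inner hyphens/apostrophes),
# numbers with an optional decimal part, or single punctuation marks.
TOKEN_RE = re.compile(
    r"[^\W\d_]+(?:[-'][^\W\d_]+)*"  # words such as "state-of-the-art", "don't"
    r"|\d+(?:\.\d+)?"               # integers and decimals
    r"|[^\w\s]|_"                   # any other single non-space symbol
)


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text)


def word_frequencies(text):
    """Count lower-cased word tokens, ignoring numbers and punctuation."""
    return Counter(tok.lower() for tok in tokenize(text) if tok[0].isalpha())
```

For example, `tokenize("State-of-the-art models don't reach 99.5% accuracy.")` keeps the hyphenated compound and the contraction as single tokens and separates `%` and the final period.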
Find a way to extract structured information such as references, authors, and parts of papers (abstract, introduction, etc.), and to link papers to their DOIs. #21
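A first, heuristic attempt at this could use regular expressions on the extracted text. The sketch below is only an illustration: the DOI pattern follows a commonly used form for modern DOIs and will miss unusual ones, the heading list is an assumption about how the papers are laid out, and all names are hypothetical. Author and reference parsing is not covered.

```python
import re

# Assumed pattern for modern DOIs: "10.", a 4-9 digit prefix, "/", a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

# Assumed headings, each on its own line, optionally numbered ("1. Introduction").
SECTION_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(abstract|introduction|references)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def _clean_doi(doi):
    """Drop trailing sentence punctuation and unbalanced closing parentheses."""
    while doi:
        if doi[-1] in ".,;:":
            doi = doi[:-1]
        elif doi[-1] == ")" and doi.count("(") < doi.count(")"):
            doi = doi[:-1]
        else:
            break
    return doi


def find_dois(text):
    """Return the distinct DOIs found in text, in order of first appearance."""
    found = []
    for raw in DOI_RE.findall(text):
        doi = _clean_doi(raw)
        if doi and doi not in found:
            found.append(doi)
    return found


def split_sections(text):
    """Map each recognized heading (lower-cased) to the text that follows it."""
    matches = list(SECTION_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next recognized heading or the end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections
```

Heading-based splitting is fragile on multi-column or OCR output, so a dedicated scholarly-PDF parser may turn out to be the better option once the data has been examined.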