Authors write articles on tech topics on Devopedia. In the Discussion section, they manually identify the questions that beginners are most likely to ask in order to understand the topic. The goal of this project is to develop an algorithm that identifies such questions automatically.
The algorithm could consider the following sources:
The above figure shows what Google suggests when we search for the topic "Data Preparation". The first three questions are extremely relevant. The fourth question has the same intent as the first one, so such duplicates must be ignored. Another relevant question that Google suggests is "How is data cleaning done?" It follows up on an earlier question about the data preparation process by asking for more detail. Another question, "What are data visualization tools?", is related to data preparation but is not relevant to it. Such suggestions must also be ignored.
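The duplicate filtering described above could be sketched as follows. This is only a minimal illustration, assuming that word overlap (Jaccard similarity) is an acceptable stand-in for "same intent". The stopword list, the 0.8 threshold, the function names and the sample questions other than "How is data cleaning done?" are all hypothetical. It does not attempt the harder task of rejecting related but irrelevant questions.

```python
import re

# Hypothetical stopword list; a real project would likely need a fuller one.
STOPWORDS = {
    "what", "is", "are", "the", "a", "an", "of", "in",
    "how", "do", "does", "you", "to", "for", "by", "meant",
}


def content_words(question):
    """Lowercase the question and keep only its non-stopword tokens."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return frozenset(w for w in words if w not in STOPWORDS)


def jaccard(a, b):
    """Jaccard similarity of two word sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_duplicates(questions, threshold=0.8):
    """Keep the first question of each group of near-identical questions.

    The threshold is an assumed value, not one taken from the project.
    """
    kept, kept_words = [], []
    for question in questions:
        words = content_words(question)
        if any(jaccard(words, seen) >= threshold for seen in kept_words):
            continue  # too similar to a question we already kept
        kept.append(question)
        kept_words.append(words)
    return kept


# Sample input: only the last question comes from the text above.
suggestions = [
    "What is data preparation?",
    "What are the steps in data preparation?",
    "What is meant by data preparation?",
    "How is data cleaning done?",
]
unique_questions = drop_duplicates(suggestions)
for question in unique_questions:
    print(question)
```

Word overlap is a crude proxy for intent. Two questions phrased with different vocabulary would not be detected as duplicates, so a semantic similarity measure may be a better fit in practice.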
Research tips shared on Devopedia's Author Guidelines page might help.
When trying to get information from various sources, prefer to use APIs instead of web scraping.
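As one illustration of using an API rather than scraping, the sketch below queries the public Stack Exchange API for questions on a topic. The choice of Stack Overflow as the source, the function names and the parameter values are assumptions made for this example. Other sources would need their own clients.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ROOT = "https://api.stackexchange.com/2.3"


def build_search_url(topic, site="stackoverflow", page_size=20):
    """Build a /search/advanced request URL for the given topic."""
    params = {
        "q": topic,
        "site": site,
        "order": "desc",
        "sort": "votes",
        "pagesize": page_size,
    }
    return f"{API_ROOT}/search/advanced?{urlencode(params)}"


def fetch_question_titles(topic, **kwargs):
    """Return question titles for a topic (performs a network request)."""
    request = Request(
        build_search_url(topic, **kwargs),
        headers={"Accept-Encoding": "gzip"},
    )
    with urlopen(request, timeout=10) as response:
        body = response.read()
        # The API compresses its responses; urllib does not decompress them.
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]


search_url = build_search_url("Data Preparation")
print(search_url)
```

Calling `fetch_question_titles("Data Preparation")` would perform the actual request. Unauthenticated use of the API is rate limited, so results should be cached where possible.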
The project must be implemented in Python 3 with a modular design. Provide basic documentation and examples. No user interface is expected. Selected questions can simply be displayed on the console.
Code should support the following:
The present algorithm takes a user's reputation on a source (say, Stack Overflow) and translates it into a reputation score for the question. We then use other signals derived from the same source (or from a combination of sources) to categorize the questions as Beginner, Intermediate or Expert level, and to rank the questions within each category.
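The approach described above might look roughly like the sketch below. It is an illustration only, not the project's actual implementation. The `Question` fields, the log scaling of reputation, the level thresholds and the use of view count as the ranking signal are all assumptions, and the sample data is invented.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    title: str
    asker_reputation: int  # reputation of the user who asked (hypothetical field)
    view_count: int        # assumed ranking signal


def reputation_score(question):
    """Translate the asker's reputation into a score for the question.

    Log scaling is an assumption, used to compress the wide range of
    reputation values.
    """
    return math.log10(1 + question.asker_reputation)


def level(question, beginner_max=2.0, intermediate_max=3.5):
    """Map a question to a level using assumed score thresholds."""
    score = reputation_score(question)
    if score < beginner_max:
        return "Beginner"
    if score < intermediate_max:
        return "Intermediate"
    return "Expert"


def categorize_and_rank(questions):
    """Group questions by level, then sort each group by view count."""
    buckets = defaultdict(list)
    for question in questions:
        buckets[level(question)].append(question)
    for group in buckets.values():
        group.sort(key=lambda q: q.view_count, reverse=True)
    return dict(buckets)


# Invented sample data for illustration.
sample = [
    Question("What is data preparation?", 15, 100),
    Question("How is data cleaning done?", 50, 900),
    Question("How to impute missing values at scale?", 1200, 50),
    Question("Why does my custom imputer leak across folds?", 25000, 10),
]
ranked = categorize_and_rank(sample)
for name, group in ranked.items():
    print(name, [q.title for q in group])
```

Keeping the scoring, categorization and ranking steps in separate functions lets each signal be replaced or combined with others without touching the rest.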