Open Timoeller opened 4 years ago
I think discovering and extracting the questions is easy enough: look for sentences that end with a question mark, and maybe also search those sentences for frequently used keywords like "corona, protect, safety, cure, home office...". Then compare the XPaths of the matches to build an XPath pattern that can identify all the questions, even those that do not contain any of the keywords.
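To make the idea concrete, here is a minimal sketch using only the Python standard library. It uses a plain tag path (such as `/html/body/div/h3`) as a simplified stand-in for a real XPath, and the names `TextPathParser`, `find_questions` and `KEYWORDS` are made up for illustration. It assumes each question sits in its own element without inline markup splitting the text.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical keyword list, taken from the examples in the comment above.
KEYWORDS = ("corona", "protect", "safety", "cure", "home office")
# Tags that never get a closing tag and so must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextPathParser(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append(("/" + "/".join(self.stack), text))


def find_questions(html):
    """Return the text of all nodes that share the dominant question path."""
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    # Seed with nodes that end in "?" and mention a keyword.
    seeds = [(p, t) for p, t in parser.nodes
             if t.endswith("?") and any(k in t.lower() for k in KEYWORDS)]
    if not seeds:
        # Fall back to the question mark alone.
        seeds = [(p, t) for p, t in parser.nodes if t.endswith("?")]
    if not seeds:
        return []
    # Generalize: the most frequent path among the seeds marks a question.
    common_path = Counter(p for p, _ in seeds).most_common(1)[0][0]
    return [t for p, t in parser.nodes if p == common_path]
```

Because the final step selects by path only, a heading in the same position that is not a question would also be returned, so some filtering afterwards is probably still needed.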
Detecting answers might be more tricky. Usually you would expect an answer to come after the question it belongs to and before the next question. However, sometimes there is also an overview of the questions with anchor links to the actual Q&A entries, so looking for anchor links around a question could be one way to avoid scraping failures. The content between two questions should contain the answer text, but it might also contain a lot of other fragments that we want to avoid. Cleaning out those unwanted fragments might be the hardest challenge, I guess.
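A rough sketch of the "content between two questions" approach, again standard library only. The names `OrderedTextParser` and `pair_answers` are hypothetical, and the sketch assumes the question texts are already known and that overview entries are `<a>` elements whose `href` starts with `#`.

```python
from html.parser import HTMLParser


class OrderedTextParser(HTMLParser):
    """Record text nodes in document order, flagging in-page anchor links."""

    def __init__(self):
        super().__init__()
        self.in_page_link = False
        self.nodes = []  # list of (text, inside_in_page_link)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.in_page_link = href.startswith("#")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_page_link = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append((text, self.in_page_link))


def pair_answers(html, questions):
    """Map each known question to the text found before the next question."""
    parser = OrderedTextParser()
    parser.feed(html)
    parser.close()
    wanted = set(questions)
    pairs, current = {}, None
    for text, in_page_link in parser.nodes:
        if in_page_link:
            # Overview entry or "back to top" link, not Q&A content.
            continue
        if text in wanted:
            current = text
            pairs[current] = []
        elif current is not None:
            pairs[current].append(text)
    return {q: " ".join(parts) for q, parts in pairs.items()}
```

Everything after the last question (footers, share links and so on) still ends up in the last answer, which is exactly the fragment-cleaning problem described above.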
I will start working on this. I just got all the pieces working with PyCharm (I have been an Emacs person until now). If someone has experience with web crawling and would like to partner, do reach out!
I thought about using a computer vision model like YOLOv3 to segment the FAQ section of the page into questions and their answers, but I'm not sure whether it is worth it or whether it has drawbacks. Any suggestions are welcome.
For now we have individual scrapers for each site. Adding more sites is a very manual and slow process and existing scrapers fail when the site changes slightly. See individual scrapers here.
Automatic Scraper
We need a scraper that takes in a URL to an FAQ page and automatically extracts questions and answers in a structured way. The scraper might need some NLP based question detection to identify which parts need to be extracted. For some pseudo code see here.
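As a starting point for the "structured way" part, here is a small sketch of what the output records could look like. It is not the linked pseudo code. The word-list heuristic in `looks_like_question` is only a placeholder for proper NLP based question detection, and `FAQEntry`, `to_records` and the example URL are assumptions made for illustration.

```python
import re
from dataclasses import asdict, dataclass

# Placeholder heuristic: common English words that open a question.
QUESTION_STARTERS = (
    "who", "what", "when", "where", "why", "how", "which",
    "can", "could", "should", "is", "are", "do", "does", "will", "may",
)


def looks_like_question(text):
    """Cheap stand-in for NLP based question detection."""
    text = text.strip()
    if not text:
        return False
    if text.endswith("?"):
        return True
    # Headings often drop the question mark, so check the first word too.
    first_word = re.split(r"\W+", text.lower(), maxsplit=1)[0]
    return first_word in QUESTION_STARTERS


@dataclass
class FAQEntry:
    source_url: str
    question: str
    answer: str


def to_records(source_url, pairs):
    """Turn (question, answer) pairs into plain dicts, dropping non-questions."""
    return [
        asdict(FAQEntry(source_url, question, answer))
        for question, answer in pairs
        if looks_like_question(question)
    ]
```

The first-word check produces false positives on imperatives such as "Do not travel", which is one reason a trained classifier may be the better choice here.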
Datasources
We can curate a sheet of official FAQ pages and crawl relevant information more quickly. That way the community can check the validity of the source FAQ and whether the extraction worked.