Open benhollar opened 4 years ago
I saw this article about someone who took on a similar problem (scraping recipes from HTML), but it focuses specifically on ingredients. Those are, of course, only part of a recipe.
The NYT also has a library focused on ingredient parsing, which may be useful for structuring extracted information.
Additional NLP will be needed to structure extracted "recipe content" from HTML.
I already mentioned the NYT ingredient parser, but we may also need to make a classifier that can take a blob of text containing a recipe name, ingredients, and instructions and classify it into those parts, line by line.
An example of such an exercise (found by looking up "multi-class text classification") is detailed in this article.
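To make the line-by-line idea concrete, here is a rough sketch of what such a classifier could look like. It is a tiny multinomial naive Bayes model written with only the standard library; the training lines, the label names (`name`, `ingredient`, `instruction`), and the helpers `tokenize`, `LineClassifier`, and `classify_blob` are all made up for illustration. A real version would need a properly labeled dataset and probably an established ML library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(line):
    """Lowercase word tokens; every run of digits becomes a <num> marker."""
    tokens = re.findall(r"[a-z]+|\d+", line.lower())
    return ["<num>" if t.isdigit() else t for t in tokens]


class LineClassifier:
    """Toy multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, lines, labels):
        for line, label in zip(lines, labels):
            self.label_counts[label] += 1
            for tok in tokenize(line):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, line):
        total = sum(self.label_counts.values())
        tokens = tokenize(line)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior plus the smoothed log likelihood of each token
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify_blob(clf, blob):
    """Label each non-empty line of a text blob."""
    lines = [ln.strip() for ln in blob.splitlines() if ln.strip()]
    return [(ln, clf.predict(ln)) for ln in lines]


# Hypothetical hand-labeled training lines, far too few for real use.
TRAIN = [
    ("Classic Banana Bread", "name"),
    ("Grandma's Chicken Soup", "name"),
    ("Easy Weeknight Chili", "name"),
    ("2 cups all-purpose flour", "ingredient"),
    ("1 tsp baking soda", "ingredient"),
    ("3 ripe bananas, mashed", "ingredient"),
    ("1/2 cup butter, melted", "ingredient"),
    ("Preheat the oven to 350 degrees.", "instruction"),
    ("Mix the flour and baking soda in a bowl.", "instruction"),
    ("Bake for 60 minutes until golden.", "instruction"),
]

clf = LineClassifier().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
labeled = classify_blob(clf, "Chicken Chili\n2 cups sugar\nMix the butter in a bowl")
```

Classifying each line independently ignores useful context (ingredients tend to appear together, and the name usually comes first), so a sequence model might be worth considering later.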
I'll also jot down some thoughts on parsing JavaScript-dependent sites like Allrecipes.
We may need to execute the JS and then read the rendered HTML when creating our gold standard files. For consistency's sake, we'd probably need to do so for the whole database and then double-check that the "corrected" files still work.
A library that may help with this task is scrapy, which has plugins that would allow us to execute the JS and get the HTML we need for our dataset.
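As a rough idea of what that setup might involve, here is a settings sketch that assumes the third-party scrapy-playwright plugin. That choice is only an assumption for illustration; other rendering plugins exist and nothing has been decided here. The helper `rendered_request_meta` is a hypothetical name.

```python
# settings.py fragment (sketch): assumes the scrapy-playwright plugin is
# installed. The plugin choice is an assumption, not a decision.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def rendered_request_meta():
    """Hypothetical helper: request metadata asking the plugin to render
    the page in a browser before the spider callback receives it."""
    return {"playwright": True}
```

In a spider, a request built as `scrapy.Request(url, meta=rendered_request_meta())` should then produce a response whose body is the HTML after the JS has run, which is what we would save for the gold standard files.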
We'll need to research potential techniques, public datasets, etc. to determine which direction we should head here.
Relevant Publications:
Relevant Libraries: