benhollar / TheSpiceRack

An intelligent recipe book
Apache License 2.0

Web Scraping ML Research #3

Open benhollar opened 3 years ago

benhollar commented 3 years ago

We need to research potential techniques, public datasets, and so on to decide what direction to take here.

Relevant Publications:

Relevant Libraries:

benhollar commented 3 years ago

I saw this article about someone who took on a similar problem (scraping recipes from HTML), but it focuses specifically on ingredients. Those are, of course, only part of a recipe.

https://schollz.com/blog/ingredients/

benhollar commented 3 years ago

The NYT also has a library that focuses on ingredient parsing, which may be useful for structuring extracted information.

https://github.com/nytimes/ingredient-phrase-tagger

benhollar commented 3 years ago

Additional NLP will be needed to structure extracted "recipe content" from HTML.

I already mentioned the NYT ingredient parser, but we may also need to make a classifier that can take a blob of text containing a recipe name, ingredients and instructions and classify it into those parts, line by line.

An example of such an exercise (found by looking up "multi-class text classification") is detailed in this article.
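To make the line-by-line idea concrete, here is a minimal sketch of a multi-class line classifier: a small multinomial Naive Bayes written with only the Python standard library. The training lines, the three label names, and the `LineClassifier` and `tokenize` names are all made up for illustration. A real model would be trained on our gold standard files and would probably use a proper ML library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(line):
    """Lowercase the line and collapse every number into a single <num> token."""
    tokens = re.findall(r"[a-z]+|\d+", line.lower())
    return ["<num>" if t.isdigit() else t for t in tokens]


class LineClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, lines, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for line, label in zip(lines, labels):
            self.word_counts[label].update(tokenize(line))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, line):
        total = sum(self.label_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            s = math.log(self.label_counts[label] / total)  # log prior
            for tok in tokenize(line):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    s += math.log((counts[tok] + 1) / denom)
            return s

        return max(self.label_counts, key=score)


# Hypothetical hand-labelled lines, standing in for real gold standard data.
TRAINING = [
    ("Classic Banana Bread", "title"),
    ("Easy Weeknight Chili", "title"),
    ("2 cups all-purpose flour", "ingredient"),
    ("1 teaspoon baking soda", "ingredient"),
    ("3 ripe bananas, mashed", "ingredient"),
    ("1 pound ground beef", "ingredient"),
    ("Preheat the oven to 350 degrees.", "instruction"),
    ("Mix the flour and baking soda in a bowl.", "instruction"),
    ("Bake for 60 minutes until golden.", "instruction"),
    ("Simmer the chili for 30 minutes.", "instruction"),
]

clf = LineClassifier().fit(*zip(*TRAINING))

# Classify a scraped blob one line at a time.
blob = "Classic Chili\n2 cups sugar\nBake the bread for 45 minutes"
for text_line in blob.splitlines():
    print(clf.predict(text_line), "|", text_line)
```

Collapsing numbers into one token is a deliberate choice here: quantities are a strong hint for ingredient lines, and the exact value rarely matters for the label.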

benhollar commented 3 years ago

I'll also jot down some thoughts on parsing JavaScript-dependent sites like Allrecipes.

We may need to execute the JS and then read the resulting HTML when creating our gold standard files. For consistency's sake, we'd probably need to do so for the whole database, and then double-check that the "corrected" files still work.

A library that may help with this task is scrapy, which has plugins that would allow us to execute the JS and get the HTML we need for our dataset.
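As a rough sketch of what the plugin route could look like, the settings fragment below assumes we pick scrapy-playwright, which is my assumption and only one of several such plugins (scrapy-splash is another). It routes Scrapy's downloads through a headless browser so that the response body is the HTML after the JS has run. It is a starting point under that assumption, not a tested configuration for our project.

```python
# settings.py (fragment), assuming the scrapy-playwright plugin is installed.

# Send both HTTP and HTTPS downloads through the Playwright-backed handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

With these settings in place, a spider opts in per request by passing `meta={"playwright": True}` to `scrapy.Request`. Requests without that key are still downloaded the normal way, so static and JS-dependent sites could share one crawler.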