This repository contains the code for the proof-of-concept work of the EaT-PIM (Embedding and Transforming Procedural Instructions for Modification) project. The code allows users to reproduce and extend the results reported in the work. Please cite the paper when reporting, reproducing, or extending these results.
[Camera-ready version of ISWC 2022][Supplemental material]
This project's general goal is to extract information from procedural instructions, represent it explicitly in a flow graph, and suggest modifications or substitutions of entities that are involved in the instructions. This work focuses on the domain of cooking and identifying reasonable ingredient substitutions within specific recipes. The approach utilized in this code involves processing the natural language instructions from recipes into a flow graph representation, followed by training an embedding model that aims to capture the flow and transformation of ingredients through the recipe's steps.
Please note that this software is a research prototype, solely developed for and published as a part of the publication cited above. It will neither be maintained nor monitored in any way.
Set up and activate a Python 3.8 virtual environment, then install the requirements. The following commands can be used:

```
python3 -m venv venv/
source venv/bin/activate
pip install -r requirements.txt
```

If you encounter an error related to `bdist_wheel`, you may also need to `pip install wheel` first.

The raw recipe data should be placed at `data/RAW_recipes.csv`.
Download the appropriate language model for spaCy:

```
python -m spacy download en_core_web_trf
```
This is a transformer model, so performance will be much better if you're set up to use a GPU.
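If you want to confirm that the model and GPU setup are working, a quick check along the following lines can help. This is only an illustrative sketch; the sample sentence is arbitrary.

```python
# Quick sanity check: confirm the transformer pipeline loads and whether spaCy
# is running on the GPU. Assumes en_core_web_trf has already been downloaded.
import spacy

print("Using GPU:", spacy.prefer_gpu())  # True if a GPU was activated

nlp = spacy.load("en_core_web_trf")
doc = nlp("Chop the onions and saute them in butter until golden.")
print([(token.text, token.pos_, token.dep_) for token in doc])
```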
Install pygraphviz for visualization. The following commands are used for installation on Ubuntu:

```
sudo apt-get install graphviz graphviz-dev
pip install pygraphviz
pip install pydot
```
Installation for Windows is slightly more involved - please refer to the installation guide for more details on other systems.
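To verify that pygraphviz can find the graphviz binaries, a minimal check like the following (illustrative only, with a made-up two-node graph) should run without errors:

```python
# Minimal installation check: build a tiny graph and render it with graphviz's
# 'dot' layout engine. If this writes test_graph.png, the setup is working.
import pygraphviz as pgv

g = pgv.AGraph(directed=True)
g.add_edge("onion", "chop")
g.layout(prog="dot")
g.draw("test_graph.png")
print("wrote test_graph.png")
```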
Before running the scripts, add the `eatpim` directory to your `PYTHONPATH`, e.g.:

```
export PYTHONPATH=$PYTHONPATH:./eatpim
python ./eatpim/etl/parse_documents.py ...
```

To run spaCy on a GPU, install the matching CUDA extras (e.g., `pip install -U spacy[cuda111]` for CUDA 11.1). More details for spaCy's GPU installation can be found here.

If you encounter the error `module 'torch._C' has no attribute '_cuda_setDevice'`, it is apparently caused by spaCy incorrectly installing a CPU version of torch, and the conflicting versions cause some kind of issue. This error is apparently fairly common when installing using pip.

The workflow to parse raw recipe data into flow graphs and then train embeddings is as follows:
Step 1. Run

```
python eatpim/etl/parse_documents.py --output_dir DIRNAME --n_recipes 1000
```

specifying the output directory name and the number of recipes to parse. If no recipe count is specified, all recipes will be parsed -- approximately 230,000. Progress will be printed periodically along with the amount of time elapsed.
In the above image, we can see the progress being printed while the script parses all the ingredients in the recipes (converting them to singular form) and then parses each recipe's contents. The output is a pickle file containing the parse results, stored at `data/DIRNAME/parsed_recipes.pkl`.
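To take a quick look at the parse results, the pickle file can be loaded directly. The exact structure of the pickled object is an implementation detail of the parser, so treat the snippet below as a rough, illustrative sketch:

```python
# Illustrative: load and inspect the parse results produced in step 1.
# DIRNAME is whatever --output_dir was passed to parse_documents.py.
import pickle
from pathlib import Path

path = Path("data") / "DIRNAME" / "parsed_recipes.pkl"
with path.open("rb") as f:
    parsed = pickle.load(f)

print(type(parsed))
# Assuming the object is a dict keyed by recipe id (an assumption), peek at one entry:
if isinstance(parsed, dict):
    first_key = next(iter(parsed))
    print(first_key, type(parsed[first_key]))
```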
Step 2. Run

```
python eatpim/etl/preprocess_unique_names_and_linking.py --input_dir DIRNAME
```

to perform some preprocessing over the parse results -- namely, making connections between names and entities from FoodOn/Wikidata. Some information about the current progress and intermediate results will be printed periodically (progress info is omitted from the example image).

The above example output shows the number of ingredients, objects, and verbs that were detected in the recipes after the parsing from step 1. The script then makes links among objects, ingredients, FoodOn classes, and Wikidata classes. This step produces two new files, `data/DIRNAME/ingredient_list.json` and `data/DIRNAME/word_cleanup_linking.json`.
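Both outputs are plain JSON, so they can be inspected directly. The snippet below is only illustrative, since the precise structure of each file is determined by the preprocessing script:

```python
# Illustrative: inspect the preprocessing outputs from step 2.
import json
from pathlib import Path

data_dir = Path("data") / "DIRNAME"
with (data_dir / "ingredient_list.json").open() as f:
    ingredients = json.load(f)
with (data_dir / "word_cleanup_linking.json").open() as f:
    linking = json.load(f)

print("number of ingredient entries:", len(ingredients))
if isinstance(linking, dict):
    print("sample keys in the linking file:", list(linking)[:5])
```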
Step 3. Run

```
python eatpim/etl/transform_parse_results.py --input_dir DIRNAME --n_cpu N
```

to convert the parsed recipe data into flow graphs. Multiprocessing will be used over N processes.

Besides showing the current progress and elapsed time, once all recipes have been processed the number of flow graphs generated by each process (assuming multiprocessing was used) is printed, followed by the total number of graphs produced. This step produces two new files, `data/DIRNAME/entity_relations.json` and `data/DIRNAME/recipe_tree_data.json`.
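As a quick check, the total number of flow graphs reported by the script can be compared against the contents of the output files. The snippet below is illustrative and assumes the JSON files are keyed collections, which may not match the exact format:

```python
# Illustrative: count the flow graphs and relation entries produced in step 3.
import json
from pathlib import Path

data_dir = Path("data") / "DIRNAME"
with (data_dir / "recipe_tree_data.json").open() as f:
    recipe_trees = json.load(f)
with (data_dir / "entity_relations.json").open() as f:
    entity_relations = json.load(f)

print("flow graphs:", len(recipe_trees))
print("entity relation entries:", len(entity_relations))
```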
Step 4. Optionally, run

```
python eatpim/etl/eatpim_reformat_flowgraph_parse_results.py --input_dir DIRNAME
```

to perform some additional transformations on the flow graph data, converting it into a format suitable for running the embedding code. This script creates a new folder, `data/DIRNAME/triple_data`, containing several files relevant to training the embedding model. The script in this step also handles splitting the data into train/validation/test splits.
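For intuition, such a split typically looks like the sketch below. The ratios and shuffling here are hypothetical; the actual split logic lives in the script above.

```python
# Rough illustration of a train/validation/test split; ratios are hypothetical.
import random

def split_data(items, valid_frac=0.1, test_frac=0.1, seed=42):
    items = list(items)
    random.Random(seed).shuffle(items)
    n_valid = int(len(items) * valid_frac)
    n_test = int(len(items) * test_frac)
    valid = items[:n_valid]
    test = items[n_valid:n_valid + n_test]
    train = items[n_valid + n_test:]
    return train, valid, test

train, valid, test = split_data(range(100))
print(len(train), len(valid), len(test))  # 80 10 10
```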
Step 5. Run

```
python eatpim/embeddings/codes/run.py
```

to train the embedding model, learning embeddings for the entities and relations that occur in the recipe flow graph data. The parameters I used to run the training are as follows:

```
--do_train --cuda --data_path recipe_parsed_sm --model TransE -n 256 -b 2048 --train_triples_every_n 100 -d 200 -g 24.0 -a 1.0 -lr 0.001 --max_steps 2000000 -save models/sm_transe_retry --test_batch_size 4 -adv -cpu 1 --warm_up_steps 150000 --save_checkpoint_steps 50000 --log_steps 5000
```

A small snippet of the output produced while training is shown below; training details like the current loss, training step, and elapsed time are logged. Logs are saved to `data/DIRNAME/MODELDIR/train.log`.
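For context, the TransE model scores a (head, relation, tail) triple by how closely the head embedding plus the relation embedding lands on the tail embedding. The sketch below is a minimal illustration of that scoring idea, not the project's training code; it assumes, following common knowledge-graph-embedding conventions, that `-g` sets the margin and `-d` the embedding dimension (check `run.py --help` for the exact meanings).

```python
# Minimal sketch of the TransE scoring idea: a plausible triple satisfies
# head + relation ≈ tail, so its distance-based score is high.
import torch

def transe_score(head, relation, tail, gamma=24.0, p=1):
    # Higher score = more plausible triple; gamma acts as a margin offset.
    return gamma - torch.norm(head + relation - tail, p=p, dim=-1)

dim = 200                                  # illustrative, matching -d 200 above
h = torch.randn(dim)
r = torch.randn(dim)
t = h + r + 0.01 * torch.randn(dim)        # a near-perfect tail, for illustration
print(transe_score(h, r, t))               # close to gamma
print(transe_score(h, r, torch.randn(dim)))  # much lower
```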
Results from training, such as model checkpoints, are saved under the `-save` directory within `eatpim/embeddings/codes`. To run validation or testing, replace the `--do_train` argument with `--do_valid` or `--do_test`, respectively.

To use the trained embeddings, run

```
python eatpim/rank_subs_in_recipe.py --data_path DIRNAME --model_dir MODELDIR
```
For an example run using the trained embedding data uploaded in this repository, you can use

```
python eatpim/rank_subs_in_recipe.py --data_dir recipe_parsed_sm --model_dir models/GraphOps_recipe_parsed_sm_graph_TransE
```

to see an example of various ranking strategies for a random recipe and a random ingredient. Some examples of the outputs can be seen below.
Several different ranking schemes, and the corresponding top-10 "best" substitution options, are shown in the output.
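As a rough illustration of what one such ranking scheme can look like, the sketch below ranks candidate substitutes purely by distance between ingredient embeddings. The actual script combines several different schemes, so treat this as illustrative only; the ingredient names and vectors here are made up.

```python
# Illustrative: rank candidate substitutes by embedding distance to the
# ingredient being replaced (smaller distance = better).
import numpy as np

def rank_substitutes(target_vec, candidate_vecs, top_k=10):
    scores = {
        name: float(np.linalg.norm(vec - target_vec))
        for name, vec in candidate_vecs.items()
    }
    return sorted(scores, key=scores.get)[:top_k]

rng = np.random.default_rng(0)
emb = {name: rng.normal(size=200) for name in ["butter", "margarine", "olive oil", "lard"]}
candidates = {name: vec for name, vec in emb.items() if name != "butter"}
print(rank_substitutes(emb["butter"], candidates))
```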
Visualizations of the flow graph are also produced by this step. The above image shows an example visualization of the flow graph for the recipe whose substitutions the script is ranking.
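For a sense of how such a flow-graph visualization can be produced with pygraphviz, here is a small illustrative sketch with a made-up graph; it is not the project's plotting code, and the node styling is arbitrary.

```python
# Illustrative: draw a tiny recipe-like flow graph, with boxes for ingredients
# and ellipses for actions, left to right.
import pygraphviz as pgv

g = pgv.AGraph(directed=True, rankdir="LR")
for ingredient in ["onion", "butter"]:
    g.add_node(ingredient, shape="box")
for action in ["chop", "saute"]:
    g.add_node(action, shape="ellipse")
g.add_edge("onion", "chop")
g.add_edge("chop", "saute")
g.add_edge("butter", "saute")
g.layout(prog="dot")
g.draw("example_flowgraph.png")
```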
If the code of this project is useful in your research, we kindly ask you to cite our paper:
```
@InProceedings{EatpimISWC2022,
  author    = "Sola S. Shirai and HyeongSik Kim",
  title     = "EaT-PIM: Substituting Entities in Procedural Instructions Using Flow Graphs and Embeddings",
  booktitle = "The Semantic Web -- ISWC 2022",
  year      = "2022"
}
```
EaT-PIM is open-sourced under the AGPL-3.0 license. See the LICENSE file for details. For a list of other open source components included in EaT-PIM, see the file 3rd-party-licenses.txt.