HazyResearch / evaporate

This repo contains data and code for the paper "Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes"

Sanity check: running the code as given doesn't work, even with corrected pip installs #24

Open brando90 opened 1 year ago

brando90 commented 1 year ago

To reproduce do the following:

Install all dependencies

# -- Create conda env for evaporate
conda create --name evaporate python=3.10
# conda create --name evaporate python=3.8
conda activate evaporate

# -- Evaporate code
cd ~
# cd $AFS
# echo $AFS
git clone git@github.com:brando90/evaporate.git
# ln -s /afs/cs.stanford.edu/u/brando9/evaporate $HOME/evaporate
cd ~/evaporate
pip install -e .

# -- Install missing dependencies not in setup.py
pip install tqdm openai manifest-ml beautifulsoup4 pandas cvxpy scikit-learn snorkel snorkel-metal tensorboardX  # note: `scikit-learn`, not the deprecated `sklearn` alias
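A quick way to confirm the manual installs above actually resolved (keeping in mind that some PyPI names differ from their import names, e.g. `beautifulsoup4` imports as `bs4` and `scikit-learn` imports as `sklearn`) is a small helper like this sketch:

```python
import importlib

def check_imports(modules):
    """Return the subset of `modules` that fail to import."""
    missing = []
    for mod in modules:
        try:
            importlib.import_module(mod)
        except ImportError:
            missing.append(mod)
    return missing

# Import names, not PyPI names (bs4 <- beautifulsoup4, sklearn <- scikit-learn).
print(check_imports(["tqdm", "bs4", "pandas", "cvxpy", "sklearn", "tensorboardX"]))
```

An empty list means every dependency is importable in the active conda env.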

# -- Weak supervision code
cd ~/evaporate/metal-evap
git submodule init
git submodule update
pip install -e .

# -- Manifest (to install from source, which helps you modify the set of supported models. Otherwise, ``setup.py`` installs ``manifest-ml``)
cd ~
# cd $AFS
# echo $AFS
git clone git@github.com:HazyResearch/manifest.git
# ln -s /afs/cs.stanford.edu/u/brando9/manifest $HOME/manifest
cd ~/manifest
pip install -e .

# -- Initialize Git Large File Storage (LFS), needed to clone the dataset repo
git lfs install
cd ~/evaporate/
# cd ~/data
git clone https://huggingface.co/datasets/hazyresearch/evaporate
# get data in python
# from datasets import load_dataset
# dataset = load_dataset("hazyresearch/evaporate")

Run the profiler with your OpenAI key:

# keys="PLACEHOLDER" # INSERT YOUR API KEY(S) HERE
keys=$(cat ~/data/openai_api_key.txt)
# echo $keys

cd ~/evaporate/evaporate
conda activate evaporate
# conda activate maf

# evaporate code closed ie
python ~/evaporate/evaporate/run_profiler.py \
    --data_lake fda_510ks \
    --do_end_to_end False \
    --num_attr_to_cascade 50 \
    --num_top_k_scripts 10 \
    --train_size 10 \
    --combiner_mode ws \
    --use_dynamic_backoff True \
    --KEYS ${keys}
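One thing worth double-checking while reproducing: if `run_profiler.py` declares `--do_end_to_end` and `--use_dynamic_backoff` with plain argparse `type=bool`, then passing the string `False` on the command line is parsed as truthy (any non-empty string is). Whether evaporate's parser has this issue is an assumption on my part; a common guard is a `str2bool` converter, sketched here with hypothetical flag definitions mirroring the command above:

```python
import argparse

def str2bool(value):
    """Parse 'True'/'False'-style strings into real booleans for argparse."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("true", "t", "1", "yes"):
        return True
    if value.lower() in ("false", "f", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--do_end_to_end", type=str2bool, default=True)
parser.add_argument("--use_dynamic_backoff", type=str2bool, default=True)

# With plain type=bool, "False" would come through as True.
args = parser.parse_args(["--do_end_to_end", "False"])
```

With `type=bool` instead of `str2bool`, the same invocation would silently run end-to-end mode.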

Running run_profiler.py fails with:

(evaporate) brando9@ampere1:~/evaporate$ python ~/evaporate/evaporate/run_profiler.py     --data_lake fda_510ks     --do_end_to_end False     --num_attr_to_cascade 50     --num_top_k_scripts 10     --train_size 10     --combiner_mode ws     --use_dynamic_backoff True     --KEYS ${keys}
Traceback (most recent call last):
  File "/lfs/ampere1/0/brando9/evaporate/evaporate/run_profiler.py", line 14, in <module>
    from profiler import run_profiler
  File "/afs/cs.stanford.edu/u/brando9/evaporate/evaporate/profiler.py", line 33, in <module>
    from run_ws import run_ws
ModuleNotFoundError: No module named 'run_ws'
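
A plausible workaround (a guess at the repo layout, not a confirmed fix): `profiler.py` imports `run_ws` as a top-level module, so whichever directory actually contains `run_ws.py` has to be on `sys.path` before `profiler` is imported. Something like this near the top of `run_profiler.py`, with the subdirectory names being my assumption:

```python
import os
import sys

def add_submodules_to_path(repo_root, subdirs=("evaporate", "metal-evap")):
    """Prepend repo subdirectories to sys.path so top-level imports
    like `from run_ws import run_ws` can resolve."""
    for sub in subdirs:
        candidate = os.path.join(repo_root, sub)
        if candidate not in sys.path:
            sys.path.insert(0, candidate)

add_submodules_to_path(os.path.expanduser("~/evaporate"))
```

Adjust `subdirs` to wherever `run_ws.py` actually lives in the checkout.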

May we get help?


Related issue: https://github.com/HazyResearch/evaporate/issues/21

brando90 commented 1 year ago

I have my own fork that tries to fix this basic import statement, but the code still doesn't work. It would be nice to release a version that works out of the box, at least for reproducing the original results, before we try it on our own custom data.