Raldir / FEVEROUS

Repository for Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), accepted to NeurIPS 2021 Dataset and Benchmarks and used for the FEVER Workshop Shared Task at EMNLP2021.
Apache License 2.0
67 stars 20 forks source link

Bugfix/pkg import paths #18

Closed creisle closed 2 years ago

creisle commented 3 years ago

Saw this error running the lastest version (0.0.3)

Traceback (most recent call last):
  File "src/feverous/baseline/predictor/train_verdict_predictor.py", line 19, in <module>
    from utils.annotation_processor import AnnotationProcessor, EvidenceType
ModuleNotFoundError: No module named 'utils'

This PR fixes the above path bugs. Now that the src code is being packed under a feverous folder, any global imports should start with "feverous." so that python can find the submodules etc.

There's a mix of global and relative path imports in the code base so I wasn't sure which one you wanted, I've gone with global imports for now since the scripts can still be run directly then

Note: This fix is required for the other PR (https://github.com/Raldir/FEVEROUS/pull/17) which I have put into draft pending this PR

Raldir commented 3 years ago

Oh no, very sorry for that - should not have happened. I'll look into that asap. Thank you very much Caralyn for the efforts!

Raldir commented 3 years ago

Hi @creisle . I do not think adjusting the path is needed as long as the correct PYTHONPATH is set. I cannot see a line for this in your snakemake but I might be missing something. Using the code as shown in the README works fine on my end. Could you double-check and let me know?

creisle commented 3 years ago

Adding the PYTHONPATH setting before each command should work if the user has the code locally and doesn't install from pip. However, the motivation for me to develop the packaging step originally was to avoid needing to manipulate the path/pythonpath environment variable(s). I haven't come across any other python packages that require setting the PYTHONPATH. The above PR changes means you won't need the PYTHONPATH step and it should work both locally and as a package install from pip

Raldir commented 3 years ago

I see. do you mind sharing the command you are trying to use that is failing? Running python -m feverous.baseline.predictor.evaluate_verdict_predictor --input_path data/dev.combined.not_precomputed.p5.s5.t3.cells.jsonl --wiki_path data/feverous_wikiv1.db --model_path models/feverous_verdict_predictor seems to work fine on my end.

creisle commented 3 years ago

I hadn't tried it with python -m, I had just been running it with python /path/to/script.py, I made a fresh clone and gave the former a try as well but I am still running into import errors in the various scripts

  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/projects/creisle_prj/git/feverous_dev/src/feverous/baseline/retriever/document_entity_tfidf_ir.py", line 4, in <module>
    from baseline.drqa import retriever
ModuleNotFoundError: No module named 'baseline'

module command below

python -m feverous.baseline.retriever.document_entity_tfidf_ir  --model 'data/index/feverous-wiki-docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz' --db data/feverous-wiki-docs.db --count 5 --split dev --data_path data/

Note I am running this with python3.7 on ubuntu 20.04.

I'm not sure how python is able to find the "baseline" module on your end, generally python looks for imports at the top level unless they are prefixed with a '.'. Maybe I am doing something wrong

Could you try with a fresh python and/or conda install and see if you still don't see the error?

Also feel free to decline the PR if you want to address this in a different way, I just went for the soln that had the smallest number of changes to the code, there may be better approaches

Raldir commented 3 years ago

Thank you very much for the detailed response Caralyn -- very much appreciated! Interesting, I am not able to reproduce your error. I created a new environment using conda create -n feverous_pip python=3.7. Then, installing python3.7 -m pip install feverous followed by python3.7 -m pip install cryptography==3.2 (somehow not correctly installed through the package, need to check that) and finally I can run python3.7 -m feverous.baseline.retriever.build_db --db_path data/feverous_wikiv1.db --save_path data/feverous-wiki-docs2.db and it works normally. I have repeated this step on three different systems and they all work fine. That said, I do see your point regarding the absolute path. The egg-info says that feverous is the top-level is the parent directory (because src itself is not considered a container as it does not contain an init) so this might explain how it is handled. That said, I am confused by your error. Could you try replicating the above-mentioned steps exactly and tell me if this error still occurs? I am leaving the PR open until the issue has been resolved. Thanks again for your efforts and help.

creisle commented 3 years ago

Thanks for trying that! Looking at your steps vs mine there seem to be 2 key differences

  1. conda vs regular python
  2. installing from remote pip (pip install feverous) vs installing locally (pip install .)

I can try with the latter change right away, however I can't use this for testing development changes (ex. snakemake) since it has to be publish remotely first. It should be identical to installing locally but maybe something strange is going on. I will try that one out and diff the code version it installs if it works

I don't have conda set up here as I have had a lot of troubles with it in the past, the most problematic being it caused a number of issues in unrelated development projects since it likes to globally manipulate environments/paths. So I'll try the remote pip install with python first if that's ok and then if it still has the error I'll look into conda

creisle commented 3 years ago

Good news and weird news!

The problem is with installing via pypi vs local so no need to worry about conda vs python, however....

TL;DR. Summary

Details

I took a look at what it is doing. If I cd into site-packages post-install I can see that it is installing feverous as one package, but also installing all of its subpackages as top level modules which is what is getting around the import issues.

btree feverous baseline utils database examples
feverous
|-- baseline
|   |-- drqa
|   |   |-- __init__.py
|   |   |-- retriever
|   |   |   |-- doc_db.py
|   |   |   |-- __init__.py
|   |   |   |-- simple.py
|   |   |   |-- tfidf_doc_ranker.py
|   |   |   `-- utils.py
|   |   `-- tokenizers
|   |       |-- corenlp_tokenizer.py
|   |       |-- __init__.py
|   |       |-- regexp_tokenizer.py
|   |       |-- simple_tokenizer.py
|   |       |-- spacy_tokenizer.py
|   |       `-- tokenizer.py
|   |-- drqascripts
|   |   |-- build_tfidf_lines.py
|   |   |-- build_tfidf.py
|   |   `-- __init__.py
|   |-- __init__.py
|   |-- predictor
|   |   |-- evaluate_verdict_predictor.py
|   |   |-- __init__.py
|   |   `-- train_verdict_predictor.py
|   `-- retriever
|       |-- build_db.py
|       |-- build_tfidf.py
|       |-- combine_retrieval.py
|       |-- document_entity_tfidf_ir.py
|       |-- eval_combined_retriever.py
|       |-- eval_doc_retriever.py
|       |-- eval_sentence_retriever.py
|       |-- eval_tab_retriever.py
|       |-- __init__.py
|       |-- predict_cells_from_table.py
|       |-- sentence_tfidf_drqa.py
|       |-- table_tfidf_drqa.py
|       `-- train_cell_evidence_retriever.py
|-- database
|   |-- create_wiki_db.py
|   |-- feverous_db.py
|   |-- __init__.py
|   `-- utils.py
|-- evaluation
|   |-- evaluate.py
|   |-- feverous_scorer.py
|   |-- __init__.py
|   `-- prepare_submission.py
|-- examples
|   |-- __init__.py
|   |-- read_context.py
|   |-- read_lists.py
|   `-- read_tables.py
|-- __init__.py
`-- utils
    |-- annotation_processor.py
    |-- __init__.py
    |-- log_helper.py
    |-- prepare_model_input.py
    |-- util.py
    |-- wiki_list.py
    |-- wiki_page.py
    |-- wiki_processor.py
    |-- wiki_section.py
    |-- wiki_sentence.py
    `-- wiki_table.py

baseline
|-- drqa
|   |-- __init__.py
|   |-- retriever
|   |   |-- doc_db.py
|   |   |-- __init__.py
|   |   |-- simple.py
|   |   |-- tfidf_doc_ranker.py
|   |   `-- utils.py
|   `-- tokenizers
|       |-- corenlp_tokenizer.py
|       |-- __init__.py
|       |-- regexp_tokenizer.py
|       |-- simple_tokenizer.py
|       |-- spacy_tokenizer.py
|       `-- tokenizer.py
|-- drqascripts
|   |-- build_tfidf_lines.py
|   |-- build_tfidf.py
|   `-- __init__.py
|-- __init__.py
|-- predictor
|   |-- evaluate_verdict_predictor.py
|   |-- __init__.py
|   `-- train_verdict_predictor.py
`-- retriever
    |-- build_db.py
    |-- build_tfidf.py
    |-- combine_retrieval.py
    |-- document_entity_tfidf_ir.py
    |-- eval_combined_retriever.py
    |-- eval_doc_retriever.py
    |-- eval_sentence_retriever.py
    |-- eval_tab_retriever.py
    |-- __init__.py
    |-- predict_cells_from_table.py
    |-- sentence_tfidf_drqa.py
    |-- table_tfidf_drqa.py
    `-- train_cell_evidence_retriever.py

utils
|-- annotation_processor.py
|-- __init__.py
|-- log_helper.py
|-- prepare_model_input.py
|-- util.py
|-- wiki_list.py
|-- wiki_page.py
|-- wiki_processor.py
|-- wiki_section.py
|-- wiki_sentence.py
`-- wiki_table.py

database
|-- create_wiki_db.py
|-- feverous_db.py
|-- __init__.py
`-- utils.py

examples
|-- __init__.py
|-- read_context.py
|-- read_lists.py
`-- read_tables.py

This means there is 2 copies of the files for all the subpackages being installed.

We're pretty deep in the weeds on how python import/installs work now but my best guess is that pip install is somehow magically figuring out that it needs to put the subpackages as global packages as well. What I can't figure out is why it doesn't do the same thing locally (same tree command from local install).

feverous
|-- baseline
|   |-- drqa
|   |   |-- __init__.py
|   |   |-- retriever
|   |   |   |-- doc_db.py
|   |   |   |-- __init__.py
|   |   |   |-- simple.py
|   |   |   |-- tfidf_doc_ranker.py
|   |   |   `-- utils.py
|   |   `-- tokenizers
|   |       |-- corenlp_tokenizer.py
|   |       |-- __init__.py
|   |       |-- regexp_tokenizer.py
|   |       |-- simple_tokenizer.py
|   |       |-- spacy_tokenizer.py
|   |       `-- tokenizer.py
|   |-- drqascripts
|   |   |-- build_tfidf_lines.py
|   |   |-- build_tfidf.py
|   |   `-- __init__.py
|   |-- __init__.py
|   |-- predictor
|   |   |-- evaluate_verdict_predictor.py
|   |   |-- __init__.py
|   |   `-- train_verdict_predictor.py
|   `-- retriever
|       |-- build_db.py
|       |-- build_tfidf.py
|       |-- combine_retrieval.py
|       |-- document_entity_tfidf_ir.py
|       |-- eval_combined_retriever.py
|       |-- eval_doc_retriever.py
|       |-- eval_sentence_retriever.py
|       |-- eval_tab_retriever.py
|       |-- __init__.py
|       |-- predict_cells_from_table.py
|       |-- sentence_tfidf_drqa.py
|       |-- table_tfidf_drqa.py
|       `-- train_cell_evidence_retriever.py
|-- database
|   |-- create_wiki_db.py
|   |-- feverous_db.py
|   |-- __init__.py
|   `-- utils.py
|-- evaluation
|   |-- evaluate.py
|   |-- feverous_scorer.py
|   |-- __init__.py
|   `-- prepare_submission.py
|-- examples
|   |-- __init__.py
|   |-- read_context.py
|   |-- read_lists.py
|   `-- read_tables.py
|-- __init__.py
`-- utils
    |-- annotation_processor.py
    |-- __init__.py
    |-- log_helper.py
    |-- prepare_model_input.py
    |-- util.py
    |-- wiki_list.py
    |-- wiki_page.py
    |-- wiki_processor.py
    |-- wiki_section.py
    |-- wiki_sentence.py
    `-- wiki_table.py

There may be some downsides to having the subpackages under feverous as well as top-level, they have pretty generic names so it might end up conflicting with other packages eventually.

Unrelated and not a big deal (just makes the package size a little bigger than it needs to be) but it might be good to add the pyc/pycache files to the ignore list in the MANIFEST.in

When I go to look record file (venv3.7_ubuntu/lib/python3.7/site-packages/feverous-0.0.3.dist-info/RECORD) for the pypi pip installed version it shows a bunch of cache files were copied but also confirms that the subpackages were installed at top level and under the main package.