NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

Missing files for the step-by-step NER tutorial #28

Closed ruanchaves closed 2 years ago

ruanchaves commented 2 years ago

Several files used in the step-by-step NER tutorial are missing from the data folder on the master branch, so it is currently not possible to execute every step of the tutorial.

Some examples:

The tutorial uses a spaCy CoNLL 2003 annotator, but the folder ../../data/conll2003/ does not exist in this repository:

    annotator = skweak.spacy.ModelAnnotator("conll2003", "../../data/conll2003/")

Similarly, the paths ../../data/wikidata_tokenised.json and ../../data/crunchbase.json are referenced in the tutorial, but they do not exist in this repository either.

The file conll2003_ner.py, which is imported in the tutorial, also makes reference to missing files. Some examples:

    FORM_FREQUENCIES = os.path.dirname(__file__) + "/../../data/form_frequencies.json"
    self.add_annotator(ModelAnnotator("BTC", os.path.dirname(__file__) + "/../../data/btc"))

None of these paths exist in this repository.

plison commented 2 years ago

Hi, some of the data/model files used in this walkthrough are too large to be stored in the GitHub repository (as mentioned at the top of the Jupyter notebook), but they are available for download at https://github.com/NorskRegnesentral/skweak/releases/tag/0.2.8
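
After downloading the release files, a quick check like the one below can confirm that the paths quoted in this thread resolve before the notebook is run. It is only a sketch: the helper name find_missing and the EXPECTED_PATHS list are illustrative and not part of skweak, and it assumes that the paths in the notebook and in conll2003_ner.py point to the same data folder. It does not download anything and makes no assumption about how the release assets are named or packaged.

```python
from pathlib import Path

# Paths quoted in this thread. Assumption: all of them resolve relative to
# the folder containing the tutorial notebook and conll2003_ner.py.
EXPECTED_PATHS = [
    "../../data/conll2003/",
    "../../data/wikidata_tokenised.json",
    "../../data/crunchbase.json",
    "../../data/form_frequencies.json",
    "../../data/btc",
]


def find_missing(paths, base_dir="."):
    """Return the entries of `paths` that do not exist under `base_dir`.

    Hypothetical helper for illustration; not part of the skweak API.
    """
    base = Path(base_dir)
    return [p for p in paths if not (base / p).exists()]


if __name__ == "__main__":
    missing = find_missing(EXPECTED_PATHS)
    if missing:
        print("Missing data/model paths:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected data/model paths are present.")
```

Running the script from the tutorial folder prints any paths that still need to be fetched from the release page.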