cbaziotis / ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two big corpora (English Wikipedia and Twitter, 330 million English tweets).
MIT License
660 stars, 91 forks

Word Statistics File not Found. | Receiving 404 error while downloading the file. #28

Closed: imVParashar closed this issue 2 years ago

imVParashar commented 2 years ago

While using the library, the word statistics file is again missing from its original source:

Please fix this as soon as possible, and please find a more robust way of hosting this file. It looks like people have faced this problem in the past as well.

Because of this issue, our production service is down. Please fix this ASAP!

Thanks in advance!

shihabshahriar16 commented 2 years ago

same issue here!

Nick18899 commented 2 years ago

We are deploying our project on Saturday, and the tokenizer has stopped working! Please fix it quickly!

zahrahnnx commented 2 years ago

Same error. Could you please help fix it?

jeremy-yuan07 commented 2 years ago

Same error, waiting for a solution. Thanks in advance.

asmhack commented 2 years ago

Uncompress the archive and put the folder into your home directory, so that the path is: ~/.ekphrasis/stats/...

https://we.tl/t-hwj94h9MMJ

yistarostin commented 2 years ago

Hi. Thank you, with your advice I managed to fix the problem mentioned above, but now there is a new one. I am using the tokenizer for Twitter with the following flags:

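from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms to normalize
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # terms to annotate ("allcaps" is disabled)
    annotate={"hashtag",  # "allcaps",
              "elongated", "repeated", "emphasis", "censored"},
    fix_html=True,  # fix HTML tokens
    segmenter="twitter",  # corpus whose word statistics drive word segmentation
    corrector="twitter",  # corpus whose word statistics drive spell correction
    # unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts=[emoticons],
)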

And now it says:

---TOKENIZING TWEETS NOW---
Reading twitter - 1grams ...
stats file not available!
An exception has occurred, use %tb to see the full traceback.

SystemExit: 1
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

Maybe the ZIP you provided doesn't contain the archive needed for tokenizing Twitter text? By the way, I am trying to make it work in Google Colab, in case that matters.

asmhack commented 2 years ago

Hi @yistarostin, a few observations on your message:

  1. That zip contains the following folders (see the screenshot), and twitter is included.
  2. You have to unzip it. It should be a folder, not a zip file.
  3. In Google Colab the home directory is /root, so please check carefully that the files are available there. It should look like /root/.ekphrasis/stats/{and here the folders from the screenshot below}

Screenshot 2021-10-07 at 12 31 39

yistarostin commented 2 years ago

Well, I repeated your steps and it worked! I guess I had accidentally unzipped to /content instead of /root. Thank you and Spasibo!
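
The /content versus /root mix-up described above can be caught with a quick check before creating the pre-processor. The following is only a minimal sketch: the helper name list_stats_folders is hypothetical (it is not part of ekphrasis), and it assumes the layout described in this thread, with one folder per corpus under ~/.ekphrasis/stats.

```python
from pathlib import Path


def list_stats_folders(home=None):
    """Return the sorted folder names found under <home>/.ekphrasis/stats.

    Hypothetical helper, not part of ekphrasis. `home` defaults to the
    current user's home directory (/root on Google Colab).
    """
    home = Path(home) if home is not None else Path.home()
    stats_dir = home / ".ekphrasis" / "stats"
    if not stats_dir.is_dir():
        # Nothing was unpacked here, e.g. the archive went to /content.
        return []
    return sorted(p.name for p in stats_dir.iterdir() if p.is_dir())
```

Calling list_stats_folders() in a Colab cell should return a non-empty list that includes twitter if the archive was unpacked in the right place; an empty list suggests the files ended up somewhere else.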

fucaja commented 2 years ago

Hi @yistarostin, I am new to GitHub. Could you explain how you got it working?

I tried running !git clone https://github.com/cbaziotis/ekphrasis.git in the /root/ folder in Colab (see the screenshot). How can I use the library?

yistarostin commented 2 years ago

@fucaja Hi. To use this or any other module, you need to install it first. To install ekphrasis, simply run pip install ekphrasis from a terminal, or !pip install ekphrasis (the same command with an exclamation mark) from a notebook cell. Technically, you can clone the repository, %cd into the repository folder and then run !pip install -e . (note the trailing dot), but that is an unusual way to install a package, as you need to know the full URL of the repository to clone it. For instance, if the repository were moved to another Git hosting platform, your code would simply stop working. So, to install a package, just run !pip install [module name].

To use the library, write

import [module name]

in your Python code. For instance, this module includes several classes; to use them, write:

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

A full example is listed in the repo's README.md (on the front page).

fucaja commented 2 years ago

Hi @yistarostin.

Using !pip install, I don't know where I should put the stats files in Colab. Could you explain that to me? Thanks in advance.

yistarostin commented 2 years ago

@fucaja As advised before, you need to put the ekphrasis dictionary files in /root/.ekphrasis. Under normal circumstances this happens automatically, but it is currently broken, which is why we are all here in this issue. So you need to manually download the .zip archive from the link mentioned in the previous comments, upload it to the /root folder in Colab, change directory to /root, and then unzip the archive.
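
The manual steps above can also be done from a notebook cell. This is a minimal sketch rather than anything ekphrasis provides: the function name unpack_stats_archive and the archive file name in the usage note are hypothetical, and it assumes the archive's top-level folder is .ekphrasis, as the earlier comments suggest. Archives with a different layout (for example one that nests the folder under home/<user>/) would need the folder moved afterwards.

```python
import zipfile
from pathlib import Path


def unpack_stats_archive(zip_path, home=None):
    """Extract a shared stats archive into the home directory.

    Hypothetical helper. Assumes the archive's top-level folder is
    `.ekphrasis`, so that extracting into <home> produces
    <home>/.ekphrasis/stats/...
    """
    home = Path(home) if home is not None else Path.home()
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(home)
    # Return the directory where the statistics are expected to be.
    return home / ".ekphrasis" / "stats"
```

For example, after uploading the archive to /root in Colab under a name of your choice, unpack_stats_archive("/root/ekphrasis.zip").is_dir() should be True if the layout assumption holds.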

fucaja commented 2 years ago

I solved the problem by changing the URL in helpers.py to a new link pointing to a repository with the stats files:

!pip install git+https://github.com/fucaja/ekphrasis.git

ycchanau commented 2 years ago

I still get the same error. Has this already been fixed?

Word statistics files not found!
Downloading...

frankniujc commented 2 years ago

Here's a version of my ~/.ekphrasis from an old installation: https://utoronto-my.sharepoint.com/:u:/g/personal/frank_niu_mail_utoronto_ca/Ed0k1JhgN8JJjmVxaBR_OzsBpMGlhhslAE9h3apvY9I_lA?e=tyZ7Nz

Unzipping it and moving home/frank/.ekphrasis to ~/.ekphrasis should solve the problem.

Note that my link is not permanent either (it is limited by my university's OneDrive/SharePoint policy). Hopefully this issue can be properly patched before the link expires.

ArlanCooper commented 2 years ago

The original URL has expired. Could you share a new URL for downloading the dataset? Thanks.

cbaziotis commented 2 years ago

Initially, I used my personal Dropbox account to host the file, as only some friends and I were using the library. It turns out that Dropbox has suspended my public links for generating excessive traffic...

I moved the data to another server and updated the public link for the stats.zip file. Please update the package and try again.

Build from source:

pip install git+https://github.com/cbaziotis/ekphrasis.git

or install from PyPI:

pip install ekphrasis -U

FYI the link is https://data.statmt.org/cbaziotis/projects/ekphrasis/stats.zip

Let me know if it works now.
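
If updating the package is not an option, the archive at the link above can also be fetched by hand. The following is a minimal sketch using only the Python standard library; the helper names fetch_archive and top_level_entries are hypothetical and not part of ekphrasis. Because the internal layout of stats.zip is not described in this thread, the sketch only downloads the file and lists its top-level entries, so you can decide where it needs to be extracted.

```python
import zipfile
from urllib.request import urlretrieve

# URL taken from the comment above.
STATS_URL = "https://data.statmt.org/cbaziotis/projects/ekphrasis/stats.zip"


def fetch_archive(url, dest):
    """Download `url` to the local path `dest` and return `dest`."""
    urlretrieve(url, dest)
    return dest


def top_level_entries(zip_path):
    """Return the sorted top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})
```

A possible use is top_level_entries(fetch_archive(STATS_URL, "stats.zip")). The download may take a while, and the result shows whether the archive already contains a stats or .ekphrasis folder or starts directly with the per-corpus folders.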