cbaziotis / datastories-semeval2017-task4

Deep-learning model presented in "DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis".
MIT License

Pre-processing slow for Subtask A on a dataset of 30,000 entries #38

Closed pmarcis closed 5 years ago

pmarcis commented 6 years ago

Hi!

First, thanks for sharing the code!

I am trying to run Subtask A on a different-language (Latvian) dataset of ~30,000 entries, but the pre-processing has now been running for more than a day (and has been stuck at 89% for over 8 hours). Disk usage is at 9% (why does it require so much disk I/O?), while RAM usage is only 1.5-3GB out of 16GB (i.e., there is plenty left). In terms of CPU, only one thread (out of 12) is used, at 100%. I have my own word embedding file (from fastText), which is ~1.5GB. Apart from that, I removed everything from the download directory, placed the 30,000-entry dataset file there (in the same format as the other files and named like one of the training files), and replaced the contents of the test and gold files with a dataset of 1,000 entries.

I have two questions:

1) Does the fact that English is hard-coded in the pre-processing code mean that the solution won't support other languages out of the box (i.e., without re-writing the pre-processing part)? I have already pre-processed the data externally with my own tools, so all I would expect to be done internally is splitting tokens on spaces.

2) Which part slows everything down so much, and do you think it would be reasonable to remove it?

cbaziotis commented 6 years ago

Hi @pmarcis, Can you please check whether there is an unusually long token in your dataset? In the past I encountered a similar problem, where a token in my dataset consisted of 150+ characters. I assume this issue has to do with ekphrasis (regex pattern matching).
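A quick way to check this is to scan the file for its longest whitespace-separated tokens. This is a standalone sketch, not part of the repo, and "dataset.txt" is a placeholder path:

    def longest_tokens(path, top_n=10):
        """Return the top_n longest whitespace-separated tokens in a text file."""
        tokens = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens.update(line.split())
        return sorted(tokens, key=len, reverse=True)[:top_n]

    # Print the ten longest tokens with their lengths.
    for tok in longest_tokens("dataset.txt"):
        print(len(tok), tok)

Anything in the hundreds of characters (e.g. a run of repeated punctuation or a mangled URL) is a likely culprit.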

Since you use your own preprocessing pipeline, you can simply bypass mine and feed the data to the model yourself.
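For illustration, a minimal sketch of that idea (the function and variable names here are hypothetical, not the repo's actual API): since the data is already tokenized externally, "pre-processing" reduces to whitespace splitting and mapping tokens to embedding indices.

    def texts_to_sequences(texts, word2idx, pad_idx=0, maxlen=50):
        """Convert whitespace-tokenized texts into fixed-length index sequences.
        Padding and out-of-vocabulary tokens share index 0 here for brevity."""
        sequences = []
        for text in texts:
            tokens = text.lower().split()  # external tools already did the real tokenization
            idxs = [word2idx.get(tok, pad_idx) for tok in tokens][:maxlen]
            idxs += [pad_idx] * (maxlen - len(idxs))  # pad to a fixed length
            sequences.append(idxs)
        return sequences

    # Toy usage with a hypothetical two-word vocabulary:
    word2idx = {"labi": 1, "slikti": 2}
    print(texts_to_sequences(["labi slikti labi"], word2idx, maxlen=5))
    # [[1, 2, 1, 0, 0]]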

pmarcis commented 6 years ago

Hi!

I am working on a tweet dataset. It contained one token of 90 characters; apart from that, everything looks normal. I stopped the process after 3 days: it had finished processing the 30K-entry dataset, but got stuck on the evaluation set (for a bit less than 2 days), which did not contain any abnormally long tokens.

I looked at the ekphrasis code and changed the segment function as follows (was this enough to skip the pre-processing?):

def segment(self, word):
    if word.islower():
        # return " ".join(self.find_segment(word)[1])
        return word  # bypass segmentation: return the token unchanged
    else:
        return word

This seems to have fixed the pre-processing speed issue. However, I am getting rather low precision scores out of the classifier, around 30-40% (for comparison, the LSTM implementation from mxnet-sentiment-analysis achieves 46% precision without word embeddings, and a perceptron classifier does even better at 59% without using positive/negative word lists). Am I doing something wrong? Am I right to assume that tweet IDs are not used?

gjmulder commented 6 years ago

Similar to @pmarcis, I am finding the pre-processing to be very slow when running ./models/nn_task_message.py on the competition training set in FINAL mode. It has been running for 15 hours on a 3.5GHz CPU, but seems to be stuck at 24% (10576/44613) for the last 12+ hours. Is this normal?

Update: after 24 hours of pre-processing it has logged 44% (19509/44613), so a rough estimate is ~4.4 seconds per tweet (86,400 s / 19,509 tweets). I'm going to need a lot of CPU for my 1M+ tweets!

gjmulder commented 6 years ago

After profiling ./models/nn_task_message.py, I determined that the @lru_cache decorators in ekphrasis ./classes/segmenter.py use very small cache sizes. I upped them both to a maxsize of 262144. The pre-processing now takes less than a minute (compared to days), and the memory footprint looks to be around 1.1GB.
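To see why the cache size matters, here is a toy illustration (not the actual ekphrasis source; the vocabulary and recursion are simplified stand-ins): word segmentation is a recursive search whose sub-problems repeat heavily, so a tiny lru_cache causes constant evictions and recomputation, while a large one keeps them memoized.

    from functools import lru_cache

    # Toy vocabulary; the real segmenter scores candidate splits with
    # unigram/bigram statistics instead of a membership test.
    VOCAB = {"pre", "processing", "is", "slow", "for", "long", "hashtags"}

    @lru_cache(maxsize=262144)  # the fix: a large cache instead of a tiny default
    def segment(text):
        """Recursively split text into known words; returns None if impossible."""
        if not text:
            return ()
        for i in range(len(text), 0, -1):
            head, tail = text[:i], text[i:]
            if head in VOCAB:
                rest = segment(tail)
                if rest is not None:
                    return (head,) + rest
        return None

    print(segment("processingisslowforlonghashtags"))
    # ('processing', 'is', 'slow', 'for', 'long', 'hashtags')
    print(segment.cache_info())  # hits/misses/currsize; a tiny maxsize leads to constant evictions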

davidalbertonogueira commented 6 years ago

True, I also had problems here. Pre-processing took more than 6 hours to reach 80%, and then it was stuck there for more than 10 hours until I killed the process.

I got around a 15,000x speedup (it now takes a few seconds) once I increased the lru_cache sizes from 4096 to 8192 and 10000, and uncommented some lru_cache decorators that were disabled in the pip-installed ekphrasis module.

I would seriously recommend changing this repo to include a modified ekphrasis, if @cbaziotis doesn't want to have the caches turned on in the ekphrasis repo.

cbaziotis commented 5 years ago

Ok, I increased the cache size in ekphrasis. Please upgrade:

pip install ekphrasis -U
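To confirm the upgrade took effect, you can check the installed version afterwards (a general pip check, not specific to this repo):

pip show ekphrasis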