adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
http://opensource.adobe.com/NLP-Cube/index.html
Apache License 2.0

Use only one feature instead of all of them #97

Closed: Luiscri closed this issue 5 years ago

Luiscri commented 5 years ago

It would be very helpful to use just one of the features NLP-Cube offers instead of being forced to use all of them.

Due to computation times, and depending on the project, NLP-Cube would be perfect if you could run only the part you want to use instead of having to load all the models. For example, I only want to use the lemmatizer, as it works well for Spanish. I need my text to be tokenized in a specific way, and I don't want to use either the NLP-Cube tokenizer or the rest of the features.

I suggest adding another way to run NLP-Cube apart from sentences = cube(text). For example, lemmas = cube.lemmatizer(tokens) would be a nice approach, where tokens is a list containing the tokens of a sentence and the call returns another list containing the lemma of each word.

tiberiu44 commented 5 years ago

Hi @Luiscri ,

This is a really good suggestion. In fact, the initial release supported running a configurable processing pipeline. However, we later dropped this because of inter-module dependencies (e.g. the lemmatizer requires tagging and so on) and because we wanted to avoid unnecessarily complicating the API.

We are working on building version 2.0 and will consider your suggestion, given that your issue suggests that there is a real use-case for this.

@dumitrescustefan , what do you think?

Best, Tibi

dumitrescustefan commented 5 years ago

@Luiscri thanks for the report. Please see the load function in the api:

def load(self, language_code, version="latest", local_models_repository=None, local_embeddings_file=None, tokenization=True, compound_word_expanding=False, tagging=True, lemmatization=True, parsing=True):

If you want to use only the lemmatizer, call load("es", parsing=False). This will disable the parser. If you already have the text tokenized, put it in the same format the tokenizer would output (an array of ConllEntry) and then call load("es", tokenization=False, parsing=False). Note, however, that the lemmatizer needs the tagger to work (words with different POSes lemmatize differently), so if you specify tagging=False in the load call, it will be ignored and the tagger will be loaded anyway whenever the lemmatizer is enabled. Hope that answers your question. As Tibi said, we're working on the 2.0 models, but it will take a bit of time.
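
For example, a minimal sketch of the load call for this use case (lemmatization of pre-tokenized Spanish text), following the signature above:

from cube.api import Cube

cube = Cube(verbose=True)
# skip the tokenizer and the parser; the tagger stays loaded because the
# lemmatizer depends on it, even if tagging=False were passed
cube.load("es", tokenization=False, parsing=False)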

Thanks!

Luiscri commented 5 years ago

@tiberiu44 @dumitrescustefan Thanks for your answers and time reading my suggestion.

I understand the need to use the tagger model. However, I don't understand what I am supposed to pass as a parameter to the cube call now.

from cube.api import Cube
cube=Cube(verbose=True)
cube.load("es", tokenization=False, parsing=False)
sentences=cube(tokens)

I have tried doing this, where tokens is a list containing one token in each position. What is it supposed to be? I didn't understand what you mean by ConllEntry.

tiberiu44 commented 5 years ago

Tokens have to be a list of ConllEntry: from cube.io_utils.conll import ConllEntry

I'm currently unable to test this (I don't have my laptop with me), but the code should look like this:

from cube.api import Cube
from cube.io_utils.conll import ConllEntry
cube=Cube(verbose=True)
cube.load("es", tokenization=False, parsing=False)
words=["esto", "es", "una", "prueba", "."]
tokens=[]
for index, w in enumerate(words):
    # I hope I got the constructor right
    entry=ConllEntry(index+1, w, "_", "_", "_", "_", 0, "_", "_", "_")
    tokens.append(entry)

# I don't remember if cube  wants a list of tokens or a list of sentences that contain tokens
# so it's either the line below
sentences=cube(tokens)
# or this one
sentences=cube([tokens])

Let me know if this works.

dumitrescustefan commented 5 years ago

Cube wants a list of sentences, each sentence being a list of ConllEntry objects, so if you have just one sentence go for cube([tokens]). Otherwise, create a list of tokens for each sentence you have and append them to another list.
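
For example, a minimal sketch of building that input for two sentences, reusing the constructor call from Tibi's example above (to_entries is just a small helper):

from cube.io_utils.conll import ConllEntry

def to_entries(words):
    # one ConllEntry per word: 1-based index, everything else left as "_" / 0
    return [ConllEntry(i + 1, w, "_", "_", "_", "_", 0, "_", "_", "_")
            for i, w in enumerate(words)]

# cube is the instance loaded with tokenization=False, parsing=False as above
sentences_in = [to_entries(["esto", "es", "una", "prueba", "."]),
                to_entries(["segunda", "frase", "."])]
sentences = cube(sentences_in)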

Here's how to print the output:

sentences = cube(text)

for sentence in sentences:
    print()
    for token in sentence:
        line = ""
        line += str(token.index) + "\t"
        line += token.word + "\t"
        line += token.lemma + "\t"
        line += token.upos + "\t"
        line += token.xpos + "\t"
        line += token.attrs + "\t"
        line += str(token.head) + "\t"
        line += token.label + "\t"
        line += token.deps + "\t"
        line += token.space_after
        print(line)

You have to do the reverse and feed the sentences list as input instead of a string when you call cube(). You'll get the exact same sentences list, but with the lemma filled in for each token.

Finally, note that "_" is the default CoNLL "missing" value; that's why the fields are initialized like that in the ConllEntry object. Only the index (the position of the word in the sentence) and the link to the head (the zero in the constructor) are integers.
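
To make the field order explicit, here is a single entry with the positional arguments annotated; this assumes they map onto the token attributes printed above, in the same order:

from cube.io_utils.conll import ConllEntry

# index, word, lemma, upos, xpos, attrs, head, label, deps, space_after
# "_" marks a missing value; index and head are integers (head is left as 0 here)
entry = ConllEntry(3, "prueba", "_", "_", "_", "_", 0, "_", "_", "_")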

So pretty much what Tibi said before. We should add a notebook for this, I'll do it a bit later.

Luiscri commented 5 years ago

Hi, thanks for your answer, I finally got it working for a couple of sentences. @tiberiu44 @dumitrescustefan

I made a little script in order to see its behaviour on larger sets of data; specifically, I tried it with a dataset of 16,000 tweets. This is the script:

import os
import csv
import string
import timeit
# lemmatizer
from cube.api import Cube
from cube.io_utils.conll import ConllEntry

lemmatizer = Cube(verbose=True)
lemmatizer.load("es", tokenization=False, parsing=False)

def tokenize(text):
    lemmas = []
    words = text.split()
    words_without_links = [word for word in words if 'http' not in word]
    t = str.maketrans("'!¡?¿.,\"()", "          ")
    raw_tokens = ' '.join(words_without_links).translate(t).split()
    words = [token.strip(string.punctuation).lower() for token in raw_tokens if len(token) > 1]
    tokens = []
    for idx, word in enumerate(words):
        if '@' in word or '#' in word:
            lemmas.append(word)
        else:
            entry = ConllEntry(idx+1, word, "_", "_", "_", "_", 0, "_", "_", "_")
            tokens.append(entry)

    # lemmatizer
    sentences = lemmatizer([tokens])
    for entry in sentences[0]:
        lemmas.append(entry.lemma)
    return lemmas

directory = './detector/data/tweets/Debate/'
files_path = [os.path.join(directory, f) for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]
start_time = timeit.default_timer()

for file in files_path:
    with open(file, 'r') as input_file:
        csv_reader = csv.reader(input_file, delimiter='\t')
        header = next(csv_reader)
        text_column_index = header.index('text')
        for idx, line in enumerate(csv_reader):
            print(idx)
            tweet = line[text_column_index]
            tokenize(tweet)

elapsed = timeit.default_timer() - start_time
print('---------------------')
print(elapsed)
print('---------------------')

When it had processed around 300 tweets, I got the following error message:

Traceback (most recent call last):
  File "test.py", line 65, in <module>
    tokenize(tweet)
  File "test.py", line 48, in tokenize
    sentences = lemmatizer([tokens])
  File "/home/luis/.local/lib/python3.6/site-packages/cube/api.py", line 203, in __call__
    predicted_tags_UPOS = self._tagger[0].tag(new_sequence)
  File "/home/luis/.local/lib/python3.6/site-packages/cube/generic_networks/taggers.py", line 100, in tag
    softmax_list, aux_softmax_list = self._predict(seq)
  File "/home/luis/.local/lib/python3.6/site-packages/cube/generic_networks/taggers.py", line 151, in _predict
    char_emb, _ = self.character_network.compute_embeddings(word, runtime=runtime)
  File "/home/luis/.local/lib/python3.6/site-packages/cube/generic_networks/character_embeddings.py", line 119, in compute_embeddings
    attention = self._attend(rnn_outputs, rnn_states_fw[-1], rnn_states_bw[-1])
IndexError: list index out of range

I don't think the list index out of range error comes from the part of the code I added, as every list is iterated with a for-in statement.

Moreover, the lemmatizer([tokens]) call seems to perform a bit slowly, each lemmatization taking roughly 0.5 seconds. Is this behaviour normal?

tiberiu44 commented 5 years ago

I think you passed an empty sentence as input. The tokenizer never does this, but I see you are using your own tokenization code so this could happen.

tiberiu44 commented 5 years ago

Or you passed an empty word.

Luiscri commented 5 years ago

I found out that one of the texts that triggers the error is this one:

📷Fotografía del Día📷 _______________________________ 📅 Miércoles 17 de abril 2019 _______________________________ 📍#madrid - CTBA _______________________________ 📸Fotografía por… https://t.co/3W7hzr49Wz

What do you think may cause it?

tiberiu44 commented 5 years ago

I think this code:

words = text.split()
words_without_links = [word for word in words if 'http' not in word]
t = str.maketrans("'!¡?¿.,\"()", "          ")
raw_tokens = ' '.join(words_without_links).translate(t).split()

is likely to insert empty words

Just do a sanity check here:

for idx, word in enumerate(words):
    if '@' in word or '#' in word:
        lemmas.append(word)
    else:
        entry = ConllEntry(idx+1, word, "_", "_", "_", "_", 0, "_", "_", "_")
        tokens.append(entry)

like this:

for idx, word in enumerate(words):
    if '@' in word or '#' in word:
        lemmas.append(word)
    else:
        if word.strip() == '':
            print("Ignoring empty string")
        else:
            entry = ConllEntry(idx+1, word, "_", "_", "_", "_", 0, "_", "_", "_")
            tokens.append(entry)

and here:

    if len(tokens)>0:
        sentences = lemmatizer([tokens])
        for entry in sentences[0]:
            lemmas.append(entry.lemma)
        return lemmas
    else:
        return []

Luiscri commented 5 years ago

Thanks @tiberiu44, that was a nice catch. I did what you suggested and the error didn't appear again.

Once I had it working, I tested the script I posted on a dataset of 9,549 tweets. The total computation time was 1 hour and 11 minutes. Is this delay usual for a volume of roughly 10,000 sentences, or might it be caused by a bad implementation on my part?

tiberiu44 commented 5 years ago

The lemmatizer is probably the slowest part of NLPCube. It's a seq2seq model and it works at character level. However, 2.34 sentences per second is a little bit slow. I think there might be an issue with the DyNET distribution not using MKL.

What you can do is completely remove DyNET from your installation and follow steps 1, 2 and 3 from https://github.com/adobe/NLP-Cube/blob/master/examples/2.%20Advanced%20usage%20-%20NLP-Cube%20local%20installation.ipynb

Stop right before installing NLPCube.

Luiscri commented 5 years ago

At the start of the script it displays these DyNet messages:

[dynet] random seed: 2075362294
[dynet] allocating memory: 512MB
[dynet] memory allocation done.

Is that right?

However, I will try to remove DyNet. In order to remove it, do I have to do any additional steps apart from those three in the 'Advanced usage' guide?

Thanks for your time again @tiberiu44 .

dumitrescustefan commented 5 years ago

I guess you installed nlpcube with pip install, so just do a pip3 uninstall dynet; that should remove it. Then follow the install procedure in the tutorial.

It'd be interesting to see performance differences with and without MKL enabled on your end.

Also, don't worry about the dynet messages at the start of the script; they confirm that dynet was loaded correctly and show how much RAM it will use. You can set the RAM amount and the random seed with parameters yourself if you want to replicate experiments. Otherwise just ignore the messages.
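
A hedged sketch of how that configuration could look; the flag and parameter names below are assumed from a standard pip DyNet build and are not NLP-Cube specific:

# option 1: pass flags on the command line; DyNet parses them at import time, e.g.
#   python3 test.py --dynet-mem 1024 --dynet-seed 42
#
# option 2: set them programmatically before anything imports dynet
import dynet_config
dynet_config.set(mem=1024, random_seed=42)  # initial allocation in MB, fixed seed

from cube.api import Cube  # imports dynet under the hood, picking up the settings
cube = Cube(verbose=True)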

tiberiu44 commented 5 years ago

Good question. I think those steps will overwrite the pip installation, which is OK if you installed NLP-Cube globally (no virtualenv). You can check that everything is fine by listing the installed packages after you complete the steps. The DyNET version should be 0 (yes, I think this is a bug), not 2.1.

Luiscri commented 5 years ago

Thanks for the advice. I installed DyNet locally and this time it took ~29 minutes to process the same dataset. Do you think this is the maximum speed it can reach? Would it help if I assigned more RAM to DyNet? If so, what is the correct way to do it?

Thanks for your time @tiberiu44 @dumitrescustefan

tiberiu44 commented 5 years ago

It's good that you got better results. That means you have MKL-accelerated DyNET. I suppose it also hogged half of the CPU cores.

Assigning more RAM will not help. 512 is just the initial memory allocation. Whenever required, DyNET increases allocated memory and keeps it that way, until the process ends.

I was expecting something like a 7x increase in performance, but it might be down to the hardware you are running on. The lemmatizer is a seq2seq model with LSTMs. It is sequential in nature, but you can actually run multiple tagging/lemmatization processes in parallel on the same machine and get an increase in speed. I would split the corpus into 3 slices and run NLP-Cube in three separate processes, as in the sketch below.
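
A rough sketch of that setup (the directory and CSV handling come from the script earlier in the thread; the 3-way split is arbitrary):

import os
from multiprocessing import Process

def process_slice(paths):
    # every process loads its own copy of the models
    from cube.api import Cube
    lemmatizer = Cube(verbose=False)
    lemmatizer.load("es", tokenization=False, parsing=False)
    for path in paths:
        ...  # open the CSV and lemmatize each tweet as in the original script

if __name__ == "__main__":
    directory = './detector/data/tweets/Debate/'
    files_path = [os.path.join(directory, f) for f in os.listdir(directory)
                  if os.path.isfile(os.path.join(directory, f))]
    slices = [files_path[i::3] for i in range(3)]  # three roughly equal slices
    workers = [Process(target=process_slice, args=(s,)) for s in slices]
    for w in workers:
        w.start()
    for w in workers:
        w.join()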

Luiscri commented 5 years ago

Yes, it may be my computer since it only has 4 cores.

The last question that comes to my mind concerns this part of my code:

for idx, word in enumerate(words):
    if '@' in word or '#' in word:
        lemmas.append(word)
    else:
        entry = ConllEntry(idx+1, word, "_", "_", "_", "_", 0, "_", "_", "_")
        tokens.append(entry)

As I'm extracting the mentions and hashtags this way, would it affect the result of the tagger, since I'm altering the real composition of the sentence? I did this because I didn't want the lemmatizer to lemmatize the mentions and hashtags in the tweets, and this was the easiest way I found.

After this question I think you can close the issue. Thank you both so much for your time; you have been really helpful, and I didn't expect the extended help I received. @tiberiu44 @dumitrescustefan

tiberiu44 commented 5 years ago

No problem.

Indeed, altering the composition of the text could affect the tagger. Also, I don't think the data it was trained on (Universal Dependencies Corpus) contains any URLs, mentions or hashtags.

I think the safest way to go about this is to substitute some of the tokens you are skipping now with words that are likely to have a part of speech relevant to what you are replacing. Of course, there is also the case where certain tokens don't carry any meaning inside the sentence (e.g. 😄 👍 🔢). I think the latter tokens can easily be removed without affecting anything.
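
One possible sketch of that substitution; the placeholder word and the symbol-only check are just illustrative choices, not something NLP-Cube prescribes:

def normalize_token(word):
    # mentions/hashtags: substitute a proper-noun-like placeholder so the
    # sentence keeps a plausible structure for the tagger
    if word.startswith("@") or word.startswith("#"):
        return "Madrid"
    # emoji / symbol-only tokens: drop them entirely
    if not any(ch.isalnum() for ch in word):
        return None
    return word

words = [normalize_token(w) for w in "gran #debate con @usuario esta noche 👍".split()]
words = [w for w in words if w is not None]
# -> ['gran', 'Madrid', 'con', 'Madrid', 'esta', 'noche']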

I hope this helps and, if you need anything else, we'll gladly help.

Best, Tibi