MartinoMensio / spacy-universal-sentence-encoder

Google USE (Universal Sentence Encoder) for spaCy
MIT License

Deadlock when using similarity with a multiprocessing Pool #6

Open ruidaiphd opened 4 years ago

ruidaiphd commented 4 years ago

Hello,

I really like the idea of combining spaCy with Google's models, thanks! I am running into a seemingly random deadlock problem...

I am basically doing n*(n-1)/2 comparisons on 50K papers by their titles and abstracts on a server with 32 cores. I tried single-threaded and found no problems across a few tens of thousands of pairs, but the deadlock happens almost right away when I run the following code. Do you have any suggestions? BTW, I also tried processes=1... it does not work either...

from multiprocessing import Pool
from tqdm import tqdm

def simCount(row):
    # compare the texts in columns 1 and 4, carrying the id columns through
    return [row[0], row[3], row[2], row[5], nlp(row[1]).similarity(nlp(row[4]))]

with Pool(processes=25) as p:
    with tqdm(total=count, desc='Testing') as pbar:
        for idx_left, row_left in _sim_tst.iterrows():
            # ... some pandas frame arrangement ...
            for simscore in p.imap_unordered(simCount, _4sim.values.tolist()):
                ssrn_simscore.append(simscore)
                pbar.update()

Many thanks!

MartinoMensio commented 4 years ago

Hi @ray4wit, thank you for reporting this issue.

I think the problem is that the attributes this library uses are not serialisable. With a process pool, after the pool is created, arguments are passed to the workers by means of serialisation/deserialisation, because the processes have independent memory.
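As a generic illustration (this is not this library's actual internals; the DocLike class and its hook attribute are invented for the example), any object that holds non-picklable state, such as a live model handle, fails the moment it has to cross the process boundary:

import pickle
import threading

class DocLike:
    # stand-in for a Doc carrying a reference to live, non-picklable state
    def __init__(self):
        self.hook = threading.Lock()  # plays the role of a live model reference

try:
    pickle.dumps(DocLike())  # this is what a Pool does with every argument and result
except TypeError as e:
    print('not serialisable:', e)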

This is a known issue with the library at the moment, and I am looking into how to make the docs work with serialisation.

At the moment I would suggest trying a ThreadPool instead of a Pool: threads share memory, so no serialisation is needed to pass arguments and results. I could not run your code snippet because it does not include details about the dataframes you use, but the change should simply be swapping the process-based Pool for a thread-based ThreadPool.

I have here an example that I used to test the ThreadPool:

import spacy
import spacy_universal_sentence_encoder
from multiprocessing.pool import ThreadPool, Pool
from tqdm import tqdm
import numpy as np
import pandas as pd

nlp = spacy_universal_sentence_encoder.load_model('en_use_md')

# some data
titles = [
    'Generating Informative Dialogue Responses with Keywords-Guided Networks',
    'Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog',
    'Spoken language understanding for social robotics',
    'Robots That Use Language: A Survey',
    'CraftAssist Instruction Parsing: Semantic Parsing for a Voxel-World Assistant',
    'A Cue Adaptive Decoder for Controllable Neural Response Generation',
    'On the Complexity in Task-oriented Spoken Dialogue Systems',
    'Dialog as a Vehicle for Lifelong Learning',
    'Should Machines Feel or Flee Emotions? User Expectations and Concerns about Emotionally Aware Chatbots',
    'Intention classification in multiturn dialogue systems with key sentences mining'
]
abstracts = [
    'Recently, open-domain dialogue systems have attracted growing attention. Most of them use the sequence-to-sequence (Seq2Seq) architecture to generate responses. However, traditional Seq2Seq-based open-domain dialogue models tend to generate generic and safe responses, which are less informative, unlike human responses. In this paper, we propose a simple but effective keywords-guided Sequence-to-Sequence model (KW-Seq2Seq) which uses keywords information as guidance to generate open-domain dialogue responses. Specifically, KW-Seq2Seq first uses a keywords decoder to predict some topic keywords, and then generates the final response under the guidance of them. Extensive experiments demonstrate that the KW-Seq2Seq model produces more informative, coherent and fluent responses, yielding substantive gain in both automatic and human evaluation metrics.',
    'In this work, we present methods for using human-robot dialog to improve language understanding for a mobile robot agent. The agent parses natural language to underlying semantic meanings and uses robotic sensors to create multi-modal models of perceptual concepts like red and heavy. The agent can be used for showing navigation routes, delivering objects to people, and relocating objects from one location to another. We use dialog clarification questions both to understand commands and to generate additional parsing training data. The agent employs opportunistic active learning to select questions about how words relate to objects, improving its understanding of perceptual concepts. We evaluated this agent on Amazon Mechanical Turk. After training on data induced from conversations, the agent reduced the number of dialog questions it asked while receiving higher usability ratings. Additionally, we demonstrated the agent on a robotic platform, where it learned new perceptual concepts on the fly while completing a real-world task.',
    'Speech understanding is a fundamental feature of social robots, since spoken language is the most natural mean of human-human communication. Providing a robot with the ability to understand human language makes it much more accessible to a wide range of users, especially for those who are not experts in the field. Speech understanding is composed of two sub-tasks. The first one is known as automatic speech recognition (ASR), which is the process of translating or transcribing an audio signal into a written text. The second one is natural language understanding (NLU), which consists in obtaining a semantic interpretation from the (previously) transcribed text. In this work, we present a speech-input natural language understanding system for social robots which has been successfully tested with the well-known HuRIC v1.2 corpus obtaining state-of-the art results. Preliminary versions of the proposed system were also tested in real scenarios during the last two editions of the RoCKIn@Home competition, where we were classified in first and second positions respectively.',
    'This paper surveys the use of natural language in robotics from a robotics point of view. To use human language, robots must map words to aspects of the physical world, mediated by the robot’s sensors and actuators. This problem differs from other natural language processing domains due to the need to ground the language into noisy percepts and physical actions. Here we describe central aspects of language use by robots, including understanding natural language requests, using language to drive learning about the physical world, and engaging in collaborative dialog with a human partner. We describe common approaches, roughly divided into learning methods, logic-based methods, and methods that focus on questions of human-robot interaction. Finally, we describe several application domains for languageusing robots.',
    'We propose a semantic parsing dataset focused on instruction-driven communication with an agent in the game Minecraft. The dataset consists of 7K human utterances and their corresponding parses. Given proper world state, the parses can be interpreted and executed in game. We report the performance of baseline models, and analyze their successes and failures.',
    'In open-domain dialogue systems, dialogue cues such as emotion, persona, and emoji can be incorporated into conversation models for strengthening the semantic relevance of generated responses. Existing neural response generation models either incorporate dialogue cue into decoder’s initial state or embed the cue indiscriminately into the state of every generated word, which may cause the gradients of the embedded cue to vanish or disturb the semantic relevance of generated words during back propagation. In this paper, we propose a Cue Adaptive Decoder (CueAD) that aims to dynamically determine the involvement of a cue at each generation step in the decoding. For this purpose, we extend the Gated Recurrent Unit (GRU) network with an adaptive cue representation for facilitating cue incorporation, in which an adaptive gating unit is utilized to decide when to incorporate cue information so that the cue can provide useful clues for enhancing the semantic relevance of the generated words. Experimental results show that CueAD outperforms state-of-the-art baselines with large margins.',
    'The recent rise of personal voice-assistants shows that research in dialogue systems has gone great lengths from its beginnings many decades ago. We argue that recent research on dialogue complexity has concentrated on already known problems and has remained rather static. We present an overview of past work and argue that increasing dialogue complexity should move again to the centre of interest of new research endeavours.',
    'Dialog systems research has primarily been focused around two main types of applications - task-oriented dialog systems that learn to use clarification to aid in understanding a goal, and open-ended dialog systems that are expected to carry out unconstrained "chit chat" conversations. However, dialog interactions can also be used to obtain various types of knowledge that can be used to improve an underlying language understanding system, or other machine learning systems that the dialog acts over. In this position paper, we present the problem of designing dialog systems that enable lifelong learning as an important challenge problem, in particular for applications involving physically situated robots. We include examples of prior work in this direction, and discuss challenges that remain to be addressed.',
    'As chatbots are becoming increasingly popular, we often wonder what users perceive as natural and socially accepted manners of interacting with them. While there are many aspects to this overall question, we focused on user expectations of their emotional characteristics. Some researchers maintain that humans should avoid engaging in emotional conversations with chatbots, while others have started building empathetic chatting machines using the latest deep learning techniques. To understand if chatbots should comprehend and display emotions, we conducted semi-structured interviews with 18 participants. Our analysis revealed their overall enthusiasm towards emotionally aware agents. The findings disclosed interesting emotional interaction patterns in daily conversations and the specific application domains where emotionally intelligent technology could improve user experience. Further, we identified key user concerns that may hinder the adoption of these chatbots. Finally, we summarized a few guidelines useful for the development of emotionally intelligent conversational agents and identified further research opportunities.',
    'The multiturn dialogue system has been prevalently used in e‐commerce websites and modern information systems, which significantly improves the efficiency of problem solving and further promotes the service quality. In a multiturn dialogue system, the problem of intention classification is a core task, as the intention of a customer is the basis of subsequent problems handling. However, traditional related methods are unsuitable for the classification of multiturn dialogues. Because traditional methods do not distinguish the importance of each sentence and concatenate all sentences in the text, which is likely to generate a model with low prediction accuracy. In this paper, we propose a method of multiturn dialogue classification based on key sentences mining. We design a keywords extraction algorithm, mining key sentences from the dialogue text. We propose an algorithm finishing the computation of the weights of each sentence. According to the sentence weight and the sentence vector, the dialogue text is transformed to a dialogue vector. The dialogue text is classified by a classifier, and the input is the dialogue vector. We conducted sufficient experiments on a real‐world dataset, evaluating the performance of the proposed method. The experimental results show that our method outperforms the related methods on a series of evaluation metrics.'
]

print(len(titles), len(abstracts))
assert len(titles) == len(abstracts)

# matrix for results
matrix = np.zeros((len(titles), len(titles)))

# create spaCy docs
titles_docs = list(nlp.pipe(titles))
abstracts_docs = list(nlp.pipe(abstracts))

# each worker compares a title with every abstract
def process_one(title_doc):
    print(f'running on {title_doc}')
    similarities = [title_doc.similarity(el) for el in abstracts_docs]
    print(f'done on {title_doc}')
    return similarities

# workers get assigned a title each
with ThreadPool(25) as pool:
    for i, result_row in enumerate(pool.imap(process_one, titles_docs)):
        matrix[i] = result_row

print(matrix)

If you really need to use a process-based Pool (for performance or any other reason), for now you would have to rely on Universal Sentence Encoder on its own, or wait for a serialisable version of this library.
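A rough sketch of the first route (the TF Hub URL and the initializer pattern are my assumptions, not something this library provides): load the encoder inside each worker, so that only plain strings and floats ever cross the process boundary.

import numpy as np
from multiprocessing import Pool

def init_worker():
    # load the model inside each worker, after the fork, so no TF state is shared
    global embed
    import tensorflow_hub as hub
    embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')

def similarity(pair):
    # embed both strings and return their cosine similarity
    a, b = embed(list(pair)).numpy()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == '__main__':
    pairs = [('a title', 'an abstract'), ('another title', 'another abstract')]
    with Pool(processes=4, initializer=init_worker) as p:
        print(p.map(similarity, pairs))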

I am trying to understand how to do it, and it looks like I need to make the values of the attributes serialisable with msgpack (https://spacy.io/usage/saving-loading#docs). Any suggestion is welcome.
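In the meantime, another workaround sketch (reusing the titles_docs and abstracts_docs from the example above): keep the Docs in the parent process and send the workers only their vectors, since plain numpy arrays pickle fine.

import numpy as np
from multiprocessing import Pool

def cosine(pair):
    a, b = pair
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# vectors are computed once, in the parent, where the model lives
title_vecs = [np.asarray(doc.vector) for doc in titles_docs]
abstract_vecs = [np.asarray(doc.vector) for doc in abstracts_docs]

with Pool(processes=4) as p:
    first_row = p.map(cosine, [(title_vecs[0], a) for a in abstract_vecs])
print(first_row)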

Martino

ruidaiphd commented 4 years ago

Thanks Martino,

I didn't think to try threads... I tested 499,500 pairs with 25 threads... it took 7 minutes with no deadlock! For my little project, this speed is good enough. I will try the full sample and let you know if I encounter anything unexpected.

Thanks for your help and for this nice tool (and other tools you develop)!

MartinoMensio commented 4 years ago

You're welcome :)

Yes, let me know if you run into anything unexpected.

Cheers