NIHOPA / NLPre

Python library for Natural Language Preprocessing (NLPre)

Performance Issues with joblib #112

Closed ldorigo closed 5 years ago

ldorigo commented 5 years ago

Hi there,

Most likely this stems from me doing something wrong, but I am seeing ~20x slower throughput when using multiple processes as shown in the readme.

Here's my code:

from typing import List

from tqdm import tqdm
from nlpre import (
    decaps_text,
    titlecaps,
    dedash,
    unidecoder,
    token_replacement,
)
from joblib import Parallel, delayed

# Abstract is our sqlalchemy model (defined elsewhere)

parsers = [
    dedash(),
    titlecaps(),
    decaps_text(),
    unidecoder(),
    token_replacement(),
]

def normalize_abstracts(abstracts: List[Abstract]):
    def pipeline(t):
        for p in parsers:
            # One of the parsers sometimes fails
            try:
                t = p(t)
            except Exception:
                pass
        return t

    # Make an explicit list out of the abstract texts (to be sure the slowdown
    # isn't caused by some weird sqlalchemy datastructure)
    texts = [abstract.original_text for abstract in abstracts]
    # Launch the preprocessing in parallel:
    with Parallel(n_jobs=-1) as MP:
        norm_texts = MP(delayed(pipeline)(t) for t in tqdm(texts))
    # Fill the sqlalchemy objects with the preprocessed abstracts
    for index, abstract in enumerate(abstracts):
        abstract.normalized_text = norm_texts[index]

This runs at ~1.3 iterations per second according to tqdm. The equivalent non-concurrent code runs at around 20 iterations per second.
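For reference, the non-concurrent version I'm comparing against is essentially the same pipeline applied in a plain loop (reusing pipeline and texts from the snippet above):

# Serial version of the same preprocessing, no joblib involved
norm_texts = [pipeline(t) for t in tqdm(texts)]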

thoppe commented 5 years ago

Hi @ldorigo. You're not wrong in your assessment. In a major change to NLPre, we moved the backend to spaCy. Before, most of the processing was done in either pyparsing or pattern, and in those cases running the code in parallel worked well. What we've found with spaCy, however, is that it runs fairly well out of the box (in parallel!) without needing to launch joblib. In fact, the overhead joblib introduces (by pickling) creates a massive slowdown!

This is more of an issue with the docs than with the code itself. Thank you for bringing it to our attention; we will adjust accordingly.
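In the meantime, a rough sketch of what the updated docs might recommend: drop joblib entirely and apply the parsers serially, since the spaCy backend parallelizes internally (the sample text below is just a placeholder):

from nlpre import (
    dedash,
    titlecaps,
    decaps_text,
    unidecoder,
    token_replacement,
)

parsers = [
    dedash(),
    titlecaps(),
    decaps_text(),
    unidecoder(),
    token_replacement(),
]

def pipeline(text):
    # Apply each NLPre parser in sequence
    for parser in parsers:
        text = parser(text)
    return text

# Placeholder documents -- substitute your own abstracts here
texts = [
    "THE TITLE OF AN ABSTRACT. The hy- phenated text gets repaired.",
]

# A plain serial loop is all that's needed; no Parallel/delayed wrapper
normalized = [pipeline(text) for text in texts]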