cphoover opened this issue 1 year ago
Do you have a minimal code snippet to reproduce this error? As for fixing it, I suspect there might be significant hurdles to allowing serialization of EntityCollection. From what I could find in the spaCy issue tracker, there are even issues with serializing Span objects, which EntityCollection relies upon internally.
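For context, here is a minimal sketch of why a Span resists being serialized on its own (this is not necessarily the exact code path that crashes in this pipeline, and the precise error may differ by spaCy version): a Span is only a view of its parent Doc.

import pickle
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is the capital of Germany.")
span = doc[0:2]

# Pickling the whole Doc works...
pickle.dumps(doc)

# ...but a Span is just a view of its parent Doc, so pickling it directly
# is expected to fail rather than implicitly drag the whole Doc along.
try:
    pickle.dumps(span)
except Exception as err:  # spaCy refuses this with a dedicated error
    print(type(err).__name__, err)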
Hi @cphoover and @dennlinger,
I have a minimal test set up from this morning; let me share it with you in a new branch where we can try to make the objects serialisable.
As @dennlinger mentioned, we will need to change something about the span being memorized in the EntityElement, and this will probably break code that uses the .get_span() method. I was experimenting this morning with some exotic code using weakref, but I could not restore the spans correctly. I am trying to understand from the spaCy codebase how they manage to save the references to doc without going into a loop of serialization with .to_bytes().
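For reference, a minimal sketch of what a Doc round trip with spaCy's own serialization looks like (the Span views are rebuilt from token offsets on the restored Doc rather than being written out on their own):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")

# Serialize the Doc itself; entity Spans are not serialized separately.
data = doc.to_bytes()

# Restore into a fresh Doc that shares the same vocab.
restored = Doc(nlp.vocab).from_bytes(data)

# Any Span view can then be rebuilt from token offsets on the restored Doc.
for ent in doc.ents:
    print(ent.text, "->", restored[ent.start:ent.end].text)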
Let's collaborate on this issue! I will push a new branch, multiprocessing, where we will try to address it (as soon as I resolve the gpg errors on my machine 😂).
Best, Martino
Here's an example snippet that reproduces the issue. Replace the text file with any large text file where each line is a document to be processed.
import spacy
import json
import time
from spacy.util import minibatch
from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
import multiprocessing
from typing import Iterable

# from tf_idf import get_tfidf_ranking

labels = [
    "entertainment",
    "politics",
    "world news",
    "contract",
    "sports",
    "business",
    "travel"
]


def chunk(list, n):
    for i in range(0, len(list), n):
        yield list[i:i + n]


model_name = "en_core_web_sm"

# Load the spaCy model and enable CUDA
nlp = spacy.load(model_name)
nlp.add_pipe("entityLinker", last=True)

print('*****************************************')
print("Model: ", model_name)

# get number of available cores
n_cores = multiprocessing.cpu_count()
print("Number of cores: ", n_cores)


def process_messages(input_data):
    _ = [d for d in nlp.pipe(input_data,
                             n_process=n_cores,
                             )]
    return _


batch_size = 100
num_iterations = 10
text_file = "imdb_preprocessed.txt"

# Start the timer
start_time = time.time()

# loop through text file line by line
with open(text_file, "r") as f:
    text_data = f.readlines()

count = 0

# batch the text_data lines to batch_size
for doc_batch in chunk(text_data, batch_size):
    results = process_messages(doc_batch)
    for result in results:
        print("result:" + str(result))
        for ent in result.ents:
            les = ent._.linkedEntities if isinstance(ent._.linkedEntities, Iterable) else [
                ent._.linkedEntities
            ]
            linked_entities = []
            for le in les:
                if le:
                    tSes = le.get_super_entities() if isinstance(le.get_super_entities(),
                                                                 Iterable) else [le.get_super_entities()]
                    ses = [{
                        "id": se.get_id(),
                        "label": se.get_label()
                    } for se in tSes]
                    linked_entities.append({
                        "desc": le.get_description(),
                        "url": le.get_url(),
                        "kb_id": le.get_id(),
                        "label": le.get_label(),
                        "span": str(le.get_span()),
                        "super_entities": ses,
                    })
            print(json.dumps(
                {
                    "entity_group": ent.label_,
                    "word": ent.text,
                    "linked_entities": linked_entities,
                },
                indent=4))
    if (count == num_iterations):
        break
    count += 1

end_time = time.time()
total_time = end_time - start_time

# Print the total time and the number of docs processed per second
print("Batch size: ", batch_size)
print("Number of iterations: ", num_iterations)
print("total lines processed: ", count * batch_size)
print("Total time: ", total_time)
print("Docs per second: ", (batch_size * num_iterations) / total_time)
print('****************************************')
With the last commits on the multiprocessing branch, you should be able to use all the cores without problems.
I tested your script and, with a small edit, it runs without exceptions (from line 29 onwards the code is wrapped inside an if __name__ == '__main__': guard to avoid a common multiprocessing issue: https://stackoverflow.com/questions/55057957/an-attempt-has-been-made-to-start-a-new-process-before-the-current-process-has-f).
import spacy
import json
import time
from spacy.util import minibatch
from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
import multiprocessing
from typing import Iterable

# from tf_idf import get_tfidf_ranking

labels = [
    "entertainment",
    "politics",
    "world news",
    "contract",
    "sports",
    "business",
    "travel"
]


def chunk(list, n):
    for i in range(0, len(list), n):
        yield list[i:i + n]


model_name = "en_core_web_sm"

if __name__ == '__main__':
    # Load the spaCy model and enable CUDA
    nlp = spacy.load(model_name)
    nlp.add_pipe("entityLinker", last=True)

    print('*****************************************')
    print("Model: ", model_name)

    # get number of available cores
    n_cores = multiprocessing.cpu_count()
    print("Number of cores: ", n_cores)

    def process_messages(input_data):
        _ = [d for d in nlp.pipe(input_data,
                                 n_process=n_cores,
                                 )]
        return _

    batch_size = 100
    num_iterations = 10
    text_file = "imdb_preprocessed.txt"

    # Start the timer
    start_time = time.time()

    # loop through text file line by line
    with open(text_file, "r") as f:
        text_data = f.readlines()

    count = 0

    # batch the text_data lines to batch_size
    for doc_batch in chunk(text_data, batch_size):
        results = process_messages(doc_batch)
        for result in results:
            print("result:" + str(result))
            for ent in result.ents:
                les = ent._.linkedEntities if isinstance(ent._.linkedEntities, Iterable) else [
                    ent._.linkedEntities
                ]
                linked_entities = []
                for le in les:
                    if le:
                        tSes = le.get_super_entities() if isinstance(le.get_super_entities(),
                                                                     Iterable) else [le.get_super_entities()]
                        ses = [{
                            "id": se.get_id(),
                            "label": se.get_label()
                        } for se in tSes]
                        linked_entities.append({
                            "desc": le.get_description(),
                            "url": le.get_url(),
                            "kb_id": le.get_id(),
                            "label": le.get_label(),
                            "span": str(le.get_span()),
                            "super_entities": ses,
                        })
                print(json.dumps(
                    {
                        "entity_group": ent.label_,
                        "word": ent.text,
                        "linked_entities": linked_entities,
                    },
                    indent=4))
        if (count == num_iterations):
            break
        count += 1

    end_time = time.time()
    total_time = end_time - start_time

    # Print the total time and the number of docs processed per second
    print("Batch size: ", batch_size)
    print("Number of iterations: ", num_iterations)
    print("total lines processed: ", count * batch_size)
    print("Total time: ", total_time)
    print("Docs per second: ", (batch_size * num_iterations) / total_time)
    print('****************************************')
Please @cphoover, could you tell me if this works for you too? You will need to install the package from source from the multiprocessing branch:
pip install git+https://github.com/egerber/spaCy-entity-linker@multiprocessing
The only limitation comes when you try to read the .span property or call the .get_span() method, as they will return None.
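In the script above this means the "span" field would just serialize as the string "None"; a small fallback helper (hypothetical, not part of the package) that uses the spaCy ent the linked entity was attached to works around it:

# Sketch of a fallback: prefer the memorized span when it is available,
# otherwise use the surface text of the spaCy entity itself.
def span_text(linked_entity, ent):
    span = linked_entity.get_span()
    return str(span) if span is not None else ent.text

In the snippet above you would then build the dict with "span": span_text(le, ent).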
If anyone has any ideas on how to resolve the serialization of span, feel free to comment!
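One possible direction (just an idea; the class below is hypothetical and not part of the package): store the token offsets instead of the Span itself and rebuild the view lazily from the Doc after deserialization.

from dataclasses import dataclass

from spacy.tokens import Doc, Span


@dataclass
class SerializableSpanRef:
    # Token offsets are plain ints, so they survive pickle/msgpack easily.
    start: int
    end: int

    @classmethod
    def from_span(cls, span: Span) -> "SerializableSpanRef":
        return cls(start=span.start, end=span.end)

    def resolve(self, doc: Doc) -> Span:
        # Rebuilding the Span is cheap: it is only a view over the Doc.
        return doc[self.start:self.end]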
I tried the following:
- using span_group instead of span: this seems to work, as span_group has the to_bytes and from_bytes methods, but it also requires saving a doc, which is itself not easy to serialize;
- serializing the doc with DocBin (see the sketch below), but I got lost when trying to restore a doc from DocBin, as it needs an instance of the nlp object.

I added some tests to check that everything is restored properly, and at the moment one test fails when span is checked.
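For reference, a minimal DocBin round trip looks roughly like this (a sketch of the standard spaCy API, not the package code); the vocab argument to get_docs() is exactly where the dependency on an nlp object comes in:

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")
doc = nlp("Angela Merkel visited Paris.")

# Serialize one or more Docs, including user data (extension attributes).
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
data = doc_bin.to_bytes()

# Deserializing needs a Vocab, i.e. some nlp object has to be available.
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
print(restored[0].ents)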
Best, Martino
@MartinoMensio Yes, this fixes my issue... Here are two benchmarks, one with entity linking and one with only the NER:
On an AWS c5a.2xlarge machine:
Model: en_core_web_sm
With entity linking!
Using CPU
Number of cores: 8
Batch size: 2000
Number of iterations: 20
total lines processed: 40000
Total time: 439.9979598522186
Docs per second: 90.90951242918202

Model: en_core_web_sm
Using CPU
Number of cores: 8
Batch size: 2000
Number of iterations: 20
total lines processed: 40000
Total time: 134.31675672531128
Docs per second: 297.8034980535099
I'm seeing an issue where the pipeline crashes during msgpack serialization when I set n_process > 1.
I want to leverage all cores on my AWS c5a.2xlarge machine, so I have set n_process to the number of cores.
But now my entity linking pipeline is crashing.
Below is a paste of the stacktrace: