Open · c-koster opened this issue 1 year ago
Hi @c-koster! Thanks for bringing that up. We indeed ran this on a machine with a lot of memory (120 GB), so it's possible we overlooked memory bottlenecks. Can you post the complete stack trace?
One way we could fix this would be to turn these two calls into generators.
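Roughly this pattern, as a toy illustration rather than the actual wikid code:

```python
import spacy

nlp = spacy.blank("en")
descriptions = {"Q1": "first description", "Q2": "second description"}

# Before: the full list of texts (and later of vectors) is held in memory at once.
# texts = [descriptions[qid] for qid in descriptions]

# After: a generator expression yields one text at a time, and nlp.pipe()
# only keeps a small buffer of docs in flight.
texts = (descriptions[qid] for qid in descriptions)
for doc in nlp.pipe(texts):
    print(doc.text)
```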
Hi, @rmitsch!
I re-ran the script with `en_core_web_md` and had the same issues. I was unable to recover a stack trace (the command fails silently). However, here is a line from /var/log/messages:
Oct 30 16:37:42 ck-wikidata kernel: [428663.963815] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=gce_instance_monitor.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-3.scope,task
I'm currently looking into limiting the results based on link count. A threshold of 20 links (which I got from this spaCy video on entity linking) yields about 1.8M entities.
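Something like this is what I'm experimenting with for the threshold; the table and column names (`entities`, `link_count`) are placeholders, not wikid's actual schema:

```python
import sqlite3

LINK_COUNT_THRESHOLD = 20

def load_entity_ids(db_path: str, threshold: int = LINK_COUNT_THRESHOLD):
    """Yield QIDs for entities with at least `threshold` incoming links."""
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute(
            # Placeholder query: the real schema in wiki.sqlite3 may differ.
            "SELECT qid FROM entities WHERE link_count >= ?",
            (threshold,),
        )
        # Yield rows lazily instead of materializing a full list.
        for (qid,) in cursor:
            yield qid
```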
Also, re: making the `create_kb` step more memory efficient, I'd be happy to take this on. In particular, I see two changes to the code which are probably separate:

- modifying the SQL queries to allow for link count thresholds
- replacing the list comprehensions with generators and/or setting a good mmap limit in the DDL file (rough sketch below)

Thanks for your help!
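For the mmap point, this is roughly what I mean; the 2 GB cap is just an example value, and the pragma could be set either at connection time or in the DDL file:

```python
import sqlite3

# Example only: cap SQLite's memory-mapped I/O. The 2 GB value is arbitrary,
# and the same pragma could equally live in the DDL file that wikid applies.
MMAP_SIZE_BYTES = 2 * 1024**3

conn = sqlite3.connect("en/wiki.sqlite3")
conn.execute(f"PRAGMA mmap_size = {MMAP_SIZE_BYTES};")
conn.close()
```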
Yes, these are both good options. Ideally we'd get this working without the need for link count thresholds, as that's a magic number that might be hard for users to set properly. Anyway, this will require a deeper look into the memory bottlenecks in wikid. If you want to look into this, I'm happy to support you along the way! Otherwise I'll update here once we've gotten around to doing so.
Hello @rmitsch,
I did some investigating of the `create_kb` step with a memory profiler (both a time-based sampling approach and a line-by-line profile of the main function).
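For reference, this is roughly how I collected the numbers with memory_profiler (the script path and arguments are illustrative):

```python
# Line-by-line profile: decorate the entry point with memory_profiler's
# @profile decorator and run the script through the module, e.g.
#   python -m memory_profiler scripts/create_kb.py en_core_web_md en
from memory_profiler import profile

@profile
def main(vectors_model: str, language: str):
    ...

# Time-based sampling uses the bundled mprof tool from the shell:
#   mprof run scripts/create_kb.py en_core_web_md en
#   mprof plot
```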
First, here are some lines from a profile of the `create_kb` step with filtered dumps:
19 220.223 MiB 220.223 MiB 1 @profile
20 def main(vectors_model: str, language: str):
27 1036.113 MiB 815.891 MiB 1 nlp = spacy.load(vectors_model, exclude=["tagger", "lemmatizer", "attribute_ruler"])
29 1036.113 MiB 0.000 MiB 1 logger.info("Constructing knowledge base.")
30 1036.113 MiB 0.000 MiB 1 kb = DefaultKB(vocab=nlp.vocab, entity_vector_length=nlp.vocab.vectors_length)
31 1036.113 MiB 0.000 MiB 1 entity_list: List[str] = []
32 1036.113 MiB 0.000 MiB 1 count_list: List[int] = []
33 1036.113 MiB 0.000 MiB 1 vector_list: List[numpy.ndarray] = [] # type: ignore
34 1147.918 MiB 111.805 MiB 1 entities = wiki.load_entities(language=language)
35 1149.723 MiB 1.289 MiB 76016 ent_descriptions = {
36 1149.723 MiB 0.000 MiB 76012 qid: entities[qid].description
37 1149.723 MiB 0.000 MiB 38006 if entities[qid].description
38 else (
39 1149.723 MiB 0.516 MiB 3063 entities[qid].article_text[:200]
40 1149.723 MiB 0.000 MiB 3063 if entities[qid].article_text
41 1149.723 MiB 0.000 MiB 866 else entities[qid].name
42 )
43 1149.723 MiB 0.000 MiB 38007 for qid in entities.keys()
44 }
47 1188.328 MiB 36.246 MiB 76016 desc_vectors = [
48 1187.191 MiB 0.105 MiB 38006 doc.vector
49 1187.191 MiB 0.387 MiB 38008 for doc in tqdm.tqdm(
50 1149.723 MiB 0.000 MiB 2 nlp.pipe(
51 1149.723 MiB 0.000 MiB 38009 texts=[ent_descriptions[qid] for qid in entities.keys()], n_process=-1
52 ),
53 1149.723 MiB 0.000 MiB 1 total=len(entities),
54 1149.723 MiB 0.000 MiB 1 desc="Inferring entity embeddings",
55 )
56 ]
57 1186.465 MiB -1.863 MiB 38007 for qid, desc_vector in zip(entities.keys(), desc_vectors):
58 1186.465 MiB 0.000 MiB 38006 entity_list.append(qid)
59 1186.465 MiB 0.000 MiB 38006 count_list.append(entities[qid].count)
60 1186.465 MiB 0.000 MiB 76012 vector_list.append(
61 1186.465 MiB 0.000 MiB 38006 desc_vector if isinstance(desc_vector, numpy.ndarray) else desc_vector.get()
62 )
63 1233.895 MiB 47.430 MiB 2 kb.set_entities(
64 1186.465 MiB 0.000 MiB 1 entity_list=entity_list, vector_list=vector_list, freq_list=count_list
65 )
66
67 # Add aliases with normalized priors to KB. This won't be necessary with a custom KB.
68 1261.734 MiB 27.840 MiB 2 alias_entity_prior_probs = wiki.load_alias_entity_prior_probabilities(
69 1233.895 MiB 0.000 MiB 1 language=language
70 )
71 1271.395 MiB 0.250 MiB 62533 for alias, entity_prior_probs in alias_entity_prior_probs.items():
72 1271.395 MiB 7.020 MiB 125064 kb.add_alias(
73 1271.395 MiB 0.000 MiB 62532 alias=alias,
74 1271.395 MiB 1.137 MiB 258330 entities=[epp[0] for epp in entity_prior_probs],
75 1271.395 MiB 1.254 MiB 258330 probabilities=[epp[1] for epp in entity_prior_probs],
76 )
77 # Add pseudo aliases for easier lookup with new candidate generators.
78 1280.598 MiB 1.480 MiB 38007 for entity_id in entity_list:
79 1280.598 MiB 7.074 MiB 76012 kb.add_alias(
80 1280.598 MiB 0.648 MiB 38006 alias="_" + entity_id + "_", entities=[entity_id], probabilities=[1]
81 )
82
83 # Serialize knowledge base & pipeline.
84 1280.598 MiB 0.000 MiB 1 output_dir = Path(os.path.abspath(__file__)).parent.parent / "output"
85 1332.773 MiB 52.176 MiB 1 kb.to_disk(output_dir / language / "kb")
86 1332.773 MiB 0.000 MiB 1 nlp_dir = output_dir / language / "nlp"
87 1332.773 MiB 0.000 MiB 1 os.makedirs(nlp_dir, exist_ok=True)
Something I notice is that aside from the `spacy.load` step (whose cost will be constant for the unfiltered dumps), the `wiki.load_entities` step is the most expensive. I think the encodings and the creation of knowledge base entities could be done all at once (without incurring this memory cost) if the `load_entities` function returned a generator instead of a dictionary.
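Roughly what I have in mind, as a sketch only (the query, column names, and Entity fields are stand-ins for whatever wikid actually stores in wiki.sqlite3):

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Entity:
    qid: str
    name: str
    description: str
    count: int

def load_entities(db_path: str, language: str) -> Iterator[Entity]:
    """Yield entities one at a time instead of building one giant dictionary.

    The query below is illustrative; the real schema in wiki.sqlite3 may differ.
    """
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute(
            "SELECT qid, name, description, count FROM entities"
        )
        for qid, name, description, count in cursor:
            yield Entity(qid=qid, name=name, description=description, count=count)
```

The `create_kb` step could then stream these straight into `nlp.pipe()` instead of first materializing every description and vector.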
I also did a time-based analysis of the full (unfiltered) step, but it made a lot less sense: the memory usage spikes to about 20 GB and then falls off. I suspect this is related to a very large group-by query being run locally, but I haven't tested this thoroughly.
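If it is the group-by, something along these lines would let me confirm how SQLite plans it (the query text is a placeholder, not the actual one wikid runs):

```python
import sqlite3

conn = sqlite3.connect("en/wiki.sqlite3")

# Placeholder aggregation: substitute the actual query wikid runs here.
query = "SELECT alias, COUNT(*) FROM aliases GROUP BY alias"

# EXPLAIN QUERY PLAN shows whether SQLite can use an index or has to build
# a temporary B-tree for the aggregation, which is where memory can balloon.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
conn.close()
```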
Some questions:
Thanks!
Hello!
I am working to create a knowledge base using the latest (unfiltered) English wiki dumps. I've successfully followed the steps in benchmarks/nel up to `wikid_parse` to produce a 20 GB `en/wiki.sqlite3` file. However, when I run the next step, `wikid_create_kb`, my machine runs out of memory in two places:

- retrieving entities here, which I think I resolved by modifying `PRAGMA mmap_size` on the SQLite database
- computing description vectors for all the entities here

What kind of machine did y'all get this to work on? My estimate says that 16 GB of memory should be fine, but this step quickly crashes my computer.
Here is a spacy info dump:
Thank you!