explosion / wikid

Generate a SQLite database from Wikipedia & Wikidata dumps.

`create_kb` step using an unfiltered dump runs out of memory #32

Open c-koster opened 1 year ago

c-koster commented 1 year ago

Hello!

I am working to create a knowledge base using the latest (unfiltered) English wiki dumps. I've successfully followed the steps in benchmarks/nel up to wikid_parse, producing a 20 GB en/wiki.sqlite3 file.

However when I run the next step wikid_create_kb, my machine runs out of memory in two places:

  1. retrieving entities here, which I think I resolved by raising PRAGMA mmap_size for the SQLite database (see the sketch below the command output).

  2. computing description vectors for all the entities here. What kind of machine did y'all get this to work on? My estimate says that 16 GB of memory should be enough, but this step quickly crashes my machine:

    Running command: env PYTHONPATH=scripts python ./scripts/create_kb.py en_core_web_lg en
    Inferring entity embeddings:   1%|▎  
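Regarding item 1, this is roughly what I mean by adjusting the mmap limit (a minimal sketch; the database path and the size are illustrative, not values taken from wikid):

    import sqlite3

    # mmap_size is a per-connection setting, in bytes; SQLite silently clamps it
    # to the compile-time SQLITE_MAX_MMAP_SIZE limit.
    db = sqlite3.connect("output/en/wiki.sqlite3")  # illustrative path
    db.execute("PRAGMA mmap_size = 1073741824;")  # e.g. 1 GiB
    print(db.execute("PRAGMA mmap_size;").fetchone())  # check what actually took effect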

Here is a spacy info dump:

spaCy version    3.7.2                         
Location         /opt/conda/lib/python3.10/site-packages/spacy
Platform         Linux-5.10.0-26-cloud-amd64-x86_64-with-glibc2.31
Python version   3.10.12                       
Pipelines        en_core_web_lg (3.7.0)  

Thank you!

rmitsch commented 1 year ago

Hi @c-koster! Thanks for bringing that up. We indeed ran this on a machine with a lot of memory (120 GB), so it's possible we overlooked memory bottlenecks. Can you post the complete stack trace?

One way we could fix this would be to turn these two calls into generators.
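Simplified, the kind of change I have in mind (variable names only loosely mirror scripts/create_kb.py; this is not a drop-in patch):

    # Instead of materializing every vector up front with a list comprehension,
    #     desc_vectors = [doc.vector for doc in nlp.pipe(descriptions)]
    # a generator expression keeps only the current nlp.pipe batch alive:
    desc_vectors = (doc.vector for doc in nlp.pipe(descriptions))
    for qid, vector in zip(qids, desc_vectors):
        ...  # consume one vector at a time, e.g. add it to the KB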

culton-nwo commented 1 year ago

Hi, @rmitsch!

I re-ran the script with en_core_web_md and had the same issues. I was unable to recover a stack trace (the command fails silently). However, here is a line from /var/log/messages:

Oct 30 16:37:42 ck-wikidata kernel: [428663.963815] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=gce_instance_monitor.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-3.scope,task

I'm currently looking into limiting the results based on link count. A threshold of 20 links (which I got from this spaCy video on entity linking) yields about 1.8M entities.

Also, re: making the create_kb step more memory efficient, I'd be happy to take this on. In particular, I see two changes to the code which are probably separate:

  1. modifying the SQL queries to allow for link count thresholds (see the sketch below)
  2. replacing the list comprehensions with generators and/or setting a good mmap limit in the ddl file
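For item 1, the shape of the query change I have in mind is roughly the following (the schema, table, and column names are hypothetical, just to illustrate the idea):

    import sqlite3

    MIN_LINK_COUNT = 20  # the magic threshold from the spaCy NEL video

    # Hypothetical schema: keep only entities that are the target of at least
    # MIN_LINK_COUNT incoming links, so the resulting KB fits in memory.
    query = """
        SELECT e.qid, e.name, e.description
        FROM entities e
        JOIN links l ON l.target_qid = e.qid
        GROUP BY e.qid
        HAVING COUNT(*) >= ?
    """
    db = sqlite3.connect("output/en/wiki.sqlite3")  # illustrative path
    filtered_entities = db.execute(query, (MIN_LINK_COUNT,)).fetchall()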

Thanks for your help!

rmitsch commented 1 year ago

> Also, re: making the create_kb step more memory efficient, I'd be happy to take this on. In particular, I see two changes to the code which are probably separate:
>
>   1. modifying the SQL queries to allow for link count thresholds
>   2. replacing the list comprehensions with generators and/or setting a good mmap limit in the ddl file

Yes, these are both good options. Ideally we'd get this working without needing link count thresholds, as that's a magic number that might be hard for users to set properly. Either way, this will require a deeper look into the memory bottlenecks in wikid. If you want to look into this, I'm happy to support you along the way! Otherwise I'll update here once we've gotten around to it.

c-koster commented 1 year ago

Hello @rmitsch

I did some investigating of the create_kb step with a memory profiler (both a time-based sampling approach and a line-by-line profile of the main function).

First, here are some lines from a profile of the create_kb step with filtered dumps:

    19  220.223 MiB  220.223 MiB           1   @profile
    20                                         def main(vectors_model: str, language: str):
    27 1036.113 MiB  815.891 MiB           1       nlp = spacy.load(vectors_model, exclude=["tagger", "lemmatizer", "attribute_ruler"])
    29 1036.113 MiB    0.000 MiB           1       logger.info("Constructing knowledge base.")
    30 1036.113 MiB    0.000 MiB           1       kb = DefaultKB(vocab=nlp.vocab, entity_vector_length=nlp.vocab.vectors_length)
    31 1036.113 MiB    0.000 MiB           1       entity_list: List[str] = []
    32 1036.113 MiB    0.000 MiB           1       count_list: List[int] = []
    33 1036.113 MiB    0.000 MiB           1       vector_list: List[numpy.ndarray] = []  # type: ignore
    34 1147.918 MiB  111.805 MiB           1       entities = wiki.load_entities(language=language)
    35 1149.723 MiB    1.289 MiB       76016       ent_descriptions = {
    36 1149.723 MiB    0.000 MiB       76012           qid: entities[qid].description
    37 1149.723 MiB    0.000 MiB       38006           if entities[qid].description
    38                                                 else (
    39 1149.723 MiB    0.516 MiB        3063               entities[qid].article_text[:200]
    40 1149.723 MiB    0.000 MiB        3063               if entities[qid].article_text
    41 1149.723 MiB    0.000 MiB         866               else entities[qid].name
    42                                                 )
    43 1149.723 MiB    0.000 MiB       38007           for qid in entities.keys()
    44                                             }
    47 1188.328 MiB   36.246 MiB       76016       desc_vectors = [
    48 1187.191 MiB    0.105 MiB       38006           doc.vector
    49 1187.191 MiB    0.387 MiB       38008           for doc in tqdm.tqdm(
    50 1149.723 MiB    0.000 MiB           2               nlp.pipe(
    51 1149.723 MiB    0.000 MiB       38009                   texts=[ent_descriptions[qid] for qid in entities.keys()], n_process=-1
    52                                                     ),
    53 1149.723 MiB    0.000 MiB           1               total=len(entities),
    54 1149.723 MiB    0.000 MiB           1               desc="Inferring entity embeddings",
    55                                                 )
    56                                             ]
    57 1186.465 MiB   -1.863 MiB       38007       for qid, desc_vector in zip(entities.keys(), desc_vectors):
    58 1186.465 MiB    0.000 MiB       38006           entity_list.append(qid)
    59 1186.465 MiB    0.000 MiB       38006           count_list.append(entities[qid].count)
    60 1186.465 MiB    0.000 MiB       76012           vector_list.append(
    61 1186.465 MiB    0.000 MiB       38006               desc_vector if isinstance(desc_vector, numpy.ndarray) else desc_vector.get()
    62                                                 )
    63 1233.895 MiB   47.430 MiB           2       kb.set_entities(
    64 1186.465 MiB    0.000 MiB           1           entity_list=entity_list, vector_list=vector_list, freq_list=count_list
    65                                             )
    66
    67                                             # Add aliases with normalized priors to KB. This won't be necessary with a custom KB.
    68 1261.734 MiB   27.840 MiB           2       alias_entity_prior_probs = wiki.load_alias_entity_prior_probabilities(
    69 1233.895 MiB    0.000 MiB           1           language=language
    70                                             )
    71 1271.395 MiB    0.250 MiB       62533       for alias, entity_prior_probs in alias_entity_prior_probs.items():
    72 1271.395 MiB    7.020 MiB      125064           kb.add_alias(
    73 1271.395 MiB    0.000 MiB       62532               alias=alias,
    74 1271.395 MiB    1.137 MiB      258330               entities=[epp[0] for epp in entity_prior_probs],
    75 1271.395 MiB    1.254 MiB      258330               probabilities=[epp[1] for epp in entity_prior_probs],
    76                                                 )
    77                                             # Add pseudo aliases for easier lookup with new candidate generators.
    78 1280.598 MiB    1.480 MiB       38007       for entity_id in entity_list:
    79 1280.598 MiB    7.074 MiB       76012           kb.add_alias(
    80 1280.598 MiB    0.648 MiB       38006               alias="_" + entity_id + "_", entities=[entity_id], probabilities=[1]
    81                                                 )
    82
    83                                             # Serialize knowledge base & pipeline.
    84 1280.598 MiB    0.000 MiB           1       output_dir = Path(os.path.abspath(__file__)).parent.parent / "output"
    85 1332.773 MiB   52.176 MiB           1       kb.to_disk(output_dir / language / "kb")
    86 1332.773 MiB    0.000 MiB           1       nlp_dir = output_dir / language / "nlp"
    87 1332.773 MiB    0.000 MiB           1       os.makedirs(nlp_dir, exist_ok=True)

Something I notice is that, aside from the spacy.load step (whose cost will stay constant for the unfiltered dumps), the wiki.load_entities step is the most expensive. I think the embedding and the creation of knowledge base entries could happen in a single streaming pass, without incurring this memory cost, if load_entities returned a generator instead of a dictionary.
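A rough sketch of what I mean, assuming load_entities could be turned into a generator of (qid, entity) pairs (that signature is hypothetical) and using spaCy's InMemoryLookupKB.add_entity to add entries one at a time instead of set_entities:

    import numpy

    def iter_descriptions(entity_iter):
        """Yield (qid, count, description text) lazily, mirroring the dict comprehension above."""
        for qid, entity in entity_iter:
            desc = entity.description or (
                entity.article_text[:200] if entity.article_text else entity.name
            )
            yield qid, entity.count, desc

    def add_entities_streaming(kb, nlp, entity_iter, batch_size=256):
        # nlp.pipe(..., as_tuples=True) carries (qid, count) alongside each text,
        # so only one batch of Doc objects is alive at any time.
        tuples = (
            (desc, (qid, count)) for qid, count, desc in iter_descriptions(entity_iter)
        )
        for doc, (qid, count) in nlp.pipe(tuples, as_tuples=True, batch_size=batch_size):
            vector = doc.vector
            kb.add_entity(
                entity=qid,
                freq=count,
                entity_vector=vector if isinstance(vector, numpy.ndarray) else vector.get(),
            )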

I also did a time-based memory profile of the full (unfiltered) run, but it was harder to interpret. Memory usage spikes to about 20 GB and then falls off. I suspect this is related to a very large GROUP BY query being run locally, but I haven't tested that thoroughly.

[memplot: time-based memory usage plot for the unfiltered run]

Some questions:

  1. What is the motivation to have the load_entities function return a dictionary of QIDs to Entities? Could this be modified to return a generator to save some memory?
  2. In lieu of tests for the create_kb step, how will I know the code is still correct? Will two knowledge bases have identical hashes or diffs if entities are passed in different orders? (If not, a content-level check like the sketch below is what I'd fall back on.)
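In case byte-level comparison isn't stable, this is roughly the content-level check I have in mind (the paths and vectors model are illustrative, and it assumes the serialized KB can be loaded back with spaCy's InMemoryLookupKB):

    import numpy
    import spacy
    from spacy.kb import InMemoryLookupKB

    def load_kb(path, nlp):
        kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=nlp.vocab.vectors_length)
        kb.from_disk(path)
        return kb

    def kbs_equivalent(path_a, path_b, vectors_model="en_core_web_lg"):
        """Compare two KBs by contents (entities, aliases, vectors) rather than by file hash."""
        nlp = spacy.load(vectors_model)
        kb_a, kb_b = load_kb(path_a, nlp), load_kb(path_b, nlp)
        if set(kb_a.get_entity_strings()) != set(kb_b.get_entity_strings()):
            return False
        if set(kb_a.get_alias_strings()) != set(kb_b.get_alias_strings()):
            return False
        # Prior probabilities could be checked similarly via get_prior_prob().
        return all(
            numpy.allclose(kb_a.get_vector(qid), kb_b.get_vector(qid))
            for qid in kb_a.get_entity_strings()
        )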

Thanks!