datastax / ragstack-ai

RAGStack is an out of the box solution simplifying Retrieval Augmented Generation (RAG) in AI apps.
https://www.datastax.com/products/ragstack
Other
121 stars 11 forks source link

EOFError when running the ColBERT example #418

Closed superkelvint closed 3 months ago

superkelvint commented 3 months ago

I tried running this code here: https://docs.datastax.com/en/ragstack/examples/colbert.html

from ragstack_colbert import ColbertEmbeddingModel, ChunkData

arctic_botany_dict = {
    "Introduction to Arctic Botany": "Arctic botany is the study of plant life in the Arctic, a region characterized by extreme cold, permafrost, and minimal sunlight for much of the year. Despite these harsh conditions, a diverse range of flora thrives here, adapted to survive with minimal water, low temperatures, and high light levels during the summer. This introduction aims to shed light on the resilience and adaptation of Arctic plants, setting the stage for a deeper dive into the unique botanical ecosystem of the Arctic.",
    "Arctic Plant Adaptations": "Plants in the Arctic have developed unique adaptations to endure the extreme climate. Perennial growth, antifreeze proteins, and a short growth cycle are among the evolutionary solutions. These adaptations not only allow the plants to survive but also to reproduce in short summer months. Arctic plants often have small, dark leaves to absorb maximum sunlight, and some species grow in cushion or mat forms to resist cold winds. Understanding these adaptations provides insights into the resilience of Arctic flora.",
    "The Tundra Biome": "The Arctic tundra is a vast, treeless biome where the subsoil is permanently frozen. Here, the vegetation is predominantly composed of dwarf shrubs, grasses, mosses, and lichens. The tundra supports a surprisingly rich biodiversity, adapted to its cold, dry, and windy conditions. The biome plays a crucial role in the Earth's climate system, acting as a carbon sink. However, it's sensitive to climate change, with thawing permafrost and shifting vegetation patterns.",
    "Arctic Plant Biodiversity": "Despite the challenging environment, the Arctic boasts a significant variety of plant species, each adapted to its niche. From the colorful blooms of Arctic poppies to the hardy dwarf willows, these plants form a complex ecosystem. The biodiversity of Arctic flora is vital for local wildlife, providing food and habitat. This diversity also has implications for Arctic peoples, who depend on certain plant species for food, medicine, and materials.",
    "Climate Change and Arctic Flora": "Climate change poses a significant threat to Arctic botany, with rising temperatures, melting permafrost, and changing precipitation patterns. These changes can lead to shifts in plant distribution, phenology, and the composition of the Arctic flora. Some species may thrive, while others could face extinction. This dynamic is critical to understanding future Arctic ecosystems and their global impact, including feedback loops that may exacerbate global warming.",
    "Research and Conservation in the Arctic": "Research in Arctic botany is crucial for understanding the intricate balance of this ecosystem and the impacts of climate change. Scientists conduct studies on plant physiology, genetics, and ecosystem dynamics. Conservation efforts are focused on protecting the Arctic's unique biodiversity through protected areas, sustainable management practices, and international cooperation. These efforts aim to preserve the Arctic flora for future generations and maintain its role in the global climate system.",
    "Traditional Knowledge and Arctic Botany": "Indigenous peoples of the Arctic have a deep connection with the land and its plant life. Traditional knowledge, passed down through generations, includes the uses of plants for nutrition, healing, and materials. This body of knowledge is invaluable for both conservation and understanding the ecological relationships in Arctic ecosystems. Integrating traditional knowledge with scientific research enriches our comprehension of Arctic botany and enhances conservation strategies.",
    "Future Directions in Arctic Botanical Studies": "The future of Arctic botany lies in interdisciplinary research, combining traditional knowledge with modern scientific techniques. As the Arctic undergoes rapid changes, understanding the ecological, cultural, and climatic dimensions of Arctic flora becomes increasingly important. Future research will need to address the challenges of climate change, explore the potential for Arctic plants in biotechnology, and continue to conserve this unique biome. The resilience of Arctic flora offers lessons in adaptation and survival relevant to global challenges."
}
arctic_botany_chunks = list(arctic_botany_dict.values())

# colbert stuff starts
colbert = ColbertEmbeddingModel()

chunks = [ChunkData(text=text, metadata={}) for text in arctic_botany_chunks]
embedded_chunks = colbert.embed_chunks(chunks=chunks, doc_id="arctic botany")
print(embedded_chunks)

I get this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/tmp/test_colbert.py", line 19, in <module>
    embedded_chunks = colbert.embed_chunks(chunks=chunks, doc_id="arctic botany")
  File "/tmp/test_colbert/venv/lib/python3.10/site-packages/ragstack_colbert/colbert_embedding_model.py", line 176, in embed_chunks
    embedded_texts = self._encode_texts_using_distributed(texts=texts, timeout=timeout)
  File "/tmp/test_colbert/venv/lib/python3.10/site-packages/ragstack_colbert/colbert_embedding_model.py", line 256, in _encode_texts_using_distributed
    return runner.encode(
  File "/tmp/test_colbert/venv/lib/python3.10/site-packages/ragstack_colbert/distributed/runner.py", line 134, in encode
    manager = mp.Manager()
  File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 562, in start
    self._process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "/tmp/test_colbert.py", line 19, in <module>
    embedded_chunks = colbert.embed_chunks(chunks=chunks, doc_id="arctic botany")
  File "/tmp/test_colbert/venv/lib/python3.10/site-packages/ragstack_colbert/colbert_embedding_model.py", line 176, in embed_chunks
    embedded_texts = self._encode_texts_using_distributed(texts=texts, timeout=timeout)
  File "/tmp/test_colbert/venv/lib/python3.10/site-packages/ragstack_colbert/colbert_embedding_model.py", line 256, in _encode_texts_using_distributed
    return runner.encode(
  File "/tmp/test_colbert/venv/lib/python3.10/site-packages/ragstack_colbert/distributed/runner.py", line 134, in encode
    manager = mp.Manager()
  File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in start
    self._address = reader.recv()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

However, when I run it from a Python console, it works OK.

superkelvint commented 3 months ago

Moving the code to a main block seemed to fix it.