How can I index a database that is not in English?

Lbaiall commented 5 months ago

I replaced sample_corpus.json with my own data and indexing finished without errors, but when I run retrieval with my queries, the output is only in English, even though my source data is Chinese. It's weird.

(.venv) root@DESKTOP-9O20ND7:/home/ai/HippoRAG# bash test.sh
Building Graph: 100%|██████████| 406/406 [00:00<00:00, 941862.51it/s]
Graph built: num vertices: 182 num edges: 404
[Jun 16, 16:59:37] #> Loading codec...
[Jun 16, 16:59:37] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 16, 16:59:37] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 16, 16:59:37] #> Loading IVF...
[Jun 16, 16:59:37] #> Loading doclens...
100%|██████████| 1/1 [00:00<00:00, 4888.47it/s]
[Jun 16, 16:59:37] #> Loading codes and residuals...
100%|██████████| 1/1 [00:00<00:00, 64.21it/s]
[Jun 16, 16:59:37] #> Loading collection...
0M
[Jun 16, 16:59:38] #> Loading codec...
[Jun 16, 16:59:38] #> Loading IVF...
[Jun 16, 16:59:38] #> Loading doclens...
100%|██████████| 1/1 [00:00<00:00, 3063.77it/s]
[Jun 16, 16:59:38] #> Loading codes and residuals...
100%|██████████| 1/1 [00:00<00:00, 75.54it/s]
1it [00:00, 32.58it/s]
1it [00:00, 145.25it/s]
pagerank chunk: 100%|██████████| 1/1 [00:00<00:00, 860.90it/s]
Query: 什么是分类号？
Ranks: [3, 87, 64, 57, 14, 85, 79, 41, 68, 6]
Scores: [1.0, 0.6871852830001737, 0.5621990716874753, 0.4997289010333189, 0.49937378235007723, 0.43714712744004813, 0.4370691002865559, 0.37531060829747986, 0.3747681936977789, 0.3747647873312091]
Logs: {'named_entities': [''], 'linked_node_scores': [['', '', 1.0]], '1-hop_graph_for_linked_nodes': [['library book shelving', 1.0], ['document type number', 2.0], ['classification number', 5.0], ['sequence number', 5.0], ['book label', 1.0], ['one book', 1.0], ['book s position on shelf', 1.0], ['reader to find books', 1.0], ['book in library collection', 1.0], ['chinese book', 1.0], ['tp311 56', 1.0], ['54', 1.0], ['library collection book', 1.0], ['171 50', 2.0], ['1000', 2.0], ['met', 1.0], ['xopac', 1.0], ['vipexam', 1.0], ['apabi', 2.0], ['237', 1.0], ['library book shelving', 1.0, 'inv'], ['document type number', 2.0, 'inv'], ['classification number', 5.0, 'inv'], ['sequence number', 5.0, 'inv'], ['book label', 1.0, 'inv'], ['one book', 1.0, 'inv'], ['book s position on shelf', 1.0, 'inv'], ['reader to find books', 1.0, 'inv'], ['book in library collection', 1.0, 'inv'], ['chinese book', 1.0, 'inv'], ['tp311 56', 1.0, 'inv'], ['54', 1.0, 'inv'], ['library collection book', 1.0, 'inv'], ['171 50', 2.0, 'inv'], ['1000', 2.0, 'inv'], ['met', 1.0, 'inv'], ['xopac', 1.0, 'inv'], ['vipexam', 1.0, 'inv'], ['apabi', 2.0, 'inv'], ['237', 1.0, 'inv']], 'top_ranked_nodes': ['', '1', '5', 'sequence number', 'classification number', '30', '3', '209', '1 4', 'mac', 'wan', 'internet', '10', '10 1', 'device s browser', '10 2', '2013 10 15', 'win7', 'win10', 'document type number'], 'nodes_in_retrieved_doc': [['', '54', 'book in library collection', 'book label', 'book s position on shelf', 'chinese book', 'classification number', 'document type number', 'library book shelving', 'library collection book', 'one book', 'reader to find books', 'sequence number', 'tp311 56'], ['', '114', '116114', '12580', '2013 10 15'], ['', 'dns', 'internet', 'lan', 'qq'], ['', 'browser', 'client', 'device s browser', 'inenu 2g', 'inenu 5g', 'wifi', 'wired user', 'wireless user'], ['']]}

yhshu commented 5 months ago

This may require multiple steps of adaptation:

1. The extraction model needs to support other languages: a slight modification of the GPT prompt should be enough to make the extraction work for other languages.
2. The retriever needs to be replaced with a model that supports your language: Contriever or ColBERTv2 may not support Chinese, so you can look for other models on Hugging Face.
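
As a rough sketch of the first step, assuming the extraction prompt is a plain instruction string sent to the LLM, the change could be as small as appending a rule that forbids translation. The `BASE_INSTRUCTION` text, the `build_extraction_prompt` helper, and the sample passage below are all hypothetical and are not HippoRAG's actual prompt or API:

```python
# Hypothetical sketch: this is NOT HippoRAG's real prompt or API.
BASE_INSTRUCTION = (
    "Extract named entities and (subject, relation, object) triples "
    "from the passage below. Return JSON."
)


def build_extraction_prompt(passage: str, language: str = "Chinese") -> str:
    """Append a language rule so the model keeps entities in the source language."""
    language_rule = (
        f"The passage is written in {language}. Keep every entity and relation "
        f"in {language} exactly as it appears in the passage; do not translate."
    )
    return f"{BASE_INSTRUCTION}\n{language_rule}\n\nPassage:\n{passage}"


if __name__ == "__main__":
    # Made-up sample passage, only for illustration.
    print(build_extraction_prompt("分类号是图书排架的依据。"))
```

The same rule would have to be applied to the query-side entity extraction as well, otherwise query entities and graph nodes could end up in different languages.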

Since you only posted your retrieval process here, I don't know how you did the extraction. You may have to start by adapting the extraction/indexing step first. See the indexing section in README.md for more details.
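
One quick way to check whether the problem starts at indexing: the node names in your log ('classification number', 'book label', and so on) are all English, which suggests the extraction step translated the Chinese text. A minimal check, assuming you can get the node names as a list of strings (the `cjk_ratio` helper is hypothetical, not part of HippoRAG):

```python
import re

# Matches characters in the CJK Unified Ideographs block (common Chinese characters).
CJK = re.compile(r"[\u4e00-\u9fff]")


def cjk_ratio(nodes: list[str]) -> float:
    """Fraction of non-empty node names that contain at least one CJK character."""
    names = [n for n in nodes if n.strip()]
    if not names:
        return 0.0
    return sum(bool(CJK.search(n)) for n in names) / len(names)


# A few node names copied from the retrieval log in this thread.
logged_nodes = ["", "classification number", "sequence number", "book label", "chinese book"]

if __name__ == "__main__":
    print(cjk_ratio(logged_nodes))              # 0.0
    print(cjk_ratio(["分类号", "book label"]))   # 0.5
```

A ratio near 0 on a Chinese corpus means the graph was built from translated entities, so the extraction prompt needs fixing before the retriever is swapped.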

Lbaiall commented 5 months ago

@yhshu Appreciate it! For the first step, I have already tried a different prompt in the code, but nothing changed. I will try the second step later and report the new results once I have tested it.

OSU-NLP-Group / HippoRAG

How can I index a database that is not in English? #20