IAAR-Shanghai / Meta-Chunking

Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
Apache License 2.0

How can I use an API to replace the local LLM? #1

Closed pimooook closed 3 weeks ago

pimooook commented 1 month ago

Hello, I would like to replace Qwen with the DeepSeek API. How should I modify the code?

Robot2050 commented 1 month ago

Hello! This method currently targets scenarios where a local small or mid-sized model chunks large volumes of text, since that achieves a better balance between performance and efficiency while compensating for small models' weak instruction-following ability. To enable a small model to perform chunking, both Margin Sampling Chunking and Perplexity Chunking need the token probabilities predicted by the model (see example/app.py). A large-model API only returns the output text, not token probabilities, and would also be very costly, so for now we cannot provide a way to replace the local model with the DeepSeek API. However, based on our research and experimental experience, we can offer some feasible suggestions.

If cost is not a concern, large-model APIs follow instructions very well, so you can directly use the model's designated text output to guide chunking. For example, prompt the API to answer (yes/no) whether two sentences meet the splitting condition, or (to speed up processing) ask the model to find, among several sentences, the one at which to split. A rough sketch of this prompt-based approach is below.
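A minimal sketch of that yes/no idea, assuming DeepSeek's OpenAI-compatible chat endpoint; the model name, prompt wording, and the `should_split`/`chunk` helpers are illustrative and are not part of this repo:

```python
# Sketch: prompt-guided chunking via an OpenAI-compatible API (not the repo's method,
# which relies on token probabilities from a local model).
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

def should_split(sent1: str, sent2: str) -> bool:
    """Ask the API whether a chunk boundary belongs between two sentences."""
    prompt = (
        "Decide whether the following two sentences should be placed in "
        "separate chunks. Answer with a single word: yes or no.\n"
        f"Sentence 1: {sent1}\nSentence 2: {sent2}"
    )
    resp = client.chat.completions.create(
        model="deepseek-chat",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def chunk(sentences: list[str]) -> list[str]:
    """Greedily merge consecutive sentences until the API says to split."""
    chunks, current = [], [sentences[0]]
    for sent in sentences[1:]:
        if should_split(current[-1], sent):
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Note that this makes one API call per sentence pair, which is why the batched variant (asking the model to pick the split point among several sentences at once) is faster.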

halaneji commented 1 month ago

How can I use your model for testing, please? I have this error:

python example/app.py
Traceback (most recent call last):
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/Qwen2-1.5B-Instruct/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/transformers/utils/hub.py", line 402, in cached_file
    resolved_file = hf_hub_download(
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 862, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 969, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1484, in _raise_on_head_call_error
    raise head_call_error
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1376, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1296, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 277, in _request_wrapper
    response = _request_wrapper(
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 301, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 454, in hf_raise_for_status
    raise _format(RepositoryNotFoundError, message, response) from e
huggingface_hub.errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-671babd8-382ce5f57fa8cb17670c7df1;8d473559-2ff8-4d1f-808c-ca67b7a2eb8c)

Repository Not Found for url: https://huggingface.co/Qwen2-1.5B-Instruct/resolve/main/tokenizer_config.json. Please make sure you specified the correct repo_id and repo_type. If you are trying to access a private or gated repo, make sure you are authenticated. Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hala/work/segmentation/Meta-Chunking/example/app.py", line 9, in <module>
    small_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 834, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 666, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/transformers/utils/hub.py", line 425, in cached_file
    raise EnvironmentError(
OSError: Qwen2-1.5B-Instruct is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>

Robot2050 commented 1 month ago

You can first download the Qwen model weights locally and then set 'model_name_or_path' in app.py to the absolute path of the weights, e.g. '/your/model/path'. A minimal sketch of that change is shown below.
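A sketch of the relevant lines (the variable names match those visible in the traceback from example/app.py; the path and device_map value are placeholders for your setup):

```python
# Sketch: point app.py at locally downloaded weights instead of a Hub repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "/your/model/path/Qwen2-1.5B-Instruct"  # absolute local path
device_map = "auto"  # or e.g. "cuda:0"

small_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
small_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, trust_remote_code=True, device_map=device_map
)
```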

Robot2050 commented 1 month ago

You can find the model weights through either of the following two links.

https://modelscope.cn/models
https://huggingface.co/models

halaneji commented 1 month ago

Could you please write the name of the model? There is a big list of models at this link: https://modelscope.cn/models

Robot2050 commented 1 month ago

Qwen2-1.5B-Instruct or https://modelscope.cn/models/Qwen/Qwen2-1.5B-Instruct/files

Once you have successfully run the code with this model, you can try other models as well, because our method is universal and not tied to any specific model.

halaneji commented 1 month ago

Many thanks! Can this model be used for Latin text?

halaneji commented 1 month ago

I still get this problem:

python example/app.py
Traceback (most recent call last):
  File "app.py", line 10, in <module>
    small_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, device_map=device_map)
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/home/hala/anaconda3/envs/env3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3993, in from_pretrained
    with safe_open(resolved_archive_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

Robot2050 commented 1 month ago

Thank you very much for your attention to our project. Currently, we have primarily tested in Chinese and English. While the Qwen series of LLMs supports multiple languages, its generalization to Latin may be relatively poor. You could consider replacing it with an LLM that has been better trained on Latin.

This issue may be caused by an incomplete download of the model weights. You can use the following commands, or manually download each file one by one. On Ubuntu:

sudo apt-get update
sudo apt-get install git-lfs
git clone https://www.modelscope.cn/Qwen/Qwen2-1.5B-Instruct.git

To check that the download is complete, see the sketch below.
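One way to verify the downloaded shards are intact is to open each .safetensors file and read its header; a truncated download typically fails here with the same HeaderTooLarge error seen above. This is a sketch, not part of the repo, and it assumes the weights were cloned into ./Qwen2-1.5B-Instruct:

```python
# Sketch: verify downloaded .safetensors shards deserialize correctly.
from pathlib import Path
from safetensors import safe_open

for shard in Path("./Qwen2-1.5B-Instruct").glob("*.safetensors"):
    # safe_open parses the file header; a corrupt shard raises here.
    with safe_open(str(shard), framework="pt") as f:
        print(shard.name, "OK,", len(f.keys()), "tensors")
```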

halaneji commented 1 month ago

Thank you so much for your quick response, and congratulations on this great work! I will reinstall the model and see.

halaneji commented 1 month ago

[Screenshot from 2024-10-25 18-01-18: error output]

Robot2050 commented 1 month ago

This error seems to be caused by the system's inability to recognize these characters. Currently, Python packages and models may not be well adapted to languages other than Chinese and English. One feasible workaround is to translate the text into English before chunking it. We apologize for not being able to resolve the issue you are facing; there is still much room for improvement in models and methods when dealing with multiple languages.