Unclear about the llama-index chunking issue described in the paper

BUAADreamer / EasyRAG

Easy-to-Use RAG Framework; CCF AIOps International Challenge 2024 Top3 Solution (third place)

https://arxiv.org/abs/2410.10315

MIT License

182 stars 23 forks

Unclear about the llama-index chunking issue described in the paper #5

Closed weibingo closed 3 weeks ago

weibingo commented 3 weeks ago

The paper says: "we found that the original implementation of llama-index used a simple but unstable method of handling path information, subtracting the file path length from the text length to determine the actual text length used. This approach could cause different segmentation results with the same chunksize and chunk-overlap, depending on the data path."

I don't quite follow this. Could you illustrate it with a concrete text example?

BUAADreamer commented 3 weeks ago

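llama-index decides how to chunk based on both the current file path and the text length. Take the same x.txt file stored at /home/path/x.txt and at /home/path/dreamer/x.txt: only its location differs, yet the two can produce completely different chunking results. In a real deployment this can make retrieval, generation, and testing inconsistent with one another (the paths have to match exactly for results to be reproducible).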
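A minimal sketch of that effect, assuming the splitter deducts the length of the path string from the chunk budget. `toy_split` is a hypothetical character-level splitter written only for illustration; it is not llama-index code, and the real library is more involved.

```python
def toy_split(text: str, file_path: str,
              chunk_size: int = 40, chunk_overlap: int = 5) -> list[str]:
    """Hypothetical toy splitter; NOT the llama-index implementation."""
    # Assumption: the path string is counted against the chunk budget,
    # so a longer path leaves less room for actual text in each chunk.
    effective_size = chunk_size - len(file_path)
    if effective_size <= chunk_overlap:
        raise ValueError("the path leaves no room for text in a chunk")
    step = effective_size - chunk_overlap
    return [text[i:i + effective_size] for i in range(0, len(text), step)]


text = "abcdefghijklmnopqrstuvwxyz" * 4  # same content in both locations

chunks_a = toy_split(text, "/home/path/x.txt")
chunks_b = toy_split(text, "/home/path/dreamer/x.txt")

# Same text, same chunk_size and chunk_overlap, different chunks.
print(len(chunks_a), len(chunks_b))
print(chunks_a[0])
print(chunks_b[0])
```

Only the path argument changes between the two calls, so any difference in the chunks comes from where the file lives rather than from what it contains.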

BUAADreamer commented 3 weeks ago

So our main change is in how file_path is obtained: wherever the data is placed, we always use the path relative to the root directory. See https://github.com/BUAADreamer/EasyRAG/blob/51964f57029ab0bc1a581303c2af33eec100748d/src/easyrag/custom/transformation.py#L69C1-L70C1
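The idea can be sketched as follows. This is not the code at the linked line; `relative_file_path` and the `data_root` parameter are hypothetical names, and `posixpath` is used so the example behaves the same on any platform.

```python
import posixpath


def relative_file_path(file_path: str, data_root: str) -> str:
    """Hypothetical helper: express file_path relative to data_root."""
    # Assumption: the caller knows the root directory of the data set.
    return posixpath.relpath(file_path, start=data_root)


# The same file deployed under two different roots.
path_a = relative_file_path("/home/path/x.txt", "/home/path")
path_b = relative_file_path("/home/path/dreamer/x.txt", "/home/path/dreamer")

# Both normalize to the same string, so any path-dependent
# chunk budget is now identical across deployments.
print(path_a, path_b)
```

Because the string handed to the splitter no longer depends on the absolute location, moving the data directory should leave the chunk boundaries unchanged.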

weibingo commented 3 weeks ago

OK, I'll take a look.

BUAADreamer commented 2 weeks ago

You can also look at the relevant part of the blog post: https://zhuanlan.zhihu.com/p/7272025344
