Open jomieljaniuk opened 1 year ago
Did you find an answer to this? I'm having the same issue
You have to change prepdocs.py script to split text in range of one page:
# make split_text for each page separately
def split_text(page_map):
SENTENCE_ENDINGS = [".", "!", "?"]
WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]
if args.verbose: print(f"Splitting '{filename}' into sections")
# make split_text for each page separately
for (page_num, _, page_text) in page_map:
length = len(page_text)
start = 0
end = length
while start + SECTION_OVERLAP < length:
Then, each text section as a source will be come from only one page.
You have to change prepdocs.py script to split text in range of one page:
# make split_text for each page separately def split_text(page_map): SENTENCE_ENDINGS = [".", "!", "?"] WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"] if args.verbose: print(f"Splitting '{filename}' into sections") # make split_text for each page separately for (page_num, _, page_text) in page_map: length = len(page_text) start = 0 end = length while start + SECTION_OVERLAP < length:
Then, each text section as a source will be come from only one page.
@jomieljaniuk I'm having the exact same problem too. Does the modified source work well? Can I see the full contents of the modified prepdocs.py file?
Thank you in advance. ^^
Some additional comments :
Hi @hello-lui, I revised the code to create sections only for single page. Moreover, I removed the OVERLAP by setting it to 0. Here the code for people interested on this topic.
def split_text(page_map):
SENTENCE_ENDINGS = [".", "!", "?"]
WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]
# make split_text for each page separately
for (page_num, _, page_text) in page_map:
length = len(page_text)
start = 0
end = length
all_text = page_text
while start + SECTION_OVERLAP < length:
last_word = -1
end = start + MAX_SECTION_LENGTH
if end > length:
end = length
else:
# Try to find the end of the sentence
while (
end < length
and (end - start - MAX_SECTION_LENGTH) < SENTENCE_SEARCH_LIMIT
and all_text[end] not in SENTENCE_ENDINGS
):
if all_text[end] in WORDS_BREAKS:
last_word = end
end += 1
if end < length and all_text[end] not in SENTENCE_ENDINGS and last_word > 0:
end = last_word # Fall back to at least keeping a whole word
if end < length:
end += 1
# Try to find the start of the sentence or at least a whole word boundary
last_word = -1
while (
start > 0
and start > end - MAX_SECTION_LENGTH - 2 * SENTENCE_SEARCH_LIMIT
and all_text[start] not in SENTENCE_ENDINGS
):
if all_text[start] in WORDS_BREAKS:
last_word = start
start -= 1
if all_text[start] not in SENTENCE_ENDINGS and last_word > 0:
start = last_word
if start > 0:
start += 1
section_text = all_text[start:end]
yield (section_text, page_num)
last_table_start = section_text.rfind("<table")
if last_table_start > 2 * SENTENCE_SEARCH_LIMIT and last_table_start > section_text.rfind("</table"):
# If the section ends with an unclosed table, we need to start the next section with the table.
# If table starts inside SENTENCE_SEARCH_LIMIT, we ignore it, as that will cause an infinite loop for tables longer than MAX_SECTION_LENGTH
# If last table starts inside SECTION_OVERLAP, keep overlapping
start = min(end - SECTION_OVERLAP, start + last_table_start)
else:
start = end - SECTION_OVERLAP
if start + SECTION_OVERLAP < end:
yield (all_text[start:end], page_num)
Thank you for the code! It seems to be working well to get the correct citation, but doesn't it also remove the continuity in the text? If I have some text that starts on page x and ends on page x+1, with the updated code I'm only getting the information from page x. With the initial code the answer was complete.
I mean something like this: Page x: .... Here is the list of programming languages:
Page x+1:
With a prompt "Please show me the programming languages" I was getting "Python, Java, C++, Go" with the initial code and only "Python, Java" with this version
Hi @elhele yes you are right but i think the approach to maintain continuity or not dependes by the use case. In my use case, by maintaining continuity, the text was not correctly parsed and also it was difficult to the page the correct page for the citation (the file contains a lot of table and HTML tags). You should decide the approach based on the use case you are working on.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.
This issue is for a: (mark with an
x
)Minimal steps to reproduce
Expected/desired behavior
Mention any other details that might be useful