Create a function to fit inputs whose token count exceeds the limits of the translation and summarization models

CoderHung commented 4 months ago

The translation model currently in use (mTet) has a limit of 512 tokens, and the current summarization model has a limit of 1024 tokens. Both are far below the target of 2000 tokens. If an input's token count exceeds the limit, the output is heavily affected. We need a way to fit the input into the model without degrading output quality.

CoderHung commented 4 months ago

Solution for the translation model

Code to split the input into segments of roughly 300 tokens each

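import nltk

# Requires NLTK's Punkt sentence tokenizer data to be downloaded.
sentences = nltk.sent_tokenize(text)
segments = []
segment = []
segment_token_count = 0
threshold_token_count = 300
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    segment_token_count += len(tokens)
    # Close the current segment once this sentence pushes the count to the threshold.
    # The "segment" check avoids emitting an empty segment when the very first
    # sentence alone reaches the threshold.
    if segment and segment_token_count >= threshold_token_count:
        segments.append(" ".join(segment))
        segment = []
        segment_token_count = len(tokens)
    segment.append(sentence)
# Flush whatever is left after the last sentence.
if segment:
    segments.append(" ".join(segment))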
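The splitting code above counts NLTK word tokens, which will generally differ from the subword tokens the model's own tokenizer produces, so the 300 threshold presumably leaves headroom under the 512 limit. As a minimal sketch of the same greedy idea with a pluggable token counter, the function below packs whole sentences into segments. The names `split_into_segments` and `whitespace_count` are hypothetical, and the regex sentence splitter is only a stand-in for `nltk.sent_tokenize`.

```python
import re
from typing import Callable, List


def split_into_segments(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Greedily pack whole sentences into segments of at most max_tokens tokens.

    A single sentence longer than max_tokens still becomes its own
    (oversized) segment; this sketch does not split inside sentences.
    """
    # Naive sentence splitter, a stand-in for nltk.sent_tokenize.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: List[str] = []
    current: List[str] = []
    current_count = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush before adding a sentence that would overflow the budget.
        if current and current_count + n > max_tokens:
            segments.append(" ".join(current))
            current, current_count = [], 0
        current.append(sentence)
        current_count += n
    if current:
        segments.append(" ".join(current))
    return segments


def whitespace_count(s: str) -> int:
    """Assumed stand-in counter: number of whitespace-separated words."""
    return len(s.split())
```

Passing a counter built on the model's own tokenizer instead of `whitespace_count` would make the budget reflect what the model actually sees.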

Code to translate each segment one at a time and join the results into the final translation

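translated_segments = []
for segment in segments:
    # Prepend the language prefix (Language is defined elsewhere) and tokenize.
    input_ids = tokenizer(
        Language + segment, return_tensors="pt", padding=True
    ).input_ids.to("cuda")
    outputs = model.generate(input_ids, max_length=512)
    # Decode the output and drop its first 4 characters
    # (presumably the language prefix in the generated text).
    translated_segments.append(
        tokenizer.batch_decode(outputs, skip_special_tokens=True)[0][4:]
    )
# Join the translated segments in their original order.
final = " ".join(translated_segments)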

With this approach the input can have any token count; a higher token count only increases the running time.
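The running-time point follows from the structure of the loop: there is one model call per segment, so the cost grows roughly linearly with the number of segments. A minimal sketch of that structure, with the model call injected as a function, is shown here. `translate_segments` and `fake_translate` are hypothetical names; `fake_translate` only uppercases text so the example runs without a model.

```python
from typing import Callable, List


def translate_segments(
    segments: List[str], translate_one: Callable[[str], str]
) -> str:
    """Translate each segment independently and join the results in order."""
    # One call to translate_one per segment: cost scales with len(segments).
    return " ".join(translate_one(segment) for segment in segments)


def fake_translate(segment: str) -> str:
    """Stand-in for a real model call, for demonstration only."""
    return segment.upper()
```

Because each segment is translated on its own, context that spans a segment boundary is not visible to the model, which may affect wording near the boundaries.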

CoderHung commented 4 months ago

Solution for the summarization model

Code to split the input into chunks that overlap with each other (sliding window chunking method)

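# Import path may differ depending on the installed LangChain version.
from langchain.text_splitter import NLTKTextSplitter


def nltk_chunk(text, chunk_length, overlap_size):
    """Sliding window chunking method using NLTKTextSplitter.

    Args:
        text (str): input text
        chunk_length (int): maximum chunk length (characters, with the
            splitter's default length function)
        overlap_size (int): the overlap size between consecutive chunks

    Returns:
        List[str]: list of chunks of type str
    """

    # e.g. chunk_length=2000, overlap_size=1000
    nltk_splitter = NLTKTextSplitter(
        separator=" ", chunk_size=chunk_length, chunk_overlap=overlap_size
    )
    splits = nltk_splitter.split_text(text)
    return splits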
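To make the sliding window idea concrete, here is a simplified word-level sketch in plain Python. It is not what `NLTKTextSplitter` does internally (that splitter works on sentences and merges them up to the chunk size); the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share `overlap_size` items.

```python
from typing import List


def sliding_window_chunks(
    words: List[str], chunk_length: int, overlap_size: int
) -> List[List[str]]:
    """Return windows of chunk_length items; neighbors share overlap_size items."""
    if not 0 <= overlap_size < chunk_length:
        raise ValueError("overlap_size must be in [0, chunk_length)")
    step = chunk_length - overlap_size  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_length])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_length >= len(words):
            break
    return chunks
```

With 10 items, `chunk_length=4`, and `overlap_size=2`, the window advances 2 items at a time, so each chunk repeats the last 2 items of the previous one.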

Code to summarize each chunk and combine the summaries into the final input

def summarize_chunks(chunks, chunk_size, max_value, min_value):
    """Summarize each chunk, returns combined chunk summaries

    Args:
        chunks (List[str]): list of chunk strings
        chunk_size (int): minimum chunk length in characters; shorter chunks are skipped
        max_value (int): max output token length
        min_value (int): min output token length

    Returns:
        str: the combined chunk summaries
    """
    summarized_chunks = []
    for chunk in chunks:
        # Skip chunks shorter than chunk_size characters.
        if len(chunk) < chunk_size:
            continue
        # summarizer is the summarization pipeline defined elsewhere.
        summarized_chunk = summarizer(
            chunk, max_length=max_value, min_length=min_value, do_sample=False
        )
        summarized_chunks.append(summarized_chunk[0]["summary_text"])
    return " ".join(summarized_chunks)

With this approach the input can have a token count between 1024 and 2048; inputs with more than 2048 tokens will still fail.
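One possible way to go past that ceiling, not something implemented in this thread, is to repeat the chunk-and-summarize step until the combined text fits the model limit. The sketch below assumes this iterative scheme; `summarize_until_fits` and all of its parameters are hypothetical, and the chunking, summarizing, and token counting functions are injected so that the real ones can be plugged in.

```python
from typing import Callable, List


def summarize_until_fits(
    text: str,
    limit: int,
    count_tokens: Callable[[str], int],
    chunk: Callable[[str], List[str]],
    summarize: Callable[[str], str],
    max_rounds: int = 5,
) -> str:
    """Repeatedly chunk and summarize until text fits within limit tokens."""
    for _ in range(max_rounds):
        if count_tokens(text) <= limit:
            return text
        # One round: summarize every chunk and join the partial summaries.
        text = " ".join(summarize(c) for c in chunk(text))
    # Rounds exhausted: return the last result, which may still exceed limit.
    return text
```

The `max_rounds` cap is there because nothing guarantees that a round shrinks the text; each extra round also compresses the content further, so some detail is likely lost.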

96ers / summerizIT

Create a function to fit inputs whose token count exceeds the limits of the translation and summarization models #38