redundancy in word tokenization and sentence segmentation

forrestbao commented 1 year ago

I noticed some redundancy in our code at these two steps. For example, in topk, we segment sentences. Then in bertscore-sentence and mnli (so far mnli has only a sentence-level version), sentence segmentation happens again. I also noticed that for extended periods of time, the CPU is busy while the GPU sits idle. So I am afraid that we may be wasting a lot of time on tokenization and segmentation.

Since nearly all our approaches require word tokenization and sentence segmentation, maybe we should have a preprocessing step for these two?

This will allow us to quickly try out different approaches. Of course, some models use their own word tokenization, especially those based on Transformers, because they do subword tokenization and need to map tokens to integer IDs. In those cases, we can skip our preprocessing results.

Something like this

tokenize_and_segment(docs: List[str]) -> Tuple[List[List[List[str]]], List[List[str]]]:

For example:

tokenize_and_segment([
    "I am happy. Today is Sunday. ",  # first document
    "I am sad. Pizza is cold. ",  # second document
])

shall return

(
[    # words
    [["I", "am", "happy"], ["Today", "is", "Sunday"]],  # first document's words
    [["I", "am", "sad"], ["Pizza", "is", "cold"]],  # second document's words
],
[  # sentences
    ["I am happy.", "Today is Sunday."],  # first document's sentences
    ["I am sad.", "Pizza is cold."],  # second document's sentences
]
)
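
A minimal sketch of what such a preprocessing function could look like, using only the standard library. The regex-based sentence splitter and word pattern are stand-ins I am assuming for illustration; a real version would presumably call whichever segmenter and tokenizer the project already uses. Note that in this sketch the word lists drop punctuation, which is also an assumption.

```python
import re
from typing import List, Tuple

# Assumption: naive regexes stand in for a real sentence segmenter / word tokenizer.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")  # whitespace that follows ., ! or ?
_WORD = re.compile(r"\w+(?:'\w+)?")                # words, keeping simple contractions


def tokenize_and_segment(
    docs: List[str],
) -> Tuple[List[List[List[str]]], List[List[str]]]:
    """Segment each document into sentences, and each sentence into words.

    Returns (words, sentences), where words[i][j] is the word list of
    sentence j in document i, and sentences[i][j] is that sentence's text.
    """
    all_words: List[List[List[str]]] = []
    all_sentences: List[List[str]] = []
    for doc in docs:
        # Split once per document; downstream metrics reuse this result.
        sentences = [s for s in _SENTENCE_BOUNDARY.split(doc.strip()) if s]
        all_sentences.append(sentences)
        all_words.append([_WORD.findall(s) for s in sentences])
    return all_words, all_sentences


words, sentences = tokenize_and_segment(
    [
        "I am happy. Today is Sunday. ",  # first document
        "I am sad. Pizza is cold. ",      # second document
    ]
)
```

Running this once up front and passing `words` and `sentences` to each metric would avoid re-segmenting the same documents in topk, bertscore-sentence, and mnli.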

TURX commented 1 year ago

This would require a change in EvalBase: each document, currently a plain str, would become a newly defined DocStr that holds both the raw str and a segmented List[str].

forrestbao commented 1 year ago

Can you elaborate? What is the newly defined DocStr? Is it defined in DocAsRef or in EvalBase? If it's easier to do than to explain, just modify EvalBase (in a fork or a branch; do NOT do it in the main branch directly) and make a PR?
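
One possible reading of the DocStr idea, purely as a sketch: a small container that carries the raw text together with its pre-segmented sentences, so metrics can reuse the segmentation. The field names, the use of a dataclass, and the `sentence_overlap` toy metric are all assumptions for illustration; none of them come from EvalBase or DocAsRef.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DocStr:
    """Hypothetical container: raw document text plus its segmented sentences."""

    text: str
    sentences: Tuple[str, ...]

    def __str__(self) -> str:
        # Code that only needs the raw string can still call str(doc).
        return self.text


def sentence_overlap(candidate: DocStr, reference: DocStr) -> float:
    """Toy metric (assumption): share of candidate sentences found verbatim
    in the reference. It reads .sentences and never re-segments the text."""
    if not candidate.sentences:
        return 0.0
    reference_set = set(reference.sentences)
    hits = sum(1 for s in candidate.sentences if s in reference_set)
    return hits / len(candidate.sentences)


reference = DocStr(
    text="I am happy. Today is Sunday.",
    sentences=("I am happy.", "Today is Sunday."),
)
candidate = DocStr(
    text="I am happy. Pizza is cold.",
    sentences=("I am happy.", "Pizza is cold."),
)
score = sentence_overlap(candidate, reference)
```

With something along these lines, segmentation would happen once when the DocStr is built, and each metric would take DocStr arguments instead of plain strings.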

SigmaWe / DocAsRef

redundancy in word tokenization and sentence segmentation #7