Closed. forrestbao closed this issue 1 year ago.
It should involve a change in evalbase (the `str` for every doc -> a newly defined `DocStr`, including a `str` and a segmented `List[str]`).
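A minimal sketch of what such a `DocStr` could look like, based only on the description above (a raw string plus its segmented sentences); the field names and the `dataclass` choice are assumptions, not the actual EvalBase definition:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DocStr:
    """Hypothetical container pairing a raw document with its
    pre-segmented sentences, so downstream metrics do not need
    to re-run sentence segmentation."""
    text: str             # the full document as one string
    sentences: List[str]  # the same document split into sentences

doc = DocStr(
    text="BERTScore works well. MNLI needs sentence pairs.",
    sentences=["BERTScore works well.", "MNLI needs sentence pairs."],
)
```

Metrics that operate on whole documents would read `doc.text`, while sentence-level metrics would read `doc.sentences` directly.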
Can you elaborate? What is the newly defined `DocStr`? Is it defined in DocAsRef or in EvalBase? If it's easier to do than to explain, just modify EvalBase (in a fork or branch; do NOT commit to the main branch directly) and open a PR?
I noticed some redundancy in our code at these two steps. For example, in `topk` we segment sentences. Then in `bertscore-sentence` and `mnli` (so far, `mnli` has only a sentence-level version), sentence segmentation happens again. I also noticed that for extended periods of time the CPU is busy while the GPU sits idle, so I am afraid we may be wasting a lot of time on tokenization and segmentation.

Since nearly all our approaches require word tokenization and sentence segmentation, maybe we should have a single preprocessing step for these two?
This would allow us to quickly try out different approaches. Of course, some models use their own word tokenization, especially those based on Transformers, because they do subword tokenization and need to map tokens to integer IDs. In those cases, we can skip our preprocessing results.
Something like this, for example: given a raw document, the preprocessing step shall return both its sentence segmentation and its word tokenization.
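One possible shape for this shared preprocessing step, as a sketch only: the function name `preprocess`, the return shape, and the regex-based sentence splitter are all assumptions (a real implementation would use a proper segmenter/tokenizer), not code from the repo:

```python
import re
from typing import List, Tuple

def preprocess(doc: str) -> Tuple[List[str], List[List[str]]]:
    """Hypothetical one-time preprocessing: return
    (sentences, tokens_per_sentence) so that topk,
    bertscore-sentence, and mnli can all reuse the results.
    The regex split is a stand-in for a real sentence segmenter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
    tokens = [s.split() for s in sentences]
    return sentences, tokens

sents, toks = preprocess("The cat sat. The dog barked.")
# sents == ["The cat sat.", "The dog barked."]
# toks  == [["The", "cat", "sat."], ["The", "dog", "barked."]]
```

Transformer-based metrics that need subword IDs would bypass `toks` and run their own tokenizer on `sents` (or on the raw document), as noted above.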