achen353 / TransformerSum

BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach

GNU General Public License v3.0


Fix #5 to extractive bug #6

Closed achen353 closed 2 years ago

achen353 commented 2 years ago

Context

#5

Summary

Adapt convert_to_extractive.py to BillSum and set its default arguments for BillSum
Add code to deterministically split the train split of the original BillSum into smaller train and valid splits (ratio = 8 : 2)
Add helper functions that remove \n and extra whitespace as the very first preprocessing step (before the text enters the spaCy pipeline for tokenization and other preprocessing)
Update README
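
The whitespace cleanup and the deterministic 8 : 2 split described above could look roughly like the sketch below. This is an illustration only, not the code in this PR: the names `clean_text` and `split_train_valid`, the `seed` parameter, and the choice of a seeded shuffle are all assumptions.

```python
import random
import re


def clean_text(text: str) -> str:
    """Hypothetical helper: replace newlines and runs of whitespace with one space."""
    return re.sub(r"\s+", " ", text).strip()


def split_train_valid(examples, valid_ratio=0.2, seed=42):
    """Hypothetical helper: split examples into (train, valid) reproducibly.

    A private random.Random(seed) instance shuffles the indices, so the same
    input and seed always give the same split without touching global state.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_valid = int(len(examples) * valid_ratio)
    valid_idx = set(indices[:n_valid])
    # Preserve the original ordering inside each split.
    train = [ex for i, ex in enumerate(examples) if i not in valid_idx]
    valid = [ex for i, ex in enumerate(examples) if i in valid_idx]
    return train, valid


if __name__ == "__main__":
    # Toy stand-ins for BillSum records; cleaning happens before any tokenization.
    docs = [clean_text(f"SECTION {i}.\n  Short   title.\n") for i in range(10)]
    train, valid = split_train_valid(docs)
    print(len(train), len(valid))
```

Using a seeded, local random generator rather than the global `random` state keeps the split stable across runs, which is the property the summary calls "deterministic".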