achen353 / TransformerSum

BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach
GNU General Public License v3.0

Fix #5 to extractive bug #6

Closed achen353 closed 2 years ago

achen353 commented 2 years ago

Context

#5

Summary

  1. Adapt convert_to_extractive.py to BillSum and set its default arguments for BillSum
  2. Add code to deterministically split the train split of the original BillSum into smaller train and valid splits (ratio = 8:2)
  3. Add helper functions to remove \n and extra whitespace as the very first preprocessing step (before the text enters the spaCy pipeline for tokenization and other preprocessing)
  4. Update README
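
For reference, items 2 and 3 could be sketched roughly as below. This is only an illustration, not the code in this PR: the function names `clean_text` and `split_train_valid`, the seed value, and the shuffle-then-cut strategy are all assumptions made for the example.

```python
import random
import re


def clean_text(text: str) -> str:
    """Hypothetical helper: replace newlines and collapse runs of
    whitespace into single spaces before any further preprocessing."""
    return re.sub(r"\s+", " ", text).strip()


def split_train_valid(examples, train_ratio=0.8, seed=42):
    """Hypothetical helper: split a list into train and valid parts.

    A private, seeded random.Random instance makes the shuffle
    reproducible, so the same input always yields the same split.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_ratio)
    train = [examples[i] for i in indices[:cut]]
    valid = [examples[i] for i in indices[cut:]]
    return train, valid


if __name__ == "__main__":
    docs = [f"SECTION {i}.\n   Short   title. " for i in range(10)]
    cleaned = [clean_text(d) for d in docs]
    train, valid = split_train_valid(cleaned)
    print(len(train), len(valid))
```

Using a dedicated `random.Random(seed)` rather than the module-level generator keeps the split independent of any other random calls made elsewhere in the script.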