achen353 / TransformerSum

BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach
GNU General Public License v3.0

Preprocess BillSum Dataset #2

Closed achen353 closed 2 years ago

achen353 commented 2 years ago

TODO

achen353 commented 2 years ago

Preprocessing of Sample Datasets

Processing of the CNN/DM dataset

Documentation

  1. Download the dataset through its respective Python script (from the other repo)
  2. Convert it into train, val, and test splits
  3. Process the data using convert_to_extractive.py

Processing of WikiHow

Documentation

  1. Download through a custom script like CNN/DM and split the dataset
  2. Process the data using convert_to_extractive.py

Processing of the Arxiv/PubMed dataset

Documentation

  1. Download through a custom script like CNN/DM and split the dataset
  2. Process the data using convert_to_extractive.py, except for some minor hyperparameter settings tailored for long text (a sketch of the generic split-then-convert workflow follows below)

Screen Shot 2021-11-07 at 11 13 31 PM
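
For reference, here is a minimal sketch of the generic split-then-convert workflow shared by all three datasets above. The `<split>.source` / `<split>.target` one-document-per-line layout is my assumption about the input convert_to_extractive.py expects; the exact file format and CLI flags should be taken from the TransformerSum docs.

```python
# Hedged sketch: split a downloaded dataset into train/val/test files.
# The <split>.source / <split>.target one-document-per-line layout is an
# assumption about what convert_to_extractive.py consumes.
import random
from pathlib import Path

def write_splits(pairs, out_dir, seed=42, val_frac=0.1, test_frac=0.1):
    """pairs: list of (document_text, summary_text) tuples."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    splits = {
        "test": pairs[:n_test],
        "val": pairs[n_test:n_test + n_val],
        "train": pairs[n_test + n_val:],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.source", "w") as src, open(out / f"{name}.target", "w") as tgt:
            for doc, summ in rows:
                # one document/summary per line so the two files stay aligned
                src.write(doc.replace("\n", " ") + "\n")
                tgt.write(summ.replace("\n", " ") + "\n")
```
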
achen353 commented 2 years ago

Data Preparation for Extractive

Since most summarization datasets provide abstractive summaries as labels (it is easier for human annotators to produce/collect abstractive summaries than extractive ones), the author of TransformerSum provides convert_to_extractive.py. It works with any custom dataset and, very conveniently, with any dataset on HuggingFace Datasets as well.

We can simply take the billsum dataset from HuggingFace Datasets and convert its abstractive references into extractive annotations with convert_to_extractive.py.
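
For example, pulling billsum from HuggingFace Datasets to inspect what we will feed into convert_to_extractive.py (the field names below are from the dataset card; worth double-checking against the version we pull):

```python
from datasets import load_dataset

# BillSum on HuggingFace Datasets; "text" holds the bill body and "summary"
# the human-written abstractive summary (field names per the dataset card).
billsum = load_dataset("billsum")
print(billsum)                   # available splits and their sizes
sample = billsum["train"][0]
print(sample["text"][:300])      # start of the bill text
print(sample["summary"][:300])   # the abstractive reference summary
```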

(In Progress) How convert_to_extractive.py works:

  1. It uses the spaCy processing pipeline to tokenize the data (not sure yet if other operations are applied).
  2. With the tokenized data, it further performs sentence segmentation (identifying the boundaries of each sentence).
  3. Lastly, it measures how relevant each sentence is to the given abstractive summary (still working on understanding the implementation; most likely through some similarity measure) and assigns each sentence a binary label, 1 or 0, for being / not being part of the extractive summary. A rough sketch of this kind of greedy labeling is shown after this list.
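
A minimal sketch of what Step 3 might look like, using spaCy for sentence segmentation and a greedy unigram-overlap "oracle" to pick sentences. This is a common way to derive extractive labels from an abstractive reference, not necessarily TransformerSum's exact implementation, and the similarity measure here is only a placeholder:

```python
# Greedy oracle labeling sketch -- NOT necessarily how convert_to_extractive.py
# does it internally; unigram F1 stands in for whatever similarity it uses.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def unigram_f1(candidate_tokens, summary_tokens):
    overlap = len(set(candidate_tokens) & set(summary_tokens))
    if overlap == 0:
        return 0.0
    precision = overlap / len(set(candidate_tokens))
    recall = overlap / len(set(summary_tokens))
    return 2 * precision * recall / (precision + recall)

def label_sentences(document, abstractive_summary, max_selected=5):
    # Steps 1-2: tokenize and segment the document into sentences
    sents = [[t.text.lower() for t in s if not t.is_punct] for s in nlp(document).sents]
    summary_tokens = [t.text.lower() for t in nlp(abstractive_summary) if not t.is_punct]
    labels = [0] * len(sents)
    selected = []
    for _ in range(max_selected):
        current = unigram_f1(selected, summary_tokens) if selected else 0.0
        best_gain, best_idx = 0.0, None
        for i, sent in enumerate(sents):
            if labels[i]:
                continue
            gain = unigram_f1(selected + sent, summary_tokens) - current
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:  # no remaining sentence improves the overlap
            break
        labels[best_idx] = 1
        selected += sents[best_idx]
    return labels  # 1 = part of the extractive ground truth, 0 = not
```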

Text after tokenization (Step 1) and sentence segmentation (Step 2):

Screen Shot 2021-11-07 at 11 36 43 PM

The binary labels 1 and 0 corresponding to each sentence above (Step 3). Sentences with label 1 will be concatenated and treated as the "extractive" summarization ground truth:

Screen Shot 2021-11-07 at 11 37 58 PM

The original "abstractive" summarization

Screen Shot 2021-11-07 at 11 41 14 PM

Data Preparation for Abstractive

For abstractive summarization, since the billsum dataset already contains abstractive summaries as annotations and is available on HuggingFace, we can simply use abstractive.py to train directly by specifying the dataset name.
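
As a quick sanity check that the billsum summaries are indeed abstractive references (and can be used directly as targets), a small sketch:

```python
from datasets import load_dataset

# Sanity check: BillSum summaries are abstractive, i.e. their sentences are
# generally not copied verbatim from the bill text.
billsum = load_dataset("billsum", split="train")
sample = billsum[0]
summary_sents = [s.strip() for s in sample["summary"].split(".") if s.strip()]
copied = sum(1 for s in summary_sents if s in sample["text"])
print(f"{copied}/{len(summary_sents)} summary sentences appear verbatim in the source")
```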

achen353 commented 2 years ago

@stephanieeechang I fixed convert_to_extractive.py so it now works for BillSum (but in the original TransformerSum fashion, i.e., without taking into account that our data is structured in bullet points, or our idea of concatenation).

Regarding concatenating the bullet points in the BillSum dataset, can you experiment with concatenation at different levels over the weekend? Focus on getting this information (a rough analysis sketch follows the list):

  1. How many layers of bullet points are there (i.e., how deep is the hierarchy)? (minimum, average, and maximum)
  2. What is the length of the text in each layer? (minimum, average, and maximum)
  3. If we concatenate only the LAST (bottom-most) layer of bullet points, what would be the lengths of the concatenated text? (minimum, average, and maximum)
  4. If we concatenate the LAST TWO (bottom-most) layers of bullet points into one chunk, would that work? What would be the lengths of the concatenated text? (minimum, average, and maximum)
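
To get started on the questions above, here is a rough analysis sketch. It assumes the bullet hierarchy can be read off leading whitespace, which may not hold for raw bill text (enumerators like "(a)", "(1)", "(A)" may need a real parser), so treat it as a starting point only:

```python
from statistics import mean

# Assumption: bullet depth can be read off leading whitespace. Real bill text
# may instead need a parser keyed on enumerators such as "(a)", "(1)", "(A)".
def bullet_depths(text, indent_size=2):
    depths = []
    for line in text.splitlines():
        if not line.strip():
            continue
        leading = len(line) - len(line.lstrip(" "))
        depths.append(leading // indent_size)
    return depths

def hierarchy_depth_stats(documents):
    """Question 1: min / average / max nesting depth across documents."""
    max_depths = [max(bullet_depths(doc), default=0) for doc in documents]
    return min(max_depths), mean(max_depths), max(max_depths)

def bottom_chunk_lengths(text, layers=1, indent_size=2):
    """Questions 3-4: lengths (in characters) of chunks formed by concatenating
    the bottom-most `layers` levels of bullets."""
    lines = [l for l in text.splitlines() if l.strip()]
    depths = bullet_depths(text, indent_size)
    if not depths:
        return []
    cutoff = max(depths) - layers + 1
    chunks, current = [], []
    for line, depth in zip(lines, depths):
        if depth >= cutoff:
            current.append(line.strip())
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return [len(chunk) for chunk in chunks]
```
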
achen353 commented 2 years ago

Also take screenshots or give examples of the concatenated result

achen353 commented 2 years ago

@stephanieeechang @andywang268 What we discussed today is a little similar to this:
Paper: https://arxiv.org/pdf/2004.06190.pdf
GitHub: https://github.com/AlexGidiotis/DANCER-summ

Each of our training examples (each document) is already divided into sections that are likely to be under 512 tokens. Given a document d = {s_1, s_2, ..., s_n}, where s_i is one of its n sections, and the target abstractive summary S_a = {t_1, t_2, ..., t_m}, where t_j is one of the m summary sentences, we can use a similar strategy to assign each t_j to a document section s_i such that each t_j maps to a single s_i but each s_i can correspond to multiple t_j's, which together form that section's summary. We then use TransformerSum's built-in abstractive-to-extractive conversion to label each sentence in each section with 0 or 1 as the extractive section-summary ground truth. After BERT produces an extractive summary for each section, we simply concatenate the section summaries together. A rough sketch of the assignment step is below.
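
A rough sketch of the assignment step (mapping each target sentence t_j to a single section s_i). Plain unigram overlap is used as the similarity measure here as a stand-in; a ROUGE-style score would be closer to the DANCER paper:

```python
# Sketch of the DANCER-style assignment step: map each target-summary sentence
# t_j to the single section s_i it overlaps most with, so each section ends up
# with its own (possibly empty) section-level target summary.
def assign_summary_sentences(sections, summary_sentences):
    """sections: list of section strings s_1..s_n.
    summary_sentences: list of target sentences t_1..t_m.
    Returns section_summaries, where section_summaries[i] holds the t_j assigned to s_i."""
    section_tokens = [set(s.lower().split()) for s in sections]
    section_summaries = [[] for _ in sections]
    for sent in summary_sentences:
        sent_tokens = set(sent.lower().split())
        overlaps = [len(sent_tokens & tokens) for tokens in section_tokens]
        best_section = max(range(len(sections)), key=lambda i: overlaps[i])
        section_summaries[best_section].append(sent)
    return section_summaries
```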