CarperAI / Code-Pile

This repository contains all the code for collecting large amounts of code from GitHub.
MIT License

Postprocessing and Formatting of Datasets #14

Open ncoop57 opened 1 year ago

ncoop57 commented 1 year ago

This issue focuses on collecting ideas and formalizing the postprocessing steps and formatting of data instances for datasets in different categories, e.g., forums, articles, books, etc.

Initial draft of postprocessing:

  1. Exact duplicate removal
  2. Near-duplicate removal
  3. Removal of specific HTML tags
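
To make the three draft steps concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the pipeline this repository uses. All function names, the SHA-256 hashing, the word-shingle Jaccard measure, the 0.8 threshold, and the default set of dropped tags are assumptions made for the example. The near-duplicate check compares every document against every kept document, so it only suits small collections; a real run would likely need something like MinHash with LSH.

```python
import hashlib
import re
from html.parser import HTMLParser


def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dedup(docs, threshold=0.8, n=3):
    """Drop a document if it is too similar to one already kept.

    The threshold and shingle size are assumed values for illustration.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc, n)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


class TagStripper(HTMLParser):
    """Collect text, skipping everything inside the tags in drop_tags."""

    def __init__(self, drop_tags):
        super().__init__(convert_charrefs=True)
        self.drop_tags = set(drop_tags)
        self.depth = 0  # greater than 0 while inside a dropped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_html(text, drop_tags=("script", "style")):
    """Remove all markup, plus the contents of the tags in drop_tags."""
    parser = TagStripper(drop_tags)
    parser.feed(text)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.parts).split())


def postprocess(docs):
    """Apply the three draft steps in the order listed above."""
    docs = near_dedup(exact_dedup(docs))
    return [strip_html(doc) for doc in docs]
```

Calling `postprocess(list_of_strings)` returns the cleaned documents. One design question this sketch leaves open is ordering: stripping HTML before deduplication would let two pages that differ only in markup be detected as duplicates.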

Questions for formatting:

  1. How to format forums?
  2. How to format general website articles?
  3. How to format books?