bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

add scripts to extract text and HTML metadata from the column `html_str` #370

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

This PR adds the Python and SLURM scripts that extract the text, the metadata_HTML, the html_header, the html_footer and the html_title from `html_str` into new columns.
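To make the idea of filling new columns from `html_str` concrete, here is a minimal sketch using only Python's standard library. It is not the code added by this PR: the class `SimpleHTMLExtractor` and the function `extract_columns` are hypothetical names, the choice to read `html_header` and `html_footer` from `<header>` and `<footer>` elements is an assumption, and `metadata_HTML` is left out because its format is not described in this thread.

```python
from html.parser import HTMLParser


class SimpleHTMLExtractor(HTMLParser):
    """Hypothetical helper: collects the title and the visible text of a page."""

    SKIPPED = {"script", "style"}  # contents of these tags are never kept

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.parts = {"header": [], "footer": [], "text": []}
        self._stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        data = data.strip()
        if not data or self.SKIPPED & set(self._stack):
            return
        if "title" in self._stack:
            self.title_parts.append(data)
        elif "header" in self._stack:
            self.parts["header"].append(data)
        elif "footer" in self._stack:
            self.parts["footer"].append(data)
        else:
            self.parts["text"].append(data)


def extract_columns(html_str):
    """Return the new column values for one example (assumed column names)."""
    parser = SimpleHTMLExtractor()
    parser.feed(html_str)
    parser.close()
    return {
        "text": " ".join(parser.parts["text"]),
        "html_title": " ".join(parser.title_parts),
        "html_header": " ".join(parser.parts["header"]),
        "html_footer": " ".join(parser.parts["footer"]),
    }
```

Calling `extract_columns` on one HTML string returns a dict whose keys can be written out as new columns next to `html_str`.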

cc @thomasw21

SaulLu commented 2 years ago

@thomasw21

> I'm confused, is this new code?

Yes, it's only new code because I needed to adapt it to our dataset (to our custom column names and to the fact that not all of our examples have a value in the `html_str` column). :slightly_smiling_face:

I did copy and paste sections of code from the metadata WG, but they are modified to work in our case. The only snippet that may be unnecessary is the one for `ErrorWrapperPreprocessor`, but I didn't try simply importing it from `bsmetadata`.
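The two adaptations mentioned above (examples without a value in `html_str`, and a wrapper around preprocessing) can be illustrated with a short sketch. This is not the `ErrorWrapperPreprocessor` from `bsmetadata`, whose implementation is not shown in this thread; reading its name as "catch errors per example so that one bad page does not stop the run" is an assumption. The function `safe_apply` and the column `extraction_error` are hypothetical.

```python
def safe_apply(extract, examples, column="html_str"):
    """Apply `extract` to each example, tolerating missing values and errors.

    `extract` is any callable taking an HTML string and returning the text.
    """
    results = []
    for example in examples:
        row = dict(example)  # copy, so the input examples are not mutated
        html_str = example.get(column)
        if html_str is None:
            # No HTML for this example: leave the new columns empty.
            row.update({"text": None, "extraction_error": None})
        else:
            try:
                row.update({"text": extract(html_str), "extraction_error": None})
            except Exception as exc:
                # Record the failure on the row instead of aborting the run.
                row.update({"text": None, "extraction_error": repr(exc)})
        results.append(row)
    return results
```

Keeping the error message in its own column makes it possible to count and inspect failures afterwards without rerunning the job.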

thomasw21 commented 2 years ago

@HugoLaurencon I don't know if you have a script that actually merges PRs. I don't think we should have merged this (unless you reviewed it?)