bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
74 stars 48 forks

add files to compute basic stats on pseudo crawl dataset #382

Open SaulLu opened 2 years ago

SaulLu commented 2 years ago

As discussed with @thomasw21, this PR adds basic Slurm and Python scripts to compute an intermediary metadata dataset and some statistics for the Pseudo Crawl dataset.

HugoLaurencon commented 2 years ago

@SaulLu can you edit and merge? Thanks!

SaulLu commented 2 years ago

@HugoLaurencon, do you need it urgently?

HugoLaurencon commented 2 years ago

@SaulLu No, no worries! It was just to tag you in case you didn't see the comments from Thomas.