bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

Add filtering to the near deduplicated safe dataset #5

Closed by loubnabnl 1 year ago

loubnabnl commented 1 year ago

The filtered dataset is available at https://huggingface.co/datasets/BigCode/python_safe_license_dedup_and_filter (the size goes from 74GB to 59GB).