bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

Add filtering to the near deduplicated safe dataset #5

Closed loubnabnl closed 2 years ago

loubnabnl commented 2 years ago

Filtering reduces the dataset size from 74GB to 59GB. The filtered dataset is available at https://huggingface.co/datasets/BigCode/python_safe_license_dedup_and_filter
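
For a rough sense of how much data the filtering step removes, the two sizes quoted in the comment can be turned into a relative reduction. This is only a back-of-the-envelope sketch: `relative_reduction` is a hypothetical helper, not part of the repository, and the inputs are the rounded 74GB and 59GB figures.

```python
def relative_reduction(before_gb: float, after_gb: float) -> float:
    """Return the fraction of data removed (hypothetical helper, not from the repo)."""
    if before_gb <= 0:
        raise ValueError("before_gb must be positive")
    return (before_gb - after_gb) / before_gb


# Rounded sizes quoted in the comment, in GB.
size_before_gb = 74
size_after_gb = 59

removed_gb = size_before_gb - size_after_gb
fraction = relative_reduction(size_before_gb, size_after_gb)
print(f"removed {removed_gb} GB ({fraction:.1%} of the original)")
# prints: removed 15 GB (20.3% of the original)
```

In other words, the filters drop roughly a fifth of the near-deduplicated data by size.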