Open thomwolf opened 3 years ago
I have the same issue after inspecting the data downloaded from http://eaidata.bmk.sh/data/github_small.jsonl.zst. It seems the value of the 'file_name' key is identical for every repo.
This is a bug caused by https://github.com/EleutherAI/github-downloader/blob/345e7c4cbb9e0dc8a0615fd995a08bf9d73b3fe6/download_repo_text.py#L201C25-L201C49
They append a reference to the same dict every time, so only the name and type of the last file are stored in meta.
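To illustrate the kind of aliasing described above, here is a minimal sketch. It is not the actual code from download_repo_text.py: the function names, the `file_type` key, and the sample file list are all hypothetical, and only `file_name` comes from the dataset discussed here.

```python
# Hypothetical reproduction of the aliasing pattern; not the real downloader code.

def collect_meta_buggy(files):
    """Reuses one dict, so every list entry points at the same object."""
    meta = {}
    out = []
    for name, ftype in files:
        meta["file_name"] = name   # overwrites the previous file's name
        meta["file_type"] = ftype  # "file_type" is an assumed key name
        out.append(meta)           # appends a reference, not a copy
    return out


def collect_meta_fixed(files):
    """Builds a fresh dict per file, so each entry keeps its own values."""
    out = []
    for name, ftype in files:
        out.append({"file_name": name, "file_type": ftype})
    return out


# Made-up sample input for illustration only.
files = [
    ("README.md", "markdown"),
    ("CHANGELOG.md", "markdown"),
    ("jadx_termux.sh", "shell"),
]

buggy = collect_meta_buggy(files)
fixed = collect_meta_fixed(files)

print([m["file_name"] for m in buggy])
# ['jadx_termux.sh', 'jadx_termux.sh', 'jadx_termux.sh']
print([m["file_name"] for m in fixed])
# ['README.md', 'CHANGELOG.md', 'jadx_termux.sh']
```

If the real code follows this pattern, creating a new dict per file (or appending `dict(meta)`) would be one way to avoid it.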
Hi,
Apologies if this is not the right place to note this, but after downloading and exploring the preprocessed GitHub part of The Pile, I've noticed that the `file_name` metadata is sometimes a little off, which can make it a bit harder to filter files based on file extension.

For instance, in the first sample of `data_114_time1601108762_default.jsonl`, downloaded from https://the-eye.eu/public/AI/pile_preliminary_components/, `file_name` is indicated to be `jadx_termux.sh`, but the content appears to be an extract from the changelog of the same repo.

Not sure how important this is for people here, but maybe it should be mentioned somewhere?