EleutherAI / the-pile

Meta data `file_name` in the GitHub part of The Pile a bit off #90

Open thomwolf opened 3 years ago

thomwolf commented 3 years ago

Hi,

Apologies if this is not the right place to note this, but after downloading and exploring the preprocessed GitHub part of The Pile, I've noticed that the `file_name` metadata values are sometimes a little off, which can make it a bit harder to filter files based on file extension.

For instance, in the first sample of `data_114_time1601108762_default.jsonl` (downloaded from https://the-eye.eu/public/AI/pile_preliminary_components/), `file_name` is given as `jadx_termux.sh`, but the text appears to be an extract from the changelog of the same repo.

Not sure how important this is for people here but maybe it should be mentioned somewhere?

{
 "text": "## [1.1]\n### Added\n- Added Update() for auto-update\n\n## [1.2]\n### Added\n- extra flag or option `-a` to use __aapt2__ instead of __aapt__.\n- Issue template\n### Changed\n- use getopts for parameters handling\n### Fixed\n- fix update()\n\n## [1.3]\n### Added\n- add aapt2 to bind()\n### Fixed\n- set `LD_LIBRARY_PATH` to avoid libraries access from termux i.e `$PREFIX/lib`\n\n## [1.4]\n### Added\n- patched binaries of aapt2 to skip invalid names while recompiling\n### Fixed\n- fixes #10\n\n## [1.5]\n### Changed\n- stick to alpine v3.10.2 instead of latest one\n\n## [1.6]\n### Added\n- custom path of framework directory\n- new flag `-V` to enable verbose mode for decompiling & recompiling only\n### Changed\n- update apktool to 2.4.1 \n- remove framework app __1.apk__ after each decompiling\n\n## [1.7]\n### Added\n- new option `--no-res` to decompile app except resources.\n- new option `--no-smali` to prevent disassembly of the dex file(s)\n\n## [1.8]\n### Added\n- new option `--no-assets` to prevent decoding of unknown assets files\n- `-z` for zipalign\n- `--frame-path` to specify framework directory\n- `-R` recompile + sign\n\n## [1.9]\n### Added\n- new option `--enable-perm` to enable all permissions automatically in binded or non binded payloads\n\n## [2.0]\n### Added\n- Kali support\n### Changed\n- remove option `-a` & defaults to `aapt2`\n\n## [2.1]\n### Added \n- jadx support\n- new option `--to-java` to decode [dex,apk,zip] to java sources\n- `--deobf` can use along with `--to-java`\n\n## [2.2]\n### Changed\n- now apksigner in termux is from sdk so a key ( PKCS12 ) is added.\n",
 "meta":
   {"repo_name": "Hax4us/Apkmod",
    "stars": "114",
    "repo_language": "Shell",
    "file_name": "jadx_termux.sh",
    "mime_type": "text/plain"}
}

UniverseFly commented 1 year ago

I have the same issue after inspecting the data downloaded from http://eaidata.bmk.sh/data/github_small.jsonl.zst. It seems the value of the `file_name` key is identical for every file within a repo.
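
One way to check for this symptom is to group records by `repo_name` and count the distinct `file_name` values in each group. The snippet below is a minimal sketch, assuming the data has already been decompressed into JSON lines shaped like the sample above. The helper names `file_name_stats` and `suspicious_repos` and the in-memory sample records are hypothetical and only serve the illustration.

```python
import json
from collections import defaultdict


def file_name_stats(lines):
    """Return {repo_name: (record_count, distinct_file_name_count)}."""
    counts = defaultdict(int)
    names = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        meta = json.loads(line)["meta"]
        counts[meta["repo_name"]] += 1
        names[meta["repo_name"]].add(meta["file_name"])
    return {repo: (counts[repo], len(names[repo])) for repo in counts}


def suspicious_repos(lines):
    """Repos with several records that all share a single file_name."""
    stats = file_name_stats(lines)
    return sorted(repo for repo, (n, distinct) in stats.items()
                  if n > 1 and distinct == 1)


# Hypothetical records for illustration only (not taken from the dataset).
sample = [
    json.dumps({"text": "## [1.1]", "meta": {"repo_name": "a/b", "file_name": "x.sh"}}),
    json.dumps({"text": "echo hi", "meta": {"repo_name": "a/b", "file_name": "x.sh"}}),
    json.dumps({"text": "print(1)", "meta": {"repo_name": "c/d", "file_name": "m.py"}}),
    json.dumps({"text": "# doc", "meta": {"repo_name": "c/d", "file_name": "README.md"}}),
]

flagged = suspicious_repos(sample)
print(flagged)
```

A repo that really does contain several files with the same base name would also be flagged, so this is only a rough heuristic.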

osainz59 commented 8 months ago

This is a bug caused by https://github.com/EleutherAI/github-downloader/blob/345e7c4cbb9e0dc8a0615fd995a08bf9d73b3fe6/download_repo_text.py#L201C25-L201C49

The code appends a reference to the same dict every time, so only the name and type of the last file are stored in `meta`.
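
The aliasing problem described above can be reproduced in a few lines. This is a simplified sketch and not the actual code from `download_repo_text.py`: the function names, the file list, and the `CHANGELOG.md` file name are assumptions made for illustration.

```python
def collect_buggy(repo_meta, files):
    """Reuse one dict for every file, so all entries alias the same object."""
    out = []
    meta = dict(repo_meta)
    for name, mime, text in files:
        meta["file_name"] = name
        meta["mime_type"] = mime
        out.append([text, meta])  # appends a reference, not a copy
    return out


def collect_fixed(repo_meta, files):
    """Build a fresh dict per file so each entry keeps its own metadata."""
    out = []
    for name, mime, text in files:
        meta = dict(repo_meta, file_name=name, mime_type=mime)
        out.append([text, meta])
    return out


# Hypothetical file list for illustration only.
repo_meta = {"repo_name": "Hax4us/Apkmod"}
files = [
    ("CHANGELOG.md", "text/plain", "## [1.1]"),
    ("jadx_termux.sh", "text/plain", "#!/bin/sh"),
]

buggy = collect_buggy(repo_meta, files)
fixed = collect_fixed(repo_meta, files)

print([m["file_name"] for _, m in buggy])
print([m["file_name"] for _, m in fixed])
```

In the buggy variant every entry reports the last file's name because each list element points at the same dict. Creating a new dict per file (or appending `dict(meta)`) avoids this.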