CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License

Intermediate Data Storage Format #13

Open ncoop57 opened 1 year ago

ncoop57 commented 1 year ago

Questions:

Let's use this issue to discuss this topic.

Resources:

taisazero commented 1 year ago

For datasets that we need to process (filtering, exact dedup, and near dedup) before using lm_format, using Arrow or a similar intermediary format like Parquet would make the process faster and more efficient. Are there any datasets that do not need to be processed/filtered?

ncoop57 commented 1 year ago

We probably wouldn't need to process ones that are already included from a paper or available on Hugging Face.

flowpoint commented 1 year ago

Parquet through pyarrow would be quite good. I want to use it between processing steps anyway.

vanga commented 1 year ago

Converting the XML and storing it in Parquet would be pretty useful and efficient. Parquet offers much better data compression and stores data in columnar format, so it's faster to load as pandas objects and to run other computations, joins, etc. on top of it.

As we iterate multiple times on different post-processing possibilities, being able to load fast would be useful. [Attached image: example size difference for one site.]

Other than being human-readable when one wants to inspect files manually, the JSON format offers no advantage.