CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License

SE data dump processor #25

Closed vanga closed 2 years ago

vanga commented 2 years ago

For #1

Created a new PR based on "working" branch (Old PR for reference)

vanga commented 2 years ago

@reshinthadithyan I missed reading about using the "working" branch in the README. But I am not sure that forking a non-main branch is general practice, though I could be wrong. GitHub also seems to provide a default option to fork only the main branch.

I think the way people generally do this is to have something like a "stable"/"release" branch and have everyone branch out/fork from main. You merge main into "release" for deployment's sake, or to provide a view of stable code in case we are worried that main could be unstable.

vanga commented 2 years ago

@flowpoint Thanks for pointing out pyproject.toml, I didn't know about it.

vanga commented 2 years ago

Made some more optimizations to skip various processing steps if the target file already exists (unzipping, XML-to-Parquet conversion, denormalization). I am also extracting and converting only selected tables (Posts, Users, Comments).

ncoop57 commented 2 years ago

@vanga yeah, main is for stable releases, but as of right now we have no stable release yet, so we want to merge things quickly from PRs into working. Merge conflicts are to be expected. Just a quick question: in terms of the stackexchange script, is this in a better state than #15 from @flowpoint? If so, we can merge #15 and then override the SE stuff with this PR.

ncoop57 commented 2 years ago

@vanga Any update on this PR? Could you update it to include the new changes?

vanga commented 2 years ago

Hi,

My implementation tries to load and convert the whole XML at once, which is too slow. It was taking more than an hour to convert stackoverflow.com's Posts.xml (the largest site of all the dumps). Memory usage is also quite high, at 400+ GB of RAM.
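The usual fix for this is to parse the dump incrementally instead of loading it whole. A minimal sketch with the standard library's `iterparse` (the element names mirror the SE dump format, where each record is a `<row .../>` element; the in-memory XML string here is just a stand-in for a real dump file):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny stand-in for a Stack Exchange dump file such as Posts.xml.
xml_bytes = b"""<posts>
  <row Id="1" Score="5"/>
  <row Id="2" Score="3"/>
</posts>"""

rows = []
# iterparse streams the file, yielding elements as they complete,
# so memory stays bounded regardless of total file size.
for event, elem in ET.iterparse(BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "row":
        rows.append(dict(elem.attrib))
        elem.clear()  # release the parsed element's memory

print(rows)  # [{'Id': '1', 'Score': '5'}, {'Id': '2', 'Score': '3'}]
```

Batches of `rows` can then be flushed to Parquet periodically instead of materializing the whole table at once.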

@flowpoint has been working on a more performant way to do this using a different approach. The idea was to go with that instead of my approach, which also uses pandas to do JOINs. But he seems to be running into some other challenges.

So, I just tried joining using pyspark today and have a working example of joining with a nested list of objects (posts.comments: [...]). We may possibly mix both of our approaches: use his approach for the XML-to-Parquet conversion and pyspark to do the joins.
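The nested-join shape described here can be sketched in pandas (the pyspark version would use `groupBy` with `collect_list` instead; the column names below are illustrative, not the actual dump schema):

```python
import pandas as pd

posts = pd.DataFrame({"post_id": [1, 2], "title": ["a", "b"]})
comments = pd.DataFrame({"post_id": [1, 1, 2],
                         "text": ["c1", "c2", "c3"]})

# Collect each post's comments into a list, then join that back onto
# posts, producing a nested posts.comments column as described above.
nested = (comments.groupby("post_id")["text"]
          .apply(list)
          .rename("comments")
          .reset_index())
joined = posts.merge(nested, on="post_id", how="left")

print(joined)
```

The `how="left"` keeps posts with no comments; in that case the `comments` cell is NaN and can be filled with an empty list before writing out.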

ncoop57 commented 2 years ago

@vanga heard things are working now? Is this PR in a good place to review?

vanga commented 2 years ago

@ncoop57 this is good to be reviewed. Thanks.

flowpoint commented 2 years ago

I looked over it and there are possible improvements, but I think we can merge this. @ncoop57 @reshinthadithyan