EleutherAI / the-pile


pass2_shuffle_holdout.py - ModuleNotFoundError: No module named 'parse' #110

Closed dboggs95 closed 1 year ago

dboggs95 commented 1 year ago

Goal

I'm trying to replicate a subset of the Pile that works with the GPT-NeoX trainer. I have pretty good hardware, but nothing like the 90 Tesla rack that made the 20-billion-parameter GPT-NeoX-20B model. So, I'm trying to keep everything simple while I learn how this process works.

Environment and Setup

I am using Python 3.8 + pip on WSL2 + Ubuntu 22.04. Since the project requires Python 3.6 or above, I figured that should be fine.

I have run the setup.py and installed the project, so all the requirements declared by the project are there.

I also commented out every dataset except for Gutenberg because, like I said, I'm trying to keep everything simple.

Problem

I downloaded the full Gutenberg dataset and ran the first pass shuffle on it via pile.py.

For the second pass, I ran pass2_shuffle_holdout.py, but it fails with the error below. The readme mentions that some of the scripts may be obsolete, but doesn't say which ones. This doesn't look like an obsolete script, and several other scripts import the parse module. If I read everything correctly, I need two shuffle passes to do this properly, so I'm 99% sure this script is supposed to work.

Error Message

Traceback (most recent call last):
  File "./processing_scripts/pass2_shuffle_holdout.py", line 8, in <module>
    import parse
ModuleNotFoundError: No module named 'parse'

Research

I looked in the repo for a parse module and did not find it. I looked online for a parse module for Python 3, and I couldn't find any evidence that one exists.

I searched the web, Stack Overflow, and the GitHub issues for a solution.

dboggs95 commented 1 year ago

I just figured out the answer to my own question.

There is a Python library called parse: https://pypi.org/project/parse/

I just had to run this command to install it:

pip install parse
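
For anyone hitting the same error, a quick way to confirm whether the module is visible to the interpreter before rerunning the script might look like the sketch below. The helper name `missing_modules` is hypothetical and not part of the-pile; it also assumes the import name matches the PyPI package name, which happens to be true for parse but is not true for every package.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# "parse" is the module named in the traceback above.
missing = missing_modules(["parse"])
if missing:
    # Assumption: the pip package name equals the import name.
    print("Missing modules; try: pip install " + " ".join(missing))
else:
    print("All required modules are importable.")
```

Run it with the same Python interpreter used for the processing scripts, since a package installed for a different interpreter will still show up as missing.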

Right after that, I noticed it couldn't create the directories it needed to run, so I manually added pile_output and pile_holdout folders to the project root.
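
Creating those folders can also be scripted. The following is only a sketch: the folder names come from the comment above, the helper name `ensure_output_dirs` is hypothetical, and it assumes the script expects the folders directly under the project root. The demonstration runs against a temporary directory so it does not touch a real checkout.

```python
import tempfile
from pathlib import Path

# Folder names taken from the comment above.
OUTPUT_DIRS = ("pile_output", "pile_holdout")


def ensure_output_dirs(project_root):
    """Create the output folders under project_root if they are missing."""
    created = []
    for name in OUTPUT_DIRS:
        path = Path(project_root) / name
        # exist_ok=True makes repeated calls harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory; pass the real project root instead.
with tempfile.TemporaryDirectory() as demo_root:
    for path in ensure_output_dirs(demo_root):
        print(path.name, path.is_dir())
```

In a real checkout, calling `ensure_output_dirs(".")` from the project root before running pass2_shuffle_holdout.py should have the same effect as creating the two folders by hand.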