Will-Shanks / LeedsCapstone

Capstone Team for Leeds Business School 2019

Simplifying job submission and manual parsing #35

Closed Will-Shanks closed 4 years ago

Will-Shanks commented 4 years ago

I got a bit carried away, so this PR does a few things.

- Moves the logic from the submit script into Python (nav.py), which is easier to understand and faster.
- Adds a sketch of how getTitles.py could find all the company names in a manual by switching on the number of columns in the df generated by dayToDF.
  - We just need to plug in code for two- and three-column pages and it will be good to go.
- Moves the logic for one-column page title extraction out of getTitles.py and into oneCol.py.
- Simplifies parsing of the manual. dayToDF.DayReader.lines yields lines from the manual until it reaches a page with a different number of columns, so the caller does not have to handle text that continues onto the next page.
  - Note: This helped uncover a bug in dayToDF._get_cols() that is now in the issue tracker.
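
The two ideas in this description, yielding lines only while the column count stays the same and switching the title extractor on that count, can be sketched as follows. This is a minimal illustration, not the code in the PR: `Page`, `Reader`, `EXTRACTORS`, and `one_col_titles` are hypothetical stand-ins, pages are modeled as plain objects rather than the DataFrames dayToDF produces, and the upper-case test is only a placeholder heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for one parsed page of the manual."""
    n_cols: int
    lines: List[str]


class Reader:
    """Hypothetical reader that remembers the page it is currently on."""

    def __init__(self, pages: Iterable[Page]) -> None:
        self._pages = iter(pages)
        # A default of None avoids StopIteration when there are no pages.
        self._current: Optional[Page] = next(self._pages, None)

    @property
    def n_cols(self) -> Optional[int]:
        """Column count of the current page, or None when exhausted."""
        return None if self._current is None else self._current.n_cols

    def lines(self) -> Iterator[str]:
        """Yield lines until a page with a different column count is reached."""
        if self._current is None:
            return
        run_cols = self._current.n_cols
        while self._current is not None and self._current.n_cols == run_cols:
            yield from self._current.lines
            self._current = next(self._pages, None)


def one_col_titles(lines: List[str]) -> List[str]:
    """Placeholder heuristic: treat fully upper-case lines as titles."""
    return [line for line in lines if line.isupper()]


# Two- and three-column extractors would be registered here later.
EXTRACTORS: Dict[int, Callable[[List[str]], List[str]]] = {1: one_col_titles}


def get_titles(reader: Reader) -> List[str]:
    """Collect titles, choosing an extractor per run of same-layout pages."""
    titles: List[str] = []
    while reader.n_cols is not None:
        extractor = EXTRACTORS.get(reader.n_cols)
        run = list(reader.lines())  # consuming the run advances the reader
        if extractor is not None:
            titles.extend(extractor(run))
    return titles
```

Because each call to `lines()` stops at a layout change, an extractor only ever sees text from pages with one column count, even when an entry runs across a page break.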

Will-Shanks commented 4 years ago

The submission script still needs to be tested, but besides that it should be good to go.

Will-Shanks commented 4 years ago

The submission script still needs to be tested, but besides that it should be good to go.

Submission script works.

pep8speaks commented 4 years ago

Hello @Will-Shanks! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file src/dayToDF.py:

- Line 66:9: E265 block comment should start with '# '
- Line 69:80: E501 line too long (142 > 79 characters)
- Line 94:80: E501 line too long (100 > 79 characters)
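
For reference, both warnings have mechanical fixes. The sketch below is illustrative only: `describe_page` is a hypothetical function, not code from src/dayToDF.py.

```python
def describe_page(path, brightness, year):
    """Build a log message for the page being read (hypothetical helper)."""
    # E265: a block comment starts with a hash followed by a space.

    # E501: wrapping a long expression in parentheses lets it span
    # several physical lines, each within 79 characters.
    message = (
        "Moving to next page, {} (year {}, brightness {})"
        .format(path, year, brightness)
    )
    return message
```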

Comment last updated at 2020-04-13 03:27:13 UTC

kevmonsta commented 4 years ago

To make my request clear: I think everything looks good. I am just not sure about merging the daysToTitles.sh file, as I am not sure we need it.

I'm running the rest of the src code now to make sure I can get this to work.

kevmonsta commented 4 years ago

On second thought, I am going to approve the request and let Will do the merge after he checks over my comments.

kevmonsta commented 4 years ago

Alright, so I think the daytoTitles.sh may be needed to run the code on summit.

:)

So I think we should be good to just merge the whole thing then :)

kevmonsta commented 4 years ago

Tried to run this on both my local device and Summit and got the same error both times:

Sat Apr 11 09:07:56 MDT 2020
DEBUG:root:Creating DayReader for year 1930
DEBUG:root:finding files for year 1930, at brightness 70, with basedir /scratch/summit/diga9728/Moodys/Industrials/
DEBUG:root:Moving to next page, /scratch/summit/diga9728/Moodys/Industrials/OCRrun1930/000/OCRoutputIndustrial19300001-000170.day
Traceback (most recent call last):
  File "/home/keea2562/csci4318/company_names/code/getTitles.py", line 65, in <module>
    get_titles(*sys.argv[1:])
  File "/home/keea2562/csci4318/company_names/code/getTitles.py", line 27, in get_titles
    dr = dayToDF.DayReader(year)
  File "/home/keea2562/csci4318/company_names/code/dayToDF.py", line 267, in __init__
    self._df = next(self._pages) # current page df
  File "/home/keea2562/csci4318/company_names/code/dayToDF.py", line 326, in _next_page
    yield get_df(p)
  File "/home/keea2562/csci4318/company_names/code/dayToDF.py", line 220, in get_df
    df = _get_lines(df)
  File "/home/keea2562/csci4318/company_names/code/dayToDF.py", line 159, in _get_lines
    for i in range(df['col'].max() + 1):
TypeError: 'float' object cannot be interpreted as an integer
Sat Apr 11 09:08:32 MDT 2020

kevmonsta commented 4 years ago

After some more looking, it appears that the issue is that in line 269 of the dayToDF file we are using the next command to iterate over a df.

However if the df is empty it throws the exception seen above.

I am working on handling this.
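
The TypeError in the traceback can be reproduced without the manual. pandas returns NaN, which is a float, from `max()` on an empty Series, and an integer column that contains NaN is stored as float64 by default; `range()` rejects floats in either case. Below is a minimal sketch of a guard, assuming a hypothetical helper `column_count` that is not part of dayToDF:

```python
import math


def column_count(max_col):
    """Return how many columns to loop over.

    max_col is the value of df['col'].max(): an int, a float such as
    2.0, or NaN when the page produced no rows.
    """
    if max_col is None:
        return 0
    if isinstance(max_col, float) and math.isnan(max_col):
        return 0  # empty page: nothing to iterate over
    return int(max_col) + 1
```

The failing loop could then read `for i in range(column_count(df['col'].max())):`, which iterates zero times on an empty page instead of raising.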

Will-Shanks commented 4 years ago

I think we should add a comment at the top of each file stating what its dependencies are and what it does.

For instance, if I just open the daytotitles file I have no clue what it is supposed to be used for.

Also, can I get a description of what we mean by title? Is that the company name?

All of the Python files already have a comment at the top that describes what the code should do, followed by the import statements, which are all of the file's dependencies. I'll add a comment to the daysToTitles.sh sbatch script describing its purpose.

For the conf.py.

I think this just exists to help build some of the documentation.

But I am not sure.

Yes, everything in the docs dir is for sphinx generated documentation files.

Alright,

After looking through this here is my understanding.

We are merging the code in the src folder to the master branch.

This code has 5 files:

- dayToDF
- draw
- getTitles
- nav
- oneCol

Correct, it updates a lot of the code.

Each of these has a corresponding html doc describing what it does and how to use it.

Then there are a few other HTML files we are merging as well: genindex, index, modules, pymodindex, and search.

These all look like they exist to build the structure of the document database Will made. No issues here, everything looks good.

Then there is also the searchindex.js script that links it all together I believe.

Exactly, everything in docs/html is generated by sphinx when you run the buildDocs.sh script. The same is true of everything in docs/rst besides docs/rst/index.rst (hence it being added to the repo and the rest being ignored by the docs/rst/.gitignore file).

It also looks like we are merging in the daystotitles.sh script; however, I think its functionality got replaced by the getTitles.py script. So maybe we don't need to merge this?

dayToTitles.sh is the sbatch script that submits a job to summit to run getTitles.py. There used to be a lot of logic in dayToTitles.sh that found the .day files; this logic has been moved to nav.py, which makes it faster and easier to understand and update.
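
Finding the .day files in Python rather than in the sbatch script might look like the sketch below. It is not the contents of nav.py: the function name `find_day_files` is hypothetical, and the `OCRrun<year>` directory layout is inferred only from the path in the debug log earlier in this thread.

```python
from pathlib import Path
from typing import List


def find_day_files(basedir: str, year: int) -> List[Path]:
    """Return every .day file under <basedir>/OCRrun<year>, in sorted order.

    The OCRrun<year> layout is an assumption taken from the logged path.
    """
    run_dir = Path(basedir) / f"OCRrun{year}"
    if not run_dir.is_dir():
        return []  # no OCR output for this year
    return sorted(run_dir.rglob("*.day"))
```

Returning a sorted list keeps the page order deterministic between runs, which matters when text continues from one page to the next.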

After some more looking, it appears that the issue is that in line 269 of the dayToDF file we are using the next command to iterate over a df.

However if the df is empty it throws the exception seen above.

I am working on handling this.

I added some error handling for this, could you try running it again and let me know how it goes?
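
One possible shape for error handling of this kind is a page generator that logs and skips any page that fails to parse, so a single empty page does not abort the whole run. This is an assumption about the general approach, not the change that was actually committed; `parsed_pages` and its `parse` argument are hypothetical names.

```python
import logging
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parsed_pages(paths: Iterable[T], parse: Callable[[T], R]) -> Iterator[R]:
    """Yield parse(path) for each path, skipping pages that fail to parse."""
    for path in paths:
        try:
            yield parse(path)
        except (TypeError, ValueError) as err:
            # Log the failure and continue with the next page.
            logging.warning("Skipping page %s: %s", path, err)
```

A caller that needs only the first usable page can write `next(parsed_pages(paths, parse), None)` and check for `None`, which also covers the case where no page parses at all.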