ONSdigital / SDG_11.2.1

Analysis for the UN Sustainable Development Goal 11.2.1
https://onsdigital.github.io/SDG_11.2.1/
Apache License 2.0

Extracting out pre-processing for England and Wales #385

Closed nkshaw23 closed 1 year ago

nkshaw23 commented 1 year ago

Pull Request submission

Several large changes

Tested the full pipeline (bus feather files, train feather file, whole-nation pop file and stops feather all deleted) and compared to outputs in #352. I advise testers to do the same (the whole-nation pop file takes a while to rebuild, so you can keep it if you want).

Run the timetable scripts, then the eng_wales pre-processing module, and then main.py to test the England/Wales pipeline.

Closes or fixes

Closes #380

Code


jwestw commented 1 year ago

@nkshaw23 Good work! I will try to pull this and run it today.

jwestw commented 1 year ago

WIP: I will come back to this peer review form. Reviewing below

paigeh-fsa commented 1 year ago

I've also run this; there are a couple of issues with the pipeline so far. We have managed to resolve some of them. @nkshaw23 is looking at an import problem caused by the modules living in different directories: currently, importing data_ingest in data_transform fails when we run eng_wales_pre_process.py.
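A common workaround for this kind of cross-directory import problem is to put the repo root on `sys.path` before importing. This is a sketch, not the project's actual fix; the helper name and directory layout are assumptions:

```python
import sys
from pathlib import Path

def ensure_repo_root_on_path(script_file: str, levels_up: int = 1) -> str:
    """Prepend the repo root to sys.path so that absolute imports
    (e.g. importing data_ingest from data_transform) resolve no matter
    which directory the script is launched from.

    levels_up: how many directories sit between the script's own
    directory and the repo root (1 means repo_root/subdir/script.py).
    """
    root = str(Path(script_file).resolve().parents[levels_up])
    if root not in sys.path:
        sys.path.insert(0, root)
    return root

# Hypothetical usage at the top of eng_wales_pre_process.py:
# ensure_repo_root_on_path(__file__)
# import data_ingest as di
```

Alternatives are making the modules an installable package, or always launching scripts with `python -m` from the repo root.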

The pipeline also seems to take a long time to run. When I ran main.py for Hastings, it took 364 seconds; a different local authority, Somerset West and Taunton, took 327 seconds. Maybe something is happening twice in the background? Worth looking into.
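One way to find out whether a stage really runs twice is a crude per-stage timer. This is a generic sketch; the stage name in the usage comment is hypothetical, not taken from main.py:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print how long a pipeline stage takes, so a stage that runs
    twice (or dominates the ~360s total) shows up immediately."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Hypothetical usage inside main.py:
# with timed("load population file"):
#     pop_df = load_population()  # placeholder name
```

For a fuller picture, `python -m cProfile -s cumtime main.py` would show exactly which calls dominate.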

I've managed to make temporary fixes to test the rest of the pipeline, and it runs for me. However, I've found something interesting with the local authority 'Hastings': there doesn't seem to be any information for its rural population. Maybe Hastings is completely urban and this is fine.

It also seems a bit strange that the proportions of males and females served by public transport are exactly the same percentage. I ran another local authority and the same issue occurs there too.

This screenshot shows the output for local authority Hastings (with Urban/Rural and sex issue)

Screenshot 2023-03-03 at 16 02 05

This screenshot shows the output for local authority Somerset West and Taunton with the sex issue

Screenshot 2023-03-03 at 16 05 48
nkshaw23 commented 1 year ago

Thanks for all your testing @paigeh1! Whilst CDSW is down I have had a look into the above.

RUC: Surprisingly, it turns out that Hastings is all urban! See the orange polygons below, with the red outline being the LA boundary. And looking on Google Maps, you can see why.

image

Tested some other LAs and got split percentages between urban and rural. Ran the code through the debugger and everything looks okay to me.

nkshaw23 commented 1 year ago

RE Sex

There is something weird going on in your screenshots: the total population for Hastings is almost 7 million! And because the totals are so large, the percentages for each sex are actually correct when rounded to 2 dp; they just coincide.

When I run it I get these overall population figures in the console; in the CSV the sex calculation looks correct, and the totals for each sex match the totals for the overall population and for served/unserved.

image

image

Same thing going on for Somerset West and Taunton: almost 8,000,000 people!
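The rounding point can be illustrated with toy numbers (made up for illustration, not the real figures): with totals in the millions, the two sexes' served percentages can genuinely differ yet round to the same value at 2 dp.

```python
# Made-up figures, roughly the scale seen in the screenshots.
total_male, total_female = 3_450_000, 3_460_000
served_male, served_female = 3_311_655, 3_321_263

pct_male = round(100 * served_male / total_male, 2)
pct_female = round(100 * served_female / total_female, 2)

# Different raw proportions, identical once rounded to 2 dp.
print(pct_male, pct_female)  # → 95.99 95.99
```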

nkshaw23 commented 1 year ago

But while looking at the sex figures, I did see an issue with the disability disaggregation: we never actually append those results, but just append sex again (line 452 below). This has been fixed in the latest pushes.

image
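The bug pattern, reconstructed hypothetically from the description above (the names and structure are illustrative, not the actual code at line 452):

```python
def collect_disaggregations(sex_results: dict, disability_results: dict) -> list:
    """Append each disaggregation's results exactly once. The bug
    described above was equivalent to re-appending sex_results where
    disability_results was intended, so disability never reached the
    output."""
    results = []
    results.append(("sex", sex_results))
    # Before the fix, this line effectively appended sex_results again:
    results.append(("disability", disability_results))
    return results
```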

paigeh-fsa commented 1 year ago

Thanks for looking into this @nkshaw23! That's interesting. I'm not sure what is happening with my data and why the population is so large; I will have to look into it.

nkshaw23 commented 1 year ago

RE time taken

Mine seems to be taking roughly the right amount of time (at least compared to the current way of running): about 2 minutes. If you had the debugger on and were stepping through the code, be aware that the time reported at the end includes that. Other than that, it might have just been a slow day?

image

nkshaw23 commented 1 year ago

Changes from @paigeh1 testing, and changes discussed above have been pushed.

Latest instructions for testing @Antonio-John @james-westwood

1) Delete everything in the england_bus_timetable directory (except .gitkeep)
2) The england_train_timetable directory should look like this:

image

3) Delete RUC11_OA11_EW.feather from the data directory
4) Delete whole_nation_2019.feather from the data directory (optional, as this takes a while to rebuild)
5) Make sure you have an eng_wales_preprocessed directory
6) Run SDG_bus_timetable
7) Run SDG_train_timetable
8) Run eng_wales_pre_process
9) Run main
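Steps 1-5 above could be scripted so testers don't miss a file. A sketch only: paths are assumed relative to the repo root, and the location of the eng_wales_preprocessed directory is an assumption.

```python
from pathlib import Path

def clean_for_retest(repo_root: str, keep_whole_nation: bool = True) -> list:
    """Delete the cached files listed in steps 1-5 so the pipeline
    rebuilds them, keeping .gitkeep and (optionally) the slow-to-build
    whole-nation population file. Returns the names of deleted files."""
    root = Path(repo_root)
    removed = []

    # Step 1: empty the bus timetable directory, except .gitkeep.
    bus_dir = root / "data" / "england_bus_timetable"
    if bus_dir.is_dir():
        for f in bus_dir.iterdir():
            if f.is_file() and f.name != ".gitkeep":
                f.unlink()
                removed.append(f.name)

    # Steps 3-4: delete cached feather files from the data directory.
    targets = ["RUC11_OA11_EW.feather"]
    if not keep_whole_nation:
        targets.append("whole_nation_2019.feather")
    for name in targets:
        p = root / "data" / name
        if p.exists():
            p.unlink()
            removed.append(name)

    # Step 5: make sure the output directory exists (location assumed).
    (root / "data" / "eng_wales_preprocessed").mkdir(parents=True, exist_ok=True)
    return removed
```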

Note that Scotland and NI will not run now, because we have changed main. There are tickets open to update Scotland and NI to match the England/Wales template, which I can do after this has been tested and merged.

jwestw commented 1 year ago

EDIT: this is now solved; keeping the post for reference.

I get an error when following these instructions. Just to make sure this wasn't my local machine being weird, I tried it on a Google VM too and got the same error.

I have uploaded the entire output of the stack trace here: stacktrace.txt

`pandas.errors.EmptyDataError: No columns to parse from file` occurs when `di._csv_to_df(...` is called on line 122 of SDG_bus_timetable.py.

So why is this happening? I am pretty sure I have seen this same error some months back.

After I deleted the data from data/england_bus_timetable it all had to be replaced. The process of downloading and extracting it seems to go smoothly, but trips.txt and calendar.txt end up empty (zero bytes); stop_times.txt does not, however.

Checking the file sizes on the command line: image

Then when `di._csv_to_df()` calls `pd.read_csv()` it fails.

I think this is an issue with the downloading and extracting of the bus data, rather than the code you've refactored here - and this is why I think I saw this months ago.

EDIT: ignore all of the above. I just ran this on the GCP machine again with more disk space. I think in both cases the problem came down to not enough disk space, so the extraction process wasn't completing. Now this stage, running SDG_bus_timetable.py, is working well I think.
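A cheap guard against zero-byte extracted files (a hypothetical addition, not part of this PR) would surface this kind of failure with a clearer message than pandas' `EmptyDataError`:

```python
from pathlib import Path

def empty_timetable_files(dir_path: str,
                          required=("trips.txt", "calendar.txt", "stop_times.txt")) -> list:
    """Return the required GTFS files that are missing or zero bytes,
    so the pipeline can fail fast (e.g. after a disk-full extraction)
    instead of hitting pandas.errors.EmptyDataError later."""
    bad = []
    for name in required:
        p = Path(dir_path) / name
        if not p.exists() or p.stat().st_size == 0:
            bad.append(name)
    return bad

# Hypothetical usage in SDG_bus_timetable.py, right after extraction:
# bad = empty_timetable_files("data/england_bus_timetable")
# if bad:
#     raise RuntimeError(f"Empty or missing timetable files: {bad}")
```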

jwestw commented 1 year ago

It's worth noting that in a fresh environment with no data, when I run eng_wales_pre_process.py

I get:

`FileNotFoundError: [Errno 2] No such file or directory: 'data/england_bus_timetable/bus_highly_serviced_stops.feather'`

I think that means that SDG_bus_timetable.py was not successful after all.

jwestw commented 1 year ago

Going back to debug. In debug mode I get no errors; it seems to just run and complete.

However, in normal Python I get "killed" at the end of the process.

image

You might remember that from here: https://github.com/ONSdigital/SDG_11.2.1/pull/350#pullrequestreview-1220949363

Apparently you get "Killed" when you run out of RAM: the kernel's OOM killer terminates the process. I ran the Linux diagnostic tool htop and it shows this script maxing out the RAM. I will increase the RAM and run the script again.

jwestw commented 1 year ago

I have upped the RAM on the GCP VM to 16GB.

The memory used by the script is over 8GB

image

This time bus_highly_serviced_stops.feather was successfully written out.

jwestw commented 1 year ago

It took me almost all day (I had data files missing, and since the script is slow to run, it takes a while to hit the points that error on missing files), but eng_wales_pre_process.py has finally run to completion. :clap::clap::clap:

jwestw commented 1 year ago

Peer Review Section

Final approval (post-review)

The author has responded to my review and made changes to my satisfaction.


Review comments

I have checked through all the code and it looks good. I have created a new ticket #408 because rebuilding the whole-nation pop file takes a ridiculously long time.

I can't see anything in the WIP docs about these changes. We need to instruct users on how to run this, i.e. which order to run the scripts in.