Closed nkshaw23 closed 1 year ago
@nkshaw23 Good work! I will try to pull this and run today
WIP: I will come back to this peer review form. Reviewing below
I've also run this - there are a couple of issues with the pipeline so far. We have managed to resolve some of them - @nkshaw23 is having a look at an issue with importing modules across different directories. Currently, importing `data_ingest` in `data_transform` fails when we run `eng_wales_pre_process.py` because of the different directories.
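For the import problem, a common workaround (a sketch only - the real fix depends on how the repo's packages are laid out) is to put the repository root on `sys.path` before importing sibling packages, or to run the scripts as modules with `python -m` from the repo root:

```python
import os
import sys

def ensure_on_path(path):
    """Put `path` at the front of sys.path so packages that live there
    (e.g. data_ingest, data_transform) import the same way regardless
    of the directory the script is launched from."""
    path = os.path.abspath(path)
    if path not in sys.path:
        sys.path.insert(0, path)

# e.g. near the top of eng_wales_pre_process.py (layout assumed):
# ensure_on_path(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
# import data_ingest as di
```

The exact argument to `ensure_on_path` depends on where the script sits relative to the packages, so treat the commented usage as illustrative.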
The pipeline also seems to take a long time to run. When I ran `main.py` for Hastings, it took 364 seconds. I tried a different local authority, Somerset West and Taunton, and this took 327 seconds. Maybe something is happening twice in the background? Something worth looking into.
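One way to check the "something happening twice" theory (an illustrative sketch, not code from this PR) is to time the individual stages rather than the whole of `main.py`:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name):
    """Print how long a named pipeline stage takes; a stage that runs
    twice will show up twice in the console output."""
    start = time.perf_counter()
    yield
    print(f"{name}: {time.perf_counter() - start:.1f}s")

# Usage (stage names are illustrative, not the real ones in main.py):
# with stage_timer("load timetable"):
#     timetable_df = load_timetable(...)
```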
I've managed to make temp fixes to test the rest of the pipeline and it runs for me. However, I've found something interesting with the local authority 'Hastings'. There doesn't seem to be any information for rural population in Hastings. Maybe Hastings is completely rural and this is fine.
It also seems a bit strange that both the proportion of males and females are being served by public transport with the exact same percent. I ran for another local authority and the same issue occurs too.
This screenshot shows the output for local authority Hastings (with Urban/Rural and sex issue)
This screenshot shows the output for local authority Somerset West and Taunton with the sex issue
Thanks for all your testing @paigeh1! Whilst CDSW is down I have had a look into the above.
RE RUC: Surprisingly, it turns out that Hastings is all urban! See the orange polygons below, with the red outline being the LA boundary. But then looking on Google Maps, you can see why.
Tested with some other LAs and getting split %s between urban and rural. Ran the code through the debugger and everything looks okay to me.
RE Sex

There is something weird going on in your screenshots. The total population for Hastings is almost 7 million! And because you have such a large value there, the percentages for each sex are actually correct when rounded to 2 dp.
When I run it I get these overall population figures in the console, and we can see in the CSV that the sex calculation looks correct: the totals for each sex match the totals for the overall population and for served/unserved.
Same thing going on for Somerset West and Taunton - almost 8,000,000 people!
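For anyone puzzled by the identical percentages: with a total in the millions, a male/female split only a few hundred people apart rounds to the same value at 2 dp. The numbers below are illustrative, not the real Hastings figures:

```python
total = 6_900_000                        # illustrative, roughly the inflated total
males, females = 3_450_100, 3_449_900    # only 200 people apart

pct_m = round(100 * males / total, 2)
pct_f = round(100 * females / total, 2)
print(pct_m, pct_f)  # both 50.0 - a gap under ~0.005% of the total vanishes at 2 dp
```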
But on the back of looking at the sex, I did see an issue with the disability disaggregation. We never actually append these results, but just append sex again (line 452 below). This has been updated in the latest pushes.
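A minimal sketch of the kind of bug described (the names and values here are hypothetical, not the actual line 452):

```python
# Results for each disaggregation are collected in a list, but the same
# object is appended twice instead of the disability one.
results = []

sex_summary = {"disaggregation": "sex", "served_pct": 50.0}
disability_summary = {"disaggregation": "disability", "served_pct": 47.3}

results.append(sex_summary)
results.append(sex_summary)              # BUG: disability never lands in the output
# results.append(disability_summary)     # the fix in the latest pushes

print([r["disaggregation"] for r in results])  # ['sex', 'sex'] - disability missing
```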
Thanks for looking into this @nkshaw23 ! That's interesting, I'm not sure what is happening with my data and the population being so large - will have to look into it.
RE time taken
Mine seems to be taking roughly the correct amount of time (at least compared to the current way of running) - about 2 minutes. If you had the debugger on and working through the code, then be aware the time calculated at the end will take this into account. Other than that, might have just been a slow day?
Changes from @paigeh1 testing, and changes discussed above have been pushed.
Latest instructions for testing @Antonio-John @james-westwood
1) Delete everything in the `england_bus_timetable` directory (except `.gitkeep`)
2) The `england_train_timetable` directory should look like this
3) Delete `RUC11_OA11_EW.feather` from the `data` directory
4) Delete `whole_nation_2019.feather` from the `data` directory (optional, as this takes a while to rebuild)
5) Make sure you have an `eng_wales_preprocessed` directory
6) Run `SDG_bus_timetable`
7) Run `SDG_train_timetable`
8) Run `eng_wales_pre_process`
9) Run `main`

Note that Scotland and NI will now not run because we have changed `main`. There are tickets out to update Scotland and NI to match the template of England/Wales, which I can do after this has been tested and merged.
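The delete/setup steps (1-5) above can be sketched as shell commands - the directory locations are assumed relative to the repo root, so double-check paths before running:

```shell
# Steps 1-5 above as shell commands, run from the repo root.
# Paths are assumed - verify them before deleting anything.
mkdir -p data/england_bus_timetable          # no-op if it already exists

# 1) clear the bus timetable directory, keeping .gitkeep
find data/england_bus_timetable -type f ! -name '.gitkeep' -delete

# 3) + 4) remove the cached feather files (step 4 is optional)
rm -f data/RUC11_OA11_EW.feather
rm -f data/whole_nation_2019.feather

# 5) make sure the preprocessed output directory exists (location assumed)
mkdir -p data/eng_wales_preprocessed
```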
I get an error when following these instructions. Just to make sure this wasn't due to my local machine being weird, I tried it on a Google VM too and I get the same error.
I have uploaded the entire output of the stack trace here: stacktrace.txt
`pandas.errors.EmptyDataError: No columns to parse from file` is occurring when `di._csv_to_df(...)` is called on line 122 of `SDG_bus_timetable.py`.

So why is this happening? I am pretty sure I saw this same error some months back.
After I deleted the data from `data/england_bus_timetable`, it all had to be replaced. The process of downloading and extracting it seems to go smoothly, but `trips.txt` and `calendar.txt` end up empty (zero bytes). (`stop_times.txt` is not, however.)
Checking the file sizes on the command line:
Then when `di._csv_to_df()` calls `pd.read_csv()`, it fails.
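A cheap guard before the `pd.read_csv()` call would turn the cryptic `EmptyDataError` into a clear message. This is a sketch - `check_not_empty` is a hypothetical helper, not part of `di`:

```python
import os

def check_not_empty(path):
    """Raise a clear error if an extracted timetable file is zero bytes,
    which pd.read_csv otherwise reports as 'No columns to parse from file'."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} is missing - did the download step run?")
    if os.path.getsize(path) == 0:
        raise ValueError(
            f"{path} is empty (0 bytes) - the zip extraction likely failed, "
            "e.g. from running out of disk space."
        )
```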
I think this is an issue with the downloading and extracting of the bus data, rather than the code you've refactored here - and this is why I think I saw this months ago.
EDIT: ignore all of the above. I just ran this on the GCP machine again with more disk space. I think in both cases the problem came down to not having enough space, so the extraction process wasn't working. Now this stage, running `SDG_bus_timetable.py`, is working well I think.
It's worth noting that when I run this in a fresh environment with no data in it, running `eng_wales_pre_process.py` gives:

`FileNotFoundError: [Errno 2] No such file or directory: 'data/england_bus_timetable/bus_highly_serviced_stops.feather'`

I think that means that `SDG_bus_timetable.py` was not successful after all.
Going back to debug. In debug mode I get no errors; it seems to just run and complete.
However, in normal Python I get "killed" at the end of the process.
You might remember that from here: https://github.com/ONSdigital/SDG_11.2.1/pull/350#pullrequestreview-1220949363
Apparently you get "killed" when you run out of RAM. I just ran the Linux diagnostic `htop` and it shows this script maxes out the RAM. Will increase the RAM and run the script again.
I have upped the RAM on the GCP VM to 16GB.
The memory used by the script is over 8GB
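To confirm peak memory from inside the script itself, rather than watching `htop`, the stdlib `resource` module can report it (a Unix-only sketch; note `ru_maxrss` is in KiB on Linux but bytes on macOS):

```python
import resource

def peak_rss_gib():
    """Peak resident set size of this process, in GiB, assuming the
    Linux convention that ru_maxrss is reported in KiB."""
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kib / (1024 ** 2)

# e.g. print at the end of eng_wales_pre_process.py:
# print(f"peak memory: {peak_rss_gib():.1f} GiB")
```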
This time `bus_highly_serviced_stops.feather` was successfully written out.
It took me almost all day (data files were missing, and since the script is slow to run it takes a while to hit the points that error on missing files), but `eng_wales_pre_process.py` has finally run to completion. :clap::clap::clap:
`environment.yml` <- I didn't use the `requirement.txt`.

The author has responded to my review and made changes to my satisfaction.
I have checked through all the code and it looks good. I have created a new ticket #408 because rebuilding the whole-nation population file took a ridiculously long time.
I can't see anything in the WIP docs about these changes. I think we need to instruct users on how to run this - i.e. in what order to run the scripts.
Pull Request submission

Several large changes:

- `naptan_df` import moved into the train timetable script. This was imported from `main`, but we need to run the train timetable before `main`. It now loads directly in the train timetable script.
- `create_tiploc` was in preprocessing as it is only run for trains, but then we subset for trams so it doesn't get used. Moved to the train timetable as it is only applicable to trains.
- `Station locations.csv` is no longer needed, so removed from code and from sync.

Tested the full pipeline (bus feather files, train feather file, whole-nation pop file and stops feather all deleted) and compared to the outputs in #352. Advise testers to do the same (the whole-nation pop takes a while, so you can keep this if you want).

Run the timetable scripts, then the eng_wales processing module and then `main.py` to test england_wales.

Closes or fixes

Closes #380

Code