ONSdigital / SDG_11.2.1

Analysis for the UN Sustainable Development Goal 11.2.1
https://onsdigital.github.io/SDG_11.2.1/
Apache License 2.0

407 create cloud space #415

Closed: james-westwood closed this pull request 6 months ago

james-westwood commented 1 year ago

Pull Request submission

This is the first stage in getting our data onto GCP. Although it might make the initial data load slower, it will negate the need for each user to keep a complete copy of the data on their machine.

To keep the GCP account as secure as possible I have created a bucket which is not open to the public. I may be able to open it up later, but I would probably need to speak to somebody in cyber security (or similar) first. For now, there is a "service account" which allows reading and listing of the files in that particular bucket only. To load these credentials you'll need the key, which is a JSON file; you will need to put it in the secrets/ folder.

I will instruct the reviewer(s) how to get the json file separately.

I believe this PR meets the following requirements

  1. Create a Google bucket
  2. Change the data source to the bucket - mount the bucket as a drive
  3. Test that the data loads

For point 2, I would say that I do not actually "mount the bucket as [a] drive". That deliverable was written when I didn't understand how to work with the bucket properly. In fact you can mount it, but it's not advisable and requires changing settings at the OS level. Instead I am creating a GCP storage client object in Python, which is the advised way of achieving the same thing.
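
For illustration, a minimal sketch of that approach using the google-cloud-storage client library (the bucket name, key filename and object path below are placeholders rather than the project's real values):

```python
from google.cloud import storage

# Authenticate with the service account key kept in the secrets/ folder
client = storage.Client.from_service_account_json(
    "secrets/service_account_key.json"  # placeholder filename
)

# Get a handle on the private bucket and list the files it contains
bucket = client.bucket("sdg-11-2-1-data")  # placeholder bucket name
for blob in client.list_blobs(bucket):
    print(blob.name)

# Read a single object without mounting the bucket as a drive
blob = bucket.blob("data/stops.csv")  # placeholder object path
blob.download_to_filename("data/stops.csv")
```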

The changes I have made here are:

Closes or fixes

Closes #407

Code

Documentation

Any new code includes all the following forms of documentation:

Data

Testing


Peer Review Section

Final approval (post-review)

The author has responded to my review and made changes to my satisfaction.


Review comments

Insert detailed comments here!

These might include, but not exclusively:

Your suggestions should be tailored to the code that you are reviewing. Be critical and clear, but not mean. Ask questions and set actions.

paigeh-fsa commented 11 months ago

Discussed with @jwestw the plan of action for the rest of the ticket. We would like to integrate the GCP bucket we created into our script via a switch in the config (either cloud or local), which means we can decide whether to use cloud or local data.

We will be using the function generate_signed_url() in GCP.

There are two courses of action for this: 1) We could create two modules (local_ingest.py, cloud_ingest.py) which would have exactly the same function names but be tailored to either the cloud or a local machine. For example, we would use a URL to ingest data from the cloud, but a file path on a local machine.

2) The above option may mean we have WET code with a lot of duplication and minimal changes. We could instead write a function such as path_or_url() which, depending on the config switch, would build either the file path or the URL for a given dataset. This would mean we wouldn't be repeating code.

I will be testing out both methods and tidying the functions in data_ingest.py.
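
As a rough sketch of the second option (assuming a simple config dict with a data_source switch; the bucket name, key filename and the helper name path_or_url() are illustrative, not the final implementation):

```python
from datetime import timedelta
from google.cloud import storage

config = {"data_source": "cloud"}  # switch between "cloud" and "local"


def path_or_url(file_name: str, config: dict) -> str:
    """Return a local file path or a signed GCS URL for the given dataset."""
    if config["data_source"] == "local":
        return f"data/{file_name}"
    # Cloud branch: build a time-limited signed URL with generate_signed_url()
    client = storage.Client.from_service_account_json(
        "secrets/service_account_key.json"  # placeholder filename
    )
    blob = client.bucket("sdg-11-2-1-data").blob(file_name)  # placeholder bucket
    return blob.generate_signed_url(
        version="v4", expiration=timedelta(minutes=15), method="GET"
    )
```

Either return value could then be passed straight to something like pandas.read_csv(), so the rest of data_ingest.py would not need to know which source is in use.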

paigeh-fsa commented 11 months ago

Have run into an error below:

[Screenshot 2023-08-11 at 14:20:00]

There appears to be an error when we try to load the data from the cloud. Need to dig a bit deeper into why this is happening but leaving this for now.

paigeh-fsa commented 11 months ago

Have another issue with the SDG_scotland.py script:

[Screenshot 2023-08-11 at 15:36:10]

Running main before SDG_scotland.py should hopefully resolve this issue.

paigeh-fsa commented 11 months ago

A lot of the work has now been completed - I have ticked the files below that I've completed:

We currently have a few caveats around the issues I've mentioned above. For some reason, we can't run SDG_scotland.py because of an issue when we use from main import stops_geo_df; I'm not sure why this is, and not actually sure what this import does. We are also currently importing the geo_df from local data (screenshot above again) while we figure out why it isn't able to get the file from the Google bucket. Aside from these two issues, all scripts will run with the data from the bucket :)

paigeh-fsa commented 10 months ago

The requirements of the ticket are now complete.

We have an end-to-end process where we can use either local data on a local machine or cloud data from our GCP bucket.

Ticket is now ready for review.

jwestw commented 8 months ago

@paigeh1 and I have been reviewing this and it has gone quite well, as we have made a lot of fixes that allow the system to run with no local data present, entirely relying on the cloud-hosted data. We have successfully run:

However, we are experiencing an error when running SDG_northern_ireland.py:

[image: error output]

And this is what my local folder looks like (not sure if these files have just been downloaded, but I think they have)

[image: contents of the local data folder]

paigeh-fsa commented 7 months ago

Next steps:

jwestw commented 7 months ago

I am re-running every pipeline after deleting not only the files but the folders in the data folder. This has created a lot of problems which I am solving.

Successfully run:

I have also made main.py into a runner which runs all pipelines in order.

I have also added a number of improvements that create folders if they don't exist.
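
Purely as an illustration of those two ideas together (the folder list and pipeline order below are placeholders rather than the repo's actual layout):

```python
import os
import runpy

# Make sure the expected data folders exist before any pipeline runs
for directory in ["data", "outputs"]:  # placeholder folder names
    os.makedirs(directory, exist_ok=True)

# Run the country pipelines in order, roughly as the main.py runner now does
for pipeline in ["SDG_scotland.py", "SDG_northern_ireland.py"]:  # illustrative order
    runpy.run_path(pipeline, run_name="__main__")
```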