ONSdigital / SDG_11.2.1

Analysis for the UN Sustainable Development Goal 11.2.1
https://onsdigital.github.io/SDG_11.2.1/
Apache License 2.0

407 create cloud space #415

Closed: james-westwood closed this pull request 6 months ago

james-westwood commented 1 year ago

Pull Request submission

This is the first stage in getting our data onto GCP. Although it might make the initial data load slower, it will negate the need for each user to keep a complete copy of the data on their machine.

To keep the GCP account as secure as possible I have created a bucket which is not open to the public. I may be able to open it up later, but I would probably need to speak to somebody in cyber security (or similar) first. For now, there is a "service account" which allows reading and listing of the files in that particular bucket only. To load these credentials you'll need the key, which is a JSON file; you will need to put it in the secrets/ folder.

I will instruct the reviewer(s) how to get the json file separately.

I believe this PR meets the following requirements

  1. Create a Google bucket
  2. Change the data source to the bucket - mount the bucket as a drive
  3. Test that the data loads

For point 2, I would say that I do not actually "mount the bucket as [a] drive". That deliverable was written when I didn't understand how to work with the bucket properly. In fact you can mount it, but it's not advisable and requires changing settings at the OS level. Instead I am creating a GCP storage client object in Python, which is the advised way of achieving the same thing.
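
For illustration, a minimal sketch of that approach using the google-cloud-storage client library (the bucket name, key filename and object path below are placeholders rather than the project's real values):

```python
from google.cloud import storage

# Authenticate with the service account key kept in the secrets/ folder
client = storage.Client.from_service_account_json(
    "secrets/service_account_key.json"  # placeholder filename
)

# Get a handle on the private bucket and list the files it contains
bucket = client.bucket("sdg-11-2-1-data")  # placeholder bucket name
for blob in client.list_blobs(bucket):
    print(blob.name)

# Read a single object without mounting the bucket as a drive
blob = bucket.blob("data/stops.csv")  # placeholder object path
blob.download_to_filename("data/stops.csv")
```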

The changes I have made here are:

Closes or fixes

Closes #407

Code

Documentation

Any new code includes all the following forms of documentation:

Data

Testing


Peer Review Section

Final approval (post-review)

The author has responded to my review and made changes to my satisfaction.


Review comments

Insert detailed comments here!

These might include, but not exclusively:

Your suggestions should be tailored to the code that you are reviewing. Be critical and clear, but not mean. Ask questions and set actions.

paigeh-fsa commented 11 months ago

Discussed with @jwestw the plan of action for the rest of the ticket. We would like to integrate the GCP bucket we created into our script via a switch in the config (either cloud or local), which means we can decide whether to use cloud or local data.

We will be using the function generate_signed_url() in GCP.

There are two courses of action for this: 1) We could create two modules (local_ingest.py, cloud_ingest.py) which would have exactly the same function names but be tailored to either the cloud or a local machine. For example, we would use a URL to ingest data from the cloud, but a file path on a local machine.

2) The above option may mean we have WET code with a lot of duplication and minimal changes. We could instead write a function such as path_or_url() which, depending on the config switch, would build either the file path or the URL for a given dataset. This would mean we wouldn't be repeating code.

I will be testing out both methods and tidying the functions in data_ingest.py.
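
As a rough sketch of the second option (assuming a simple config dict with a data_source switch; the bucket name, key filename and the helper name path_or_url() are illustrative, not the final implementation):

```python
from datetime import timedelta
from google.cloud import storage

config = {"data_source": "cloud"}  # switch between "cloud" and "local"


def path_or_url(file_name: str, config: dict) -> str:
    """Return a local file path or a signed GCS URL for the given dataset."""
    if config["data_source"] == "local":
        return f"data/{file_name}"
    # Cloud branch: build a time-limited signed URL with generate_signed_url()
    client = storage.Client.from_service_account_json(
        "secrets/service_account_key.json"  # placeholder filename
    )
    blob = client.bucket("sdg-11-2-1-data").blob(file_name)  # placeholder bucket
    return blob.generate_signed_url(
        version="v4", expiration=timedelta(minutes=15), method="GET"
    )
```

Either return value could then be passed straight to something like pandas.read_csv(), so the rest of data_ingest.py would not need to know which source is in use.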

paigeh-fsa commented 11 months ago

Have run into an error below:

[Screenshot 2023-08-11 at 14:20:00]

There appears to be an error when we try to load the data from the cloud. Need to dig a bit deeper into why this is happening but leaving this for now.

paigeh-fsa commented 11 months ago

Have another issue with the SDG_scotland.py script:

[Screenshot 2023-08-11 at 15:36:10]

Running main before SDG_scotland.py should hopefully resolve this issue.

paigeh-fsa commented 11 months ago

A lot of the work has now been completed - I have ticked the files below that I've completed:

We currently have a few caveats around the issues I've mentioned above. For some reason, we can't run SDG_scotland.py because of an issue when we use from main import stops_geo_df; I'm not sure why this is, and not actually sure what this import does. We are also currently importing the geo_df from local data (screenshot above again) while we figure out why it isn't able to get the file from the Google bucket. Aside from these two issues, all scripts will run with the data from the bucket :)

paigeh-fsa commented 10 months ago

The requirements of the ticket are now complete.

We have an end-to-end process where we can use either local data on a local machine or cloud data from our GCP bucket.

Ticket is now ready for review.

jwestw commented 8 months ago

@paigeh1 and I have been reviewing this and it has gone quite well, as we have made a lot of fixes that allow the system to run with no local data present, entirely relying on the cloud-hosted data. We have successfully run:

However, we are experiencing an error when running SDG_northern_ireland.py:

[image: error output]

And this is what my local folder looks like (not sure if these files have just been downloaded, but I think they have)

[image: contents of the local data folder]

paigeh-fsa commented 7 months ago

Next steps:

jwestw commented 7 months ago

I am re-running every pipeline after deleting not only the files but the folders in the data folder. This has created a lot of problems which I am solving.

Successfully run:

I have also made main.py into a runner which runs all pipelines in order.

I have also added a number of improvements that create folders if they don't exist.
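
Purely as an illustration of those two ideas together (the folder list and pipeline order below are placeholders rather than the repo's actual layout):

```python
import os
import runpy

# Make sure the expected data folders exist before any pipeline runs
for directory in ["data", "outputs"]:  # placeholder folder names
    os.makedirs(directory, exist_ok=True)

# Run the country pipelines in order, roughly as the main.py runner now does
for pipeline in ["SDG_scotland.py", "SDG_northern_ireland.py"]:  # illustrative order
    runpy.run_path(pipeline, run_name="__main__")
```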