Configure Google Cloud Storage

callahantiff commented 3 years ago

TASK

Task Type: PKT DATA DELIVERY

Use Google Cloud Storage in build to store each build's downloaded data and output knowledge graphs

Data used for each build
Built KGs
Built Embeddings
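
As a rough illustration of the storage step, a minimal sketch of pushing one build's local output directory into a bucket might look like the following. It assumes the `google-cloud-storage` Python client is installed and credentials are configured; `upload_directory` and `get_bucket` are hypothetical helper names, not part of PheKnowLator.

```python
from pathlib import Path


def upload_directory(bucket, local_dir, prefix):
    """Upload every file under local_dir to bucket beneath the object prefix.

    `bucket` is expected to behave like google.cloud.storage.Bucket
    (blob(name) returns an object with upload_from_filename(path)).
    Returns the list of object names that were written.
    """
    root = Path(local_dir)
    uploaded = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Preserve the local sub-directory layout inside the bucket
            relative = path.relative_to(root).as_posix()
            object_name = f"{prefix.rstrip('/')}/{relative}"
            bucket.blob(object_name).upload_from_filename(str(path))
            uploaded.append(object_name)
    return uploaded


def get_bucket(bucket_name):
    """Return a bucket handle (hypothetical helper; bucket name is supplied by the caller)."""
    # Imported lazily so upload_directory can be tested without the client library
    from google.cloud import storage

    return storage.Client().bucket(bucket_name)
```

Keeping the bucket as a parameter means the same function can be exercised against a stand-in object during tests and against a real bucket during a build.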

TODO

[x] Create GCS Bucket for builds in GCP PheKnowLator bucket (details here)
[x] Apply Object Lifecycle Management to GCS bucket
[x] ~~Create script similar to google_cloud_storage_downloader.py that provides API access to Google Cloud Storage~~

Resources:

Python API Tutorials

callahantiff commented 3 years ago

This is what I am proposing for organizing the GCS bucket:

GCS bucket root/  
    |---- pheknowlator/
    |     |---- release_v.1.0/ ...
    |     |---- release_v.2.0/
    |     |     |---- *build_<<date>>/
    |     |     |     |---- data/
    |     |     |     |     |---- original_data/
    |     |     |     |     |---- processed_data/   
    |     |     |     |---- knowledge_graphs/  
    |     |     |     |     |---- subclass_builds/
    |     |     |     |     |     |---- relations_only/
    |     |     |     |     |     |     |---- owl/
    |     |     |     |     |     |     |---- owlnets/     
    |     |     |     |     |     |---- inverse_relations/
    |     |     |     |     |     |     |---- owl/
    |     |     |     |     |     |     |---- owlnets/     
    |     |     |     |     |---- instance_builds/
    |     |     |     |     |     |---- relations_only/
    |     |     |     |     |     |     |---- owl/
    |     |     |     |     |     |     |---- owlnets/     
    |     |     |     |     |     |---- inverse_relations/
    |     |     |     |     |     |     |---- owl/
    |     |     |     |     |     |     |---- owlnets/    
    |     |     |---- *build_<<date>>/ ...

For release_v.1.0 data, I will update it once this work is complete and add files from past builds so that I am no longer responsible for maintaining them via my Dropbox.

*meant to symbolize each monthly build
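
The layout above could be generated programmatically so that every build writes to the same set of prefixes. This is only a sketch: `build_prefixes` is a hypothetical helper, and the build label is passed in by the caller because the exact `<<date>>` format is not specified here.

```python
from itertools import product

BUILD_TYPES = ("subclass_builds", "instance_builds")
RELATION_TYPES = ("relations_only", "inverse_relations")
OUTPUT_TYPES = ("owl", "owlnets")


def build_prefixes(release, build_label):
    """Return the GCS object prefixes for one build, mirroring the proposed tree.

    `release` is e.g. "release_v.2.0"; `build_label` is the full build
    directory name (its date format is an assumption left to the caller).
    """
    root = f"pheknowlator/{release}/{build_label}"
    prefixes = [
        f"{root}/data/original_data/",
        f"{root}/data/processed_data/",
    ]
    # One prefix per (build type, relation type, output type) combination
    for build, relation, output in product(BUILD_TYPES, RELATION_TYPES, OUTPUT_TYPES):
        prefixes.append(f"{root}/knowledge_graphs/{build}/{relation}/{output}/")
    return prefixes


if __name__ == "__main__":
    # "build_2021-01-01" is a made-up label used only for illustration
    for prefix in build_prefixes("release_v.2.0", "build_2021-01-01"):
        print(prefix)
```

GCS has a flat namespace, so these "directories" are really just shared object-name prefixes; nothing needs to be created ahead of time.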

GCS Permissions Setting: I was thinking of setting the pheknowlator directory in the bucket, and everything beneath it, to the Nearline storage class to start. Once we know what the usage pattern is, we can adjust it.

@bill-baumgartner - What do you think about this plan?

bill-baumgartner commented 3 years ago

I like the directory structure. Do you expect to host other data aside from KG builds here? If not, then the knowledge_graph_builds/ directory is probably not required.

It looks as though we can specify the storage class on a per-object level, and we can use Object Lifecycle Management rules to change the storage class over time. So, a newly built KG could use the Standard Storage class initially, and then be downgraded to Nearline Storage after a period of time, e.g. 30 or 60 days.
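
A lifecycle rule along these lines could be written as a small JSON configuration. The sketch below only builds that document; `nearline_after` is a hypothetical helper, the 30-day threshold is just one of the example values mentioned, and whether the outer `"lifecycle"` wrapper is required depends on the tool used to apply it, so that should be checked against the GCS documentation.

```python
import json


def nearline_after(days):
    """Build a lifecycle configuration that moves Standard objects to Nearline.

    The rule fires once an object is at least `days` old and is still in the
    STANDARD storage class.
    """
    return {
        "lifecycle": {
            "rule": [
                {
                    "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                    "condition": {"age": days, "matchesStorageClass": ["STANDARD"]},
                }
            ]
        }
    }


if __name__ == "__main__":
    # Print the configuration so it can be saved to a file and applied to the bucket
    print(json.dumps(nearline_after(30), indent=2))
```

Restricting the condition to `STANDARD` objects keeps the rule from touching anything that was deliberately placed in a colder class.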

callahantiff commented 3 years ago

> I like the directory structure. Do you expect to host other data aside from KG builds here? If not, then the knowledge_graph_builds/ directory is probably not required.

I was thinking about that too. I'm guessing not. If we wanted to store a primary Docker container or something like that, it would likely be at the release level. I will modify the figure to remove the knowledge_graph_builds/ directory.

> It looks as though we can specify the storage class on a per-object level, and we can use Object Lifecycle Management rules to change the storage class over time. So, a newly built KG could use the Standard Storage class initially, and then be downgraded to Nearline Storage after a period of time, e.g. 30 or 60 days.

Awesome, this is perfect.

✔️ I will go ahead and get this set up now. It will allow me to start modifying/creating the code we will need to support the three-task build plan we discussed yesterday.

callahantiff / PheKnowLator

Configure Google Cloud Storage #70
