cal-itp / data-analyses

Place for sharing quick reports, and works in progress
https://analysis.calitp.org
26 stars 5 forks source link

Research Task - fix rt intermediate files #536

Closed edasmalchi closed 1 year ago

edasmalchi commented 1 year ago

Research Task

Attempted to use rt_filter_map_plot.from_gcs() and realized that while the rt_trips files were overwritten in #531, stop_delay_views and segment_speed_views weren't yet.

Tried using the script Tiffany came up with for those folders, but realized too late that since those are geoparquets I should have swapped pd for gpd and things are a little broken now.

Luckily we have versioning enabled, working now to temporarily restore the old versions in those two folders and re-do the rename script using geopandas so they remain geospatial data.

edasmalchi commented 1 year ago

done for stop_delay_views, segment_speed_views will regenerate automatically as needed

code for reference in case of future jams:

# Shell command to log all filenames and generation codes

# gsutil ls -a gs://calitp-analytics-data/data-analyses/rt_delay/stop_delay_views >> sdv.log
# Remove lines from top of file that don't have an actual filename as well as blank newline at bottom

with open('sdv.log') as file:
    lines = [line.rstrip() for line in file]

storage_client = storage.Client(project='cal-itp-data-infra')
from google.cloud import storage
​
​
def copy_file_archived_generation(
        bucket_name, blob_name, destination_bucket_name, destination_blob_name, generation
):
    """Copies a blob from one bucket to another with a new name with the same generation."""
​
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)
​
    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name, source_generation=generation
    )
​
    print(
        "Generation {} of the blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.generation,
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
​
for line in lines:
    blob_gen = '/'.join(line.split('/')[3:])
    blob = blob_gen.split('#')[0]
    gen = int(blob_gen.split('#')[1])

    if blob.split('/')[-1].split('_')[1][:4] != '2022': # restore old naming convention files
        print(blob + ' restore')
        copy_file_archived_generation('calitp-analytics-data', blob, 'calitp-analytics-data', blob, gen)
        time.sleep(1)

    elif blob.split('/')[-1].split('_')[1][:4] == '2022': # remove already renamed but misformatted files
        print(blob + ' rm')
        fs.rm(f'gs://calitp-analytics-data/{blob}')
        time.sleep(.5)