WSWCWaterDataExchange / WaDE-Side-Projects

A repository of short queries and projects to answer basic questions of WaDE and WestDAAT.

Classify water right owners receiving water from Reclamation #2

Open amabdallah opened 1 year ago

amabdallah commented 1 year ago

There is a way to classify water right owners receiving water from Reclamation by matching water rights owners in WaDE against the Reclamation project water users in their database here: https://www.usbr.gov/projects/

I was able to manually find that these three water right owners in WaDE are listed under the Strawberry Valley Project: https://westdaat.westernstateswater.org/details/right/UTwr_WR52679 https://westdaat.westernstateswater.org/details/right/UTwr_WR68320 https://westdaat.westernstateswater.org/details/right/UTwr_WR65783

Pending Reclamation giving us a machine-readable format of their projects database (I just asked them), let's plan on adding another Python script that tries to find a match for every Reclamation water user among the WaDE water rights owners. If there is a match, we classify that water right owner in WaDE as a Reclamation Water Receiver.
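As a rough sketch of that matching step (the file and column names below are just placeholders, and plain exact matching will miss messy cases), something like this could be the starting point:

import pandas as pd

# Placeholder file and column names: WaDE water rights owners vs. Reclamation project water users.
wade = pd.read_csv("wade_allocation_owners.csv")   # column: AllocationOwner
usbr = pd.read_csv("usbr_contractors.csv")         # column: Organization

def normalize(name: str) -> str:
    # Uppercase and collapse whitespace so trivial formatting differences don't block a match.
    return " ".join(str(name).upper().split())

usbr_names = {normalize(n) for n in usbr["Organization"]}

# Tag WaDE owners whose normalized name appears in the Reclamation contractor list.
wade["Classification"] = wade["AllocationOwner"].apply(
    lambda n: "Reclamation Water Receiver" if normalize(n) in usbr_names else ""
)

wade.to_csv("wade_owners_classified.csv", index=False)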

I think the classification needs to be different from the regular Reclamation owner name. In the end, users could select one or both classifications and get either or both datasets. That would be super cool!

amabdallah commented 1 year ago

So Reclamation does not have a machine-readable copy of this database :( We need to scrape the Organizations listed under Contacts for each dam. I see that they have 585 dams accessible through this ~"API". For example, the Strawberry Valley Project lists Strawberry Water Users, Springville Irrigation District, etc. For now, we just need their names. https://www.usbr.gov/projects/index.php?id=419

For the sake of loving to try ChatGPT, here is a place to start. I think it may need tweaking to get the right HTML tags and to format the result into a CSV file.

Prompt: Write me Python code that iterates over this website id from 1 to 585 https://www.usbr.gov/projects/index.php?id=465 and reads the Organizations under the Contact header for each dam. Print the dam name and its Organizations.

Answer: Sure! Here's the Python code to iterate over the website IDs from 1 to 585, scrape the Organizations under the Contact header for each dam, and print the dam name and its organizations. We'll be using the requests and BeautifulSoup libraries in Python to accomplish this task.

import requests
from bs4 import BeautifulSoup

# Loop over the 585 project/dam pages on the USBR site.
for i in range(1, 586):
    url = f"https://www.usbr.gov/projects/index.php?id={i}"
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # NOTE: the class name and tag structure below are ChatGPT's guess and
        # likely need tweaking to match the actual USBR page markup.
        contacts_div = soup.find('div', {'class': 'contact-info'})

        if contacts_div:
            contacts_header = contacts_div.find('h3', string='Contact')
            if contacts_header:
                # Organizations are assumed to be an unordered list right after the header.
                orgs_list = contacts_header.find_next_sibling('ul')
                if orgs_list:
                    orgs = orgs_list.find_all('li')
                    org_names = [org.text.strip() for org in orgs]
                    print(f"Dam {i} Organizations: {org_names}")
                else:
                    print(f"No Organizations found for Dam {i}")
            else:
                print(f"No Contact header found for Dam {i}")
        else:
            print(f"No Contact info found for Dam {i}")
    else:
        print(f"Failed to retrieve data for Dam {i}. Status code: {response.status_code}")

The code above iterates over the website IDs from 1 to 585 using a for loop. For each ID, it constructs the URL and makes a request to it using the requests library. Then, it uses the BeautifulSoup library to parse the HTML content of the response.

Next, the code looks for a div with class contact-info, finds the h3 Contact header inside it, and takes the ul list that follows that header. It uses a list comprehension to extract the text of each li tag into a list of organization names.

Finally, the code prints the dam ID and its organizations, or a message saying which element was missing, using print() statements.
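Since the end goal is a CSV rather than printed output, a minimal sketch of the write-out step could look like this (the rows shown are the Strawberry Valley Project example from above, and the output file name is a placeholder):

import csv

# Example rows collected by the scraping loop above: (dam/project id, organization name).
rows = [
    (419, "Strawberry Water Users"),
    (419, "Springville Irrigation District"),
]

# One organization per row keeps the file easy to join against WaDE owners later.
with open("usbr_contractors.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["DamID", "Organization"])
    writer.writerows(rows)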

amabdallah commented 1 year ago

These links are relevant as a reference https://www.usgs.gov/centers/fort-collins-science-center/science/us-bureau-reclamation

Western United States showing location of Bureau of Reclamation projects : (projects completed and under construction) https://www.loc.gov/resource/g4051c.ct011656/?r=-0.042,0.381,0.908,0.554,0

rwjam commented 1 year ago

I was able to web scrape the owner name and state of USBR contractors from that USBR website and save the results in a csv. However, I'm not sure how well we can use these to match to existing WaDE data. When I tried an exact match I only got 55 records, and a partial match only got 63 records.

Further inspecting the data by hand I caught a few instances that will be hard to match.

| AllocationOwner (in WaDE) | Contractor Name (from USBR) | Issue |
| --- | --- | --- |
| BITTER ROOT IRRIGATION DISTRICT (MT) | Bitter Root Project Irrigation District (MT) | The inclusion of "Project" in the contractor name prevented an exact or partial match from working |
| BRIDGER VALLEY WATER CONSERVANCY DISTRICT (UT) | BRIDGER VALLEY WATER CONSERVANCY DISTRICT (WY), BRIDGER VALLEY WATER CONSERVANCY DISTRICT (MT) | The allocation owner name in WaDE is noted in UT, but the contractor names from USBR are noted in WY & MT |

I tried doing a partial match both ways (checking AllocationOwner -in- Contractor Names, and again checking Contractor Names -in- AllocationOwner) with similarly poor results.
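One possible way to catch cases like the Bitter Root example is a fuzzy, normalized comparison instead of exact or substring matching; this sketch uses Python's standard difflib, and the normalization rule (dropping "Project") is just an assumption to tune:

from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Uppercase, drop the filler word "PROJECT", and collapse whitespace.
    tokens = str(name).upper().replace(",", " ").split()
    return " ".join(t for t in tokens if t != "PROJECT")

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# The Bitter Root case from the table above: exact/substring matching fails,
# but the fuzzy score is high enough to flag it for manual review.
wade_owner = "BITTER ROOT IRRIGATION DISTRICT"
usbr_contractor = "Bitter Root Project Irrigation District"
print(f"{similarity(wade_owner, usbr_contractor):.2f}")  # close to 1.00 after dropping "Project"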

amabdallah commented 1 year ago

So the good news is that after removing any name that includes "Office" (i.e., a Reclamation office related to the project), there are 201 contractors left. The exact match of 55 is a big first step, but it needs manual verification to make sure Case #2 doesn't happen. For the remaining 150 contractors, we need more manual matching informed by the address/location.
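For reference, the "Office" filter described above could look something like this (file and column names are placeholders carried over from the earlier sketches):

import pandas as pd

# Placeholder file/column names for the scraped contractor list.
usbr = pd.read_csv("usbr_contractors.csv")

# Drop Reclamation office entries (any organization name containing "Office").
contractors = usbr[~usbr["Organization"].str.contains("Office", na=False)]

print(len(contractors))  # should land near the 201 contractors mentioned above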

I think this is a good update to WaterSmart for this quarter. During the next quarter, we'll find a way to finish the rest. Perhaps we hire an intern.

amabdallah commented 1 year ago

@rwjam

When you have some time, can you add this header (see screenshot) as another column to the extracted data? I want it here too https://docs.google.com/spreadsheets/d/14kVRLyPtz-AhDk3i2QfyMEFPEjfecYbeBFM15jr63Qc/edit?usp=sharing

[screenshot of the header to add]