cagov / data-infrastructure

CalData infrastructure
https://cagov.github.io/data-infrastructure
MIT License

Dept list: domains #136

Open alannawil opened 1 year ago

alannawil commented 1 year ago

This one is a bit of a pain but was flagged as a should-have, so people can both (1) find the format for how they should email people and (2) find the website - which generally map to each other but sometimes don't

Next steps: I can finish hand-coding this so it's ready to merge in? Or do you have any flags that make you worried about this @ian-r-rose, since it will require some ongoing monitoring?

ian-r-rose commented 1 year ago

Source: master list of domains from CDT, but no mapping;

Is this posted anywhere, or was it in a personal email? Unfortunately, I don't think there is a good way to do an exhaustive DNS listing for all ca.gov subdomains (though perhaps someone like @aaronhans has ideas?).

alannawil commented 1 year ago

There is a gated portal, which we can access - though I might need to ping them for a login, and I actually have a video of the convo where I got a walkthrough bc I knew I would not be able to really process it haha - here starting around 10 mins. (the first 10 mins is an overview of the backend of ca.gov)

alannawil commented 1 year ago

emailed CDT a couple days ago, they'll get back or I will ping them if they don't by next week - they've been very collaborative so far (though it's been a while since we touched in - my b :p)

alannawil commented 1 year ago

@ian-r-rose @britt-allen lmk if you wanna touch in on follow-ups here - i think we can see how bad the potential mismatch between organizations and domain in the current domain report is as a starting point? Seems like they are willing and able to integrate coding into this domains report but don't want to create a mess before validating this is the path we want to go down

the o365 path sounds perhaps more accurate but the pathway less clear .. (and you miss out on google orgs like ours, and unclear how many of those are at the state- though don't think it's a lot)

ian-r-rose commented 1 year ago

@ian-r-rose @britt-allen lmk if you wanna touch in on follow-ups here - i think we can see how bad the potential mismatch between organizations and domain in the current domain report is as a starting point? Seems like they are willing and able to integrate coding into this domains report but don't want to create a mess before validating this is the path we want to go down

I'm not sure how we would join these, other than a very manual process with some possible fuzzy string matching (I believe you have tried this before?). But we can certainly give it a shot as a starting point!
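
For reference, a minimal sketch of what that fuzzy string matching could look like, using only Python's standard-library `difflib`. The two name lists and the `best_match` helper are assumptions made up for illustration (they are not taken from the CDT reports or the DOF list), and any matches would still need a manual review pass.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical inputs: the same organizations spelled differently in two lists.
official_names = [
    "California Department of Technology",
    "Office of Data and Innovation",
    "California Health and Human Services Agency",
]
report_names = [
    "California Dept of Technology",
    "Office of Data & Innovation",
    "Secretary of State",
]


def best_match(name: str, candidates: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest candidate by difflib similarity ratio, or None."""
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


matches = {name: best_match(name, official_names) for name in report_names}
for report_name, official in matches.items():
    print(f"{report_name!r} -> {official!r}")
```

Names with no sufficiently close candidate come back as `None`, which gives a natural worklist for the manual part of the process.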

the o365 path sounds perhaps more accurate but the pathway less clear .. (and you miss out on google orgs like ours, and unclear how many of those are at the state- though don't think it's a lot)

Yeah, I agree. I think your reticence to touch actual personnel directories is well-founded.

alannawil commented 1 year ago

Was thinking more of just a spot check, because one of the issues he raised was that some organizations run other domains, so even their internal organization-name-to-domain mapping might be off. But I just looked through the reports and don't think this actually exists anywhere? Each report basically only has a domain and then a contact, so I can see some of those examples, like where someone at DCHS runs CHHS's domain - but nowhere does it even say by name that this is California Health and Human Services, unless I am missing a report - the "all organizations" report is just a list.

alannawil commented 1 year ago

hmmm in terms of next steps here I would like to follow up with CDT team but still semi struggling here, think our key issues are

Think we need to work with CDT to figure out what makes the most sense. They do seem willing to map domains, and regardless of whether those are email extensions, that seems like a worthy effort and something that should be integrated into their processes. But I want to make sure it's the right ask, that we can help them, and that it's actually sustainable

I also think we do want what is in ca.gov to be mapped even if roughly to DOF codes so that we can leverage their API to feed in updates to them, but perhaps that's a separate issue tho w same folks

alannawil commented 1 year ago

CDT has some new motivation to work on this because of a legislative request -> they also gave us a draft list! It has some names (I think they used the DOF list values I fed to them, but need to check). Chatted with Art on 4/7 - rough notes are here

The conversation Alanna and Ian had after the convo:

  • The reason I keep on harping on the many-to-many relationships is that I think it will be important to emphasize in the specific asks we make (both of CDT and SCO). If we don’t do that, I think people will feel pressured to make a 1-1 mapping, and choose a “best fit”, rather than a more correct ontology. e.g., Art being interested in having a “primary contact” because he really cares about who to phone up when it comes to renewals
  • Thinking a bit more about it: it’s obviously many-to-many: Your example of state prisons is a good one for “many entities to one domain”. ODI is a fine example of many domains to one entity (we have digital.ca.gov, innovation.ca.gov, and also all of the CalInnovate projects: drought, covid etc)
  • ODI is an interesting example - like there might be a distinction between “this is the organization's domain” and “these are the domains that they own”. I think another good example is CDT itself - they have state.ca.gov and cdt.ca.gov
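
The many-to-many point above can be made concrete with a small sketch. The ODI and CDT rows use only the domains named in this conversation; the "state prison" rows use made-up entity names and a made-up shared domain as placeholders, since no specific prison domain is given here.

```python
from collections import defaultdict

# (entity, domain) pairs: one row per relationship, like a bridge table.
links = [
    ("ODI", "digital.ca.gov"),
    ("ODI", "innovation.ca.gov"),
    ("CDT", "state.ca.gov"),
    ("CDT", "cdt.ca.gov"),
    ("State Prison A", "prisons.example"),  # hypothetical placeholder
    ("State Prison B", "prisons.example"),  # hypothetical placeholder
]

domains_by_entity = defaultdict(set)
entities_by_domain = defaultdict(set)
for entity, domain in links:
    domains_by_entity[entity].add(domain)
    entities_by_domain[domain].add(entity)

# Neither direction can be collapsed to a single "best fit" without dropping rows.
multi_domain_entities = sorted(e for e, d in domains_by_entity.items() if len(d) > 1)
shared_domains = sorted(d for d, e in entities_by_domain.items() if len(e) > 1)
print(multi_domain_entities)
print(shared_domains)
```

Keeping the pairs as rows, rather than a single "domain" column on the entity or a single "owner" column on the domain, is what lets both directions be answered.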

Note - this also spurred a conversation about cities/counties and other government entities within CA - I have followed up with Secretary of State to see if this is easy.

Next steps I think?

ian-r-rose commented 1 year ago

I'm jotting down some pseudo-code for reasoning about the relationship between domains and state entities. Because this is a many-to-many relationship, we are essentially going to have two main "noun" tables, then one bridge table per relationship between them. I'm going to roughly follow the syntax of the Python SQLModel package, but it's still pseudo-code, and shouldn't be expected to run :)

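from datetime import datetime
from typing import List, Optional

from sqlmodel import Field, Relationship, SQLModel

# Bridge tables are defined first so the noun tables can use them as link models.
class StateEntityDomainOwnership(SQLModel, table=True):
    business_unit_code: str = Field(foreign_key="stateentity.business_unit_code", primary_key=True)
    domain: str = Field(foreign_key="webdomain.domain", primary_key=True)
    valid_from: datetime
    valid_to: Optional[datetime] = None  # left empty while the relationship is current

class DomainStateEntityEmailRepresentation(SQLModel, table=True):
    business_unit_code: str = Field(foreign_key="stateentity.business_unit_code", primary_key=True)
    domain: str = Field(foreign_key="webdomain.domain", primary_key=True)
    valid_from: datetime
    valid_to: Optional[datetime] = None

class DomainStateEntityWebsiteRepresentation(SQLModel, table=True):
    business_unit_code: str = Field(foreign_key="stateentity.business_unit_code", primary_key=True)
    domain: str = Field(foreign_key="webdomain.domain", primary_key=True)
    valid_from: datetime
    valid_to: Optional[datetime] = None

class StateEntity(SQLModel, table=True):
    business_unit_code: str = Field(primary_key=True)
    name: str
    level: str  # meant to be an Enum; its allowed values aren't pinned down yet
    # Top-level entities have no parent.
    parent_entity: Optional[str] = Field(default=None, foreign_key="stateentity.business_unit_code")
    domains: List["WebDomain"] = Relationship(back_populates="owners", link_model=StateEntityDomainOwnership)
    email_representation: List["WebDomain"] = Relationship(link_model=DomainStateEntityEmailRepresentation)
    website_representation: List["WebDomain"] = Relationship(link_model=DomainStateEntityWebsiteRepresentation)

class WebDomain(SQLModel, table=True):
    domain: str = Field(primary_key=True)
    contact_email: Optional[str] = None
    active: bool
    # Many-to-many: a domain can have more than one owning entity.
    owners: List[StateEntity] = Relationship(back_populates="domains", link_model=StateEntityDomainOwnership)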
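
A small, runnable illustration of how the `valid_from` / `valid_to` columns on a bridge table could be used, written with plain dataclasses rather than SQLModel so it has no dependencies. The `OwnershipLink` class, the `owners_on` helper, the business unit codes, the `example.ca.gov` domain, and the dates are all hypothetical. Treating a missing `valid_to` as "still current" and using dates rather than datetimes are also assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class OwnershipLink:
    business_unit_code: str
    domain: str
    valid_from: date
    valid_to: Optional[date] = None  # assumption: None means still current


def owners_on(links: list[OwnershipLink], domain: str, when: date) -> list[str]:
    """Business unit codes whose ownership of `domain` covers the date `when`."""
    return [
        link.business_unit_code
        for link in links
        if link.domain == domain
        and link.valid_from <= when
        and (link.valid_to is None or when < link.valid_to)
    ]


# Hypothetical history: a domain handed from one unit to another.
links = [
    OwnershipLink("0001", "example.ca.gov", date(2015, 1, 1), date(2020, 7, 1)),
    OwnershipLink("0002", "example.ca.gov", date(2020, 7, 1)),
]

print(owners_on(links, "example.ca.gov", date(2018, 3, 1)))
print(owners_on(links, "example.ca.gov", date(2023, 3, 1)))
```

The interval is treated as half-open (`valid_from` inclusive, `valid_to` exclusive), so a handover date belongs to the new owner only.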

alannawil commented 1 year ago

@ian-r-rose dropped some notes on the first two pages here, as the start of a potential collaborative doc w Art - feel free to drop thoughts or let me know if it'd be helpful to touch in quickly

ian-r-rose commented 1 year ago

Your breakdown of follow-up tasks looks good to me @alannawil, I added a few minor comments and edits, but I think it's a good work plan.

It sounds to me like the next things we could propose are (this is just recapping discussions we've already had, just trying to be explicit):

  1. Ask CDT to add new fields for BU code to both the state entity website and the domains site. It's okay if they are mostly empty for the moment, we (or an intern) can get it started.
  2. Produce a draft list for CDT based on DOF's draft excel, host it (unlisted) on open data. After our initial population of BU codes in the state entity profile we can ask questions like:
    • Which entities are missing a profile?
    • Which profiles don't seem to correspond to an "official" entity (according to DOF)?

Hopefully if we have some compelling answers to the above that can be a carrot to get CDT to engage further.
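
Once BU codes exist on both sides, the two questions above reduce to set differences. A minimal sketch, assuming the DOF list and the CDT profiles can each be loaded as a mapping keyed by BU code; every code and name below is made up for illustration.

```python
# Hypothetical data keyed by business unit (BU) code.
dof_entities = {
    "1000": "Entity A",
    "2000": "Entity B",
    "3000": "Entity C",
}
cdt_profiles = {
    "1000": "Entity A profile",
    "3000": "Entity C profile",
    "9999": "Profile with no matching DOF entity",
}

# Which entities are missing a profile?
missing_profile = sorted(dof_entities.keys() - cdt_profiles.keys())

# Which profiles don't correspond to an "official" entity (according to DOF)?
unofficial_profiles = sorted(cdt_profiles.keys() - dof_entities.keys())

print(missing_profile)
print(unofficial_profiles)
```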

alannawil commented 1 year ago

nice thanks! Edited and updated with these steps and shared with Art - if don't see him pop in there will send him a note in a day or two!

alannawil commented 1 year ago

ah - moved it over to this doc too - https://docs.google.com/document/d/1Aovh9QXTM6OdHs-1Kyt3As7a1-VPtu6Ip7a8uIVwwso/edit?usp=sharing