NYCPlanning / db-developments

🏠 🏘️ 🏗️ Developments Database
https://nycplanning.github.io/db-developments
8 stars 2 forks source link

Enhancement 22Q2: A2 Jobs #549

Closed td928 closed 1 year ago

td928 commented 2 years ago

Housing Team found that some A2 jobs can bring about units change. Goal is to include those A2 jobs in HDB along with the A1 jobs.

Outstanding question:

td928 commented 2 years ago

BIS filter:

Permit Type = A2 AND Job Desc. Contains "COMBIN*" AND Sprinkler is blank (new additional field to pull)

Additional Requirement:

NOW:

Housing needs to come up with logic. No work to be done for now.

SashaWeinstein commented 1 year ago

Sorry I know I've asked this before, so getting it in writing is probably necessary. What is the difference between an A1 and A2 job?

SashaWeinstein commented 1 year ago

The first place that A2 jobs are filtered out is actually on the upload to data library in this line where a SQL query is run on the socrata dataset. I'm don't like this design. First off all we have the same filter in sql/bis/_init.sql, which makes more sense conceptually. I think keeping data library as a store for "raw" information is more in line with our other pipelines. Wanted to get your opinion @td928 before I went forward with any changes

td928 commented 1 year ago

The first place that A2 jobs are filtered out is actually on the upload to data library in this line where a SQL query is run on the socrata dataset. I'm don't like this design. First off all we have the same filter in sql/bis/_init.sql, which makes more sense conceptually. I think keeping data library as a store for "raw" information is more in line with our other pipelines. Wanted to get your opinion @td928 before I went forward with any changes

lol my argument for it would be but it is so much fun! What else are we going to use our data library github action on if not for this kind of fancy footwork?

On a more serious note though, I do think it is probably saving us by reducing number of things we store on DO. It also affords us some control over some upstream source data without making change directly in data library. I am for keeping this design at least for now.

SashaWeinstein commented 1 year ago

Aren't you worried it's a big pattern break though? It's the only place I know of that we use data library templates to apply a filter, and it took me a while to find the filter there as I didn't think to look. Would be nice to save the next person time by not having to look there. We have the same filter in the devDB code itself too so it's not like we need the filter in the template. If we are worried ab DO space we can just delete the dozens of builds associated with feature branches that got merged in, this is just one dataset

SashaWeinstein commented 1 year ago

Like right now to make this improvement I would need to change the filter

          jobtype LIKE '%A1%' OR
          jobtype LIKE '%DM%' OR
          jobtype LIKE '%NB%'

in both the template and sql/bin/_init.sql. It's a good improvement to clean up the code this improvement only needs to be made in one place

td928 commented 1 year ago

I don't know if it is that bad to have it both places. For sure it increases the chances something might go wrong in two places. But I felt like it is helpful to think about the filter in build and in the ingest/geocode step. Also this might be one dataset but it is for sure a big one and job applications are pulled every week see here plus the geocoded files. Also we do use that create the row number which will only get filtered down to one record in the build. I would prefer if this could be accomplished all in the template step but failed to do so recalling this pr

SashaWeinstein commented 1 year ago

I think it's definitely bad to have it in both places? Isn't DRY one of our important principles? Having the filter in both places makes this upgrade harder to implement. I'm not sure I follow the point about the row numbers and the de-duping, but if there's a good reason to keep the logic in the template then I can proceed from there. I think a better solution here might be google big query because we are saving so much redundant data right how. What we really need is a streaming solution, google big query is one we already use. Does that seem right to you?

td928 commented 1 year ago

Yes. But this is also ONE upgrade harder to implement. Think about the big picture taking this kind of templates out of all repo leads to basically every change we make to the ingestion on data library means we will need to 1)update the docker image for data library 2) update the library archive github actions 3) then run github actions in the repo and check if the data coming out this entire process is updated. Sure we have all templates in data library repo and we keep data library image updated this way but I think we lost a lot of flexibilities and make the maintenance of the tool much bigger part of our work.

SashaWeinstein commented 1 year ago

wait I think we may have gotten signals crossed I'm not talking about moving the templates out of the pipeline repo right now. The more immediate issue is DRYing out the pipeline to make sure the filter for the correct sort of job only lives in one place. Although if we were going to keep A2 jobs in a separate table then maybe this line

AND jobtype ~* 'A1|DM|NB'

should stay where it is in sql/bin/_init.sql.

Are there A2 jobs in dob now as well as bis? If not then making the A2 table can happen in sql/bin/_init.sql right.