Closed td928 closed 2 years ago
BIS filter:
Permit Type = A2 AND Job Desc. Contains "COMBIN*" AND Sprinkler is blank (new additional field to pull)
Additional Requirement:
NOW:
Housing needs to come up with logic. No work to be done for now.
Sorry I know I've asked this before, so getting it in writing is probably necessary. What is the difference between an A1 and A2 job?
The first place that A2 jobs are filtered out is actually on the upload to data library in this line where a SQL query is run on the socrata dataset. I'm don't like this design. First off all we have the same filter in sql/bis/_init.sql, which makes more sense conceptually. I think keeping data library as a store for "raw" information is more in line with our other pipelines. Wanted to get your opinion @td928 before I went forward with any changes
The first place that A2 jobs are filtered out is actually on the upload to data library in this line where a SQL query is run on the socrata dataset. I'm don't like this design. First off all we have the same filter in sql/bis/_init.sql, which makes more sense conceptually. I think keeping data library as a store for "raw" information is more in line with our other pipelines. Wanted to get your opinion @td928 before I went forward with any changes
lol my argument for it would be but it is so much fun! What else are we going to use our data library github action on if not for this kind of fancy footwork?
On a more serious note though, I do think it is probably saving us by reducing number of things we store on DO. It also affords us some control over some upstream source data without making change directly in data library. I am for keeping this design at least for now.
Aren't you worried it's a big pattern break though? It's the only place I know of that we use data library templates to apply a filter, and it took me a while to find the filter there as I didn't think to look. Would be nice to save the next person time by not having to look there. We have the same filter in the devDB code itself too so it's not like we need the filter in the template. If we are worried ab DO space we can just delete the dozens of builds associated with feature branches that got merged in, this is just one dataset
Like right now to make this improvement I would need to change the filter
jobtype LIKE '%A1%' OR
jobtype LIKE '%DM%' OR
jobtype LIKE '%NB%'
in both the template and sql/bin/_init.sql
. It's a good improvement to clean up the code this improvement only needs to be made in one place
I don't know if it is that bad to have it both places. For sure it increases the chances something might go wrong in two places. But I felt like it is helpful to think about the filter in build and in the ingest/geocode step. Also this might be one dataset but it is for sure a big one and job applications are pulled every week see here plus the geocoded files. Also we do use that create the row number which will only get filtered down to one record in the build. I would prefer if this could be accomplished all in the template step but failed to do so recalling this pr
I think it's definitely bad to have it in both places? Isn't DRY one of our important principles? Having the filter in both places makes this upgrade harder to implement. I'm not sure I follow the point about the row numbers and the de-duping, but if there's a good reason to keep the logic in the template then I can proceed from there. I think a better solution here might be google big query because we are saving so much redundant data right how. What we really need is a streaming solution, google big query is one we already use. Does that seem right to you?
Yes. But this is also ONE upgrade harder to implement. Think about the big picture taking this kind of templates out of all repo leads to basically every change we make to the ingestion on data library means we will need to 1)update the docker image for data library 2) update the library archive github actions 3) then run github actions in the repo and check if the data coming out this entire process is updated. Sure we have all templates in data library repo and we keep data library image updated this way but I think we lost a lot of flexibilities and make the maintenance of the tool much bigger part of our work.
wait I think we may have gotten signals crossed I'm not talking about moving the templates out of the pipeline repo right now. The more immediate issue is DRYing out the pipeline to make sure the filter for the correct sort of job only lives in one place. Although if we were going to keep A2 jobs in a separate table then maybe this line
AND jobtype ~* 'A1|DM|NB'
should stay where it is in sql/bin/_init.sql
.
Are there A2 jobs in dob now as well as bis? If not then making the A2 table can happen in sql/bin/_init.sql
right.
Housing Team found that some A2 jobs can bring about units change. Goal is to include those A2 jobs in HDB along with the A1 jobs.
Outstanding question:
What are filters to apply now we are including A2 for both the bis and now side of things.
would the current processing for CO automatically work this process? Or is additional logics required to bring on the units from COs?