NYCPlanning / db-developments

🏠 🏘️ 🏗️ Developments Database
https://nycplanning.github.io/db-developments

BIS Deduplication First Step #503

Closed td928 closed 2 years ago

td928 commented 2 years ago

Addresses issue #502, which was spotted while ingesting the DOB NOW permit data. The BIS data ingestion pipeline is broken because the unique id is no longer unique. We determined that records with the '01' document number need to be further deduplicated by dobrundate.

dob_jobapplications.yml

This is the only real update compared to the other feature branch. A ROW_NUMBER() is run in the initial SQL query of the data library process to add a group id, gid, to the rows, so that records with the same job number are grouped and sorted by dobrundate. Note that dobrundate gets some additional processing so that it sorts properly: the original MM/DD/YYYY format from the source does not sort chronologically as text, and SQLite does not have strong date types according to this stackoverflow post. Also, since the data library SQL functionality does not support very complicated SQL queries (for more details and context on this data library limitation, check out this issue writeup), the filtering on gid will take place in ./sql/bis/_init.sql and will be incorporated in the #501 PR.
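To illustrate the idea, here is a minimal sketch of the gid step using Python's built-in sqlite3 module. It is not the query from dob_jobapplications.yml: the sample rows, the jobnumber and jobdocnumber column names, and the choice to sort newest first are all assumptions made for the example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical sample rows: (jobnumber, jobdocnumber, dobrundate as MM/DD/YYYY text).
ROWS = [
    ("100000001", "01", "12/31/2020"),
    ("100000001", "01", "01/15/2021"),
    ("100000002", "01", "03/02/2021"),
]

# Rebuild dobrundate as YYYY-MM-DD so that text ordering matches date ordering,
# then number the rows within each job number, newest run date first (assumed order).
QUERY = """
WITH normalized AS (
    SELECT
        jobnumber,
        dobrundate,
        substr(dobrundate, 7, 4) || '-' ||
        substr(dobrundate, 1, 2) || '-' ||
        substr(dobrundate, 4, 2) AS rundate_iso
    FROM dob_jobapplications
)
SELECT
    jobnumber,
    dobrundate,
    ROW_NUMBER() OVER (
        PARTITION BY jobnumber
        ORDER BY rundate_iso DESC
    ) AS gid
FROM normalized
ORDER BY jobnumber, gid
"""


def rank_rows(rows):
    """Load rows into an in-memory SQLite table and return (jobnumber, dobrundate, gid)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dob_jobapplications "
        "(jobnumber TEXT, jobdocnumber TEXT, dobrundate TEXT)"
    )
    conn.executemany("INSERT INTO dob_jobapplications VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


ranked = rank_rows(ROWS)
# The downstream filter (done in ./sql/bis/_init.sql in this PR's design) would keep gid = 1.
latest = [row for row in ranked if row[2] == 1]
```

Sorting on the raw MM/DD/YYYY text would place 12/31/2020 after 01/15/2021, which is why the sketch rebuilds the date as YYYY-MM-DD before ordering.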

testing

To test this functionality, a good exercise is to run the first step, ./bash/01_dataloading.sh, against a local PostGIS. The weekly mode option must be used to ensure dob_jobapplications.sql is the latest version.

sql/_status_q.sql

I don't know why sql/_status_q.sql is included in the files changed, because all of those changes are already in the other branch. It should be ignored, since I only pulled it in from the other branch to do some testing.

SashaWeinstein commented 2 years ago

Is there a way to get _status_q.sql out of the list of files changed? If it includes the most up to date work you could use git checkout to move the version of _status_q.sql from branch 502 to 500, then it would come out of the list of files changed

td928 commented 2 years ago

> Is there a way to get _status_q.sql out of the list of files changed? If it includes the most up to date work you could use git checkout to move the version of _status_q.sql from branch 502 to 500, then it would come out of the list of files changed

I think this is precisely what I did to get those changes, so I'm genuinely baffled why they are included in the list. Let me google whether there is anything I can do about this.

SashaWeinstein commented 2 years ago

If it's not annoying may I look over your shoulder while you find a fix? This is the sort of thing I need to get better about myself

abrieff commented 2 years ago

I'd guess that rebasing on the latest 500-DOB-etc and fixing the merge conflicts might do it?

SashaWeinstein commented 2 years ago

Andrew can you show us how you would do this lol

abrieff commented 2 years ago

Yeah, there's a couple of things I'd try, happy to talk through it.

abrieff commented 2 years ago

Try git rebase -i 500-DOB-Now-Permit-Ingestion locally, save the dialog that pops up (:wq), then force push back to this branch (git push -f). This will mess up other people's local branches, so to get your local version back in order if you're not whoever ran this, do git reset --hard origin/502-BIS-Ingestion-Refactor

td928 commented 2 years ago

> Try git rebase -i 500-DOB-Now-Permit-Ingestion locally, save the dialog that pops up (:wq), then force push back to this branch (git push -f). This will mess up other people's local branches, so to get your local version back in order if you're not whoever ran this, do git reset --hard origin/502-BIS-Ingestion-Refactor

Didn't get a lot of conflicts on rebase. Worked beautifully. Thanks @abrieff!