CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

Duplicate Persons in cdp-seattle instance #222

Open BrianL3 opened 1 year ago

BrianL3 commented 1 year ago

Describe the Bug

There are duplicate documents in the Person collection in the cdp-seattle instance. They have the same name and the same legistar person id. I think this may result in unexpected behavior in the front end.

Expected Behavior

There should be one and only one Person record per Legistar person.

Reproduction

Check out records for Tammy Morales: ID 87638bc6-fd68-4f1f-8449-6137ac242a8 ID 4bf88f27-9933-4c57-8819-342111a6a68c

Dan Strauss: 0996b93d-fdbb-41d4-b488-1dc85ac37366 2ff3c312-04cd-4e84-b46e-5989f239259

Mosqueda and Lewis also have dupes.
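
One way to check for this kind of duplication is to group person records by name and external (Legistar) ID and report any group with more than one document. The following is only a minimal sketch: it assumes the records have already been loaded as plain dicts with `id`, `name`, and `external_source_id` keys, and the function name and sample values are hypothetical, not real data from any CDP instance.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_persons(
    records: Iterable[dict],
) -> Dict[Tuple[str, str], List[str]]:
    """Group records by (normalized name, external id); keep groups with >1 doc."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for rec in records:
        # Normalize so that case, whitespace, and int/str differences
        # do not hide a duplicate.
        key = (rec["name"].strip().lower(), str(rec["external_source_id"]))
        groups[key].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records for illustration only.
sample = [
    {"id": "doc-1", "name": "Person A", "external_source_id": "100"},
    {"id": "doc-2", "name": "person a ", "external_source_id": 100},
    {"id": "doc-3", "name": "Person B", "external_source_id": "200"},
]
duplicates = find_duplicate_persons(sample)
print(duplicates)
```

An empty result would suggest the collection being inspected has no duplicates under this key, which is also a hint to double check which database the records came from.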

dphoria commented 1 year ago

I think there is a possibility that this ends up being an issue for cdp-scrapers. There is a mechanism in place there to handle situations like this, i.e. erratic/duplicate/etc. information entered by the municipalities/clerks.

So, looks like different IDs were entered for the same person. I haven't investigated this at all, but this is my guess for the time being.

dphoria commented 1 year ago

Wait, I'm confused by this issue. In the CDP Seattle instance DB, I see only 1 Tammy Morales. e.g. If I follow the quickstart example, and query the person collection, there is just 1 Tammy Morales.

import fireo
from cdp_backend.database import models as db_models
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

# Connect to the CDP Seattle Firestore project with anonymous credentials
fireo.connection(client=Client(
    project="cdp-seattle-21723dcf",
    credentials=AnonymousCredentials()
))

# Fetch every document in the Person collection
ppl = list(db_models.Person.collection.fetch())

# Print every person whose name contains "tammy"
for p in ppl:
    if 'tammy' in p.name.lower():
        print(p.name, p.external_source_id, p.id, p.key)

# Output:
# Tammy J. Morales 662 d1dbed7401e6 person/d1dbed7401e6

If this issue is saying that the Legistar endpoint for Seattle is returning multiple records for Tammy Morales (and others), that is known, unfortunately. We have a system in place on the scrapers side to at least help us deal with those situations. It's definitely possible that it's not working 100%, but if so, shouldn't I be able to see multiple Tammy Morales records when I execute the code snippet above?

I think I'm probably not looking at the same "database" that Brian used to get those IDs...

evamaxfield commented 1 year ago

Can also confirm from the database directly that there are not two people of the same name.

evamaxfield commented 1 year ago

Where did you get those IDs, btw? The IDs in the Firestore database are much, much shorter.
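
The mismatch in ID shape can be checked mechanically. This is a rough sketch only: it assumes the short IDs look like the 12-character hex string printed earlier in the thread (`d1dbed7401e6`) and treats anything with the five hyphen-separated hex groups of a UUID as "uuid-like". The final group is allowed to be shorter than 12 characters because some of the reported IDs appear truncated. The pattern names and `classify_id` are hypothetical helpers, not part of cdp-backend.

```python
import re

# Assumption: short ids are 12 lowercase hex characters, as in the output above.
SHORT_HEX_ID = re.compile(r"[0-9a-f]{12}")
# UUID-shaped: 8-4-4-4-(up to 12) hex characters separated by hyphens.
UUID_LIKE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{1,12}")


def classify_id(doc_id: str) -> str:
    """Return a coarse label describing the shape of a document id."""
    if SHORT_HEX_ID.fullmatch(doc_id):
        return "short-hex"
    if UUID_LIKE.fullmatch(doc_id):
        return "uuid-like"
    return "unknown"


print(classify_id("d1dbed7401e6"))
print(classify_id("4bf88f27-9933-4c57-8819-342111a6a68c"))
```

If every ID in a report falls into a different bucket than the IDs in the target database, the report most likely came from a different database or collection.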

BrianL3 commented 1 year ago

I’d link but I’m on my phone on a ten lane freeway in Texas. The firestore document PKs can be rather long? I grabbed these directly from the Seattle firestore console view of the DB. Maybe I was looking at dev?

On Thu, Nov 17, 2022 at 5:00 PM Eva Maxfield Brown @.***> wrote:

Where did you get those IDs btw? the IDs in the firestore database are much much shorter

dphoria commented 1 year ago

I think I'm gonna pull some events on the scraper and check out the ingestion model Persons. Will report back.

evamaxfield commented 1 year ago

I don't think it is the scraper, and I think you were checking staging (we should probably refresh the data on staging since it's a bit behind, I think).

I think it is just a minutes item / event minutes item ref that is broken somewhere. I will look into it this weekend. No worries.
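
A broken reference of that kind could be hunted down with a check along these lines. This is a sketch under stated assumptions, not the project's actual tooling: it assumes event minutes items have been exported as plain dicts with an `id` and a `minutes_item_ref` key holding the referenced minutes item ID. Those key names, the function name, and the sample values are hypothetical.

```python
from typing import Iterable, List


def find_dangling_refs(
    event_minutes_items: Iterable[dict],
    minutes_item_ids: Iterable[str],
) -> List[str]:
    """Return ids of event minutes items whose reference is missing or unknown."""
    known = set(minutes_item_ids)
    return [
        item["id"]
        for item in event_minutes_items
        # A missing key yields None, which is never in `known`.
        if item.get("minutes_item_ref") not in known
    ]


# Hypothetical data for illustration only.
sample_minutes_item_ids = ["mi-1", "mi-2"]
sample_event_minutes_items = [
    {"id": "emi-1", "minutes_item_ref": "mi-1"},
    {"id": "emi-2", "minutes_item_ref": "mi-9"},  # points at a nonexistent item
    {"id": "emi-3"},  # has no reference at all
]
dangling = find_dangling_refs(sample_event_minutes_items, sample_minutes_item_ids)
print(dangling)
```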