ISG-ICS / Wildfires

Duplicate Fires Crawled #23

Open Yicong-Huang opened 4 years ago

Yicong-Huang commented 4 years ago

There are multiple entries with the same fire name in the database. This is related to the fire data runnable.

@ScarlettZ98 can you check please?

ScarlettZ98 commented 4 years ago

What is the name of the fire please?

Yicong-Huang commented 4 years ago

from fire table:

Kincade
Burris
Getty
Kincade
Kincade
Kincade
Getty
Kincade
Kincade
Kincade
Kincade
Kincade
Contempo
Kincade
Palisades
Kincade
Tick
Kincade
Saddle ridge
Palisades
Palisades
Paradise west
Palisades
Saddle ridge
Caples
Caples
Franklintrail

from fire_history table:

Contempo
Burris
Getty
Getty
Tick
Real
Getty
Tick
Real
Paradise_West
Palisades
Kincade
Franklintrail
West
Wendy
Walker
W-1_Mcdonald
Ukonom
Taboose
Star
South
Schaeffer
Saddle_Ridge
Rosasco
Red_Bank
Red

from fire_merged table:

Getty
Contempo
Palisades
Kincade
Tick
Tick
Paradise west
Paradise west
Paradise west
Palisades
Franklintrail
Real
Real
Wendy
Saddle ridge
Dehesa
Caples
Briceburg
West
Lopez
Schaeffer
Mcmurray
Bautista
Rosasco
Jakes
Kidder 2

ScarlettZ98 commented 4 years ago

The fire table is expected to have records with the same name, since the id is the primary key. The fire_history table only uses the name and year of the fire, so duplicates don't matter there. The fire_merged table may also have records with the same name. It is only an issue if the ids in the fire table and the fire_merged table don't match. Names can be duplicated.
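The id check described above could be sketched roughly like this. It is only an illustration under assumptions: the two-column schema (`id`, `name`), the in-memory SQLite database, the sample rows, and the helper name `find_id_mismatches` are all hypothetical and not taken from the project.

```python
import sqlite3


def find_id_mismatches(conn):
    """Return ids that appear in only one of the fire / fire_merged tables."""
    rows = conn.execute(
        """
        SELECT id FROM fire WHERE id NOT IN (SELECT id FROM fire_merged)
        UNION
        SELECT id FROM fire_merged WHERE id NOT IN (SELECT id FROM fire)
        ORDER BY id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fire (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fire_merged (id INTEGER, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO fire VALUES (?, ?)",
    [(1, "Kincade"), (2, "Kincade"), (3, "Getty")],
)
conn.executemany(
    "INSERT INTO fire_merged VALUES (?, ?)",
    [(1, "Kincade"), (3, "Getty"), (4, "Paradise west")],
)

# Repeated names are ignored; only ids missing from one side are reported.
mismatches = find_id_mismatches(conn)
print(mismatches)
```

In this sketch the two Kincade rows are not flagged for sharing a name; only ids without a counterpart in the other table are returned.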

Yicong-Huang commented 4 years ago

Thanks. What about the fire ids in the fire_merged table, then? Shouldn't they be unique?

ScarlettZ98 commented 4 years ago

I just checked the table, and there is an issue: Paradise west was created multiple times. I will look into it this weekend.

Yicong-Huang commented 4 years ago

Thanks.

Is it hard to clean the data that is corrupted (duplicated)?

I assume we can just delete the corresponding records and then rerun the fixed crawler?
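A cleanup along those lines might look like the sketch below. It assumes that the duplicated records share the same id in fire_merged, which is not confirmed in this thread; the schema, the sample rows, and the helper name `delete_duplicated_ids` are hypothetical. It uses an in-memory SQLite database purely for illustration.

```python
import sqlite3


def delete_duplicated_ids(conn):
    """Delete every fire_merged row whose id occurs more than once.

    Returns the affected ids so the fixed crawler can recreate them.
    """
    dup_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM fire_merged GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
        )
    ]
    with conn:  # commits on success, rolls back if an error is raised
        conn.executemany(
            "DELETE FROM fire_merged WHERE id = ?", [(i,) for i in dup_ids]
        )
    return dup_ids


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fire_merged (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO fire_merged VALUES (?, ?)",
    [
        (10, "Paradise west"),
        (10, "Paradise west"),
        (10, "Paradise west"),
        (11, "Palisades"),
        (12, "Tick"),
        (12, "Tick"),
    ],
)

removed = delete_duplicated_ids(conn)
remaining = conn.execute("SELECT id, name FROM fire_merged").fetchall()
print(removed, remaining)
```

The returned ids are the ones that would need to be recrawled once the crawler is fixed.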

ScarlettZ98 commented 4 years ago

No. I will drop them and recrawl after fixing it. But it is hard for me to test the daily use of the crawler. Some issues didn't show up before because, when I tested it, the time separation was not that long.

Yicong-Huang commented 4 years ago

Maybe we can discuss the details of the strategy for merging fires? It seems that right now it is a static threshold on the number of days of separation?

ScarlettZ98 commented 4 years ago

Right now it is not. Every page on the gov website is a merged fire. The crawler crawls the website and gives each page an id, and the fire with that id is the merged fire.
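To make the page-to-id idea concrete, here is a minimal sketch. It is not the crawler's actual code: the `MergedFireRegistry` class, its `id_for` method, and the example URLs are hypothetical. It only illustrates one way a page could keep a single id across repeated crawls, so that crawling the same page on a later day does not create the merged fire again.

```python
from itertools import count


class MergedFireRegistry:
    """Maps each crawled page URL to exactly one merged-fire id (hypothetical)."""

    def __init__(self):
        self._ids = {}          # page URL -> merged-fire id
        self._next = count(1)   # source of new ids

    def id_for(self, page_url):
        # Reuse the id if this page was seen before; otherwise assign a new one.
        if page_url not in self._ids:
            self._ids[page_url] = next(self._next)
        return self._ids[page_url]


registry = MergedFireRegistry()
first = registry.id_for("https://example.invalid/fires/paradise-west")
other = registry.id_for("https://example.invalid/fires/kincade")
again = registry.id_for("https://example.invalid/fires/paradise-west")
print(first, other, again)
```

In this sketch the second visit to the same page returns the id assigned on the first visit, while a different page gets a new id.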

Yicong-Huang commented 4 years ago

Maybe it's better to have an F2F discussion?

ScarlettZ98 commented 4 years ago

Yes, but I don't have time today. I can do it tomorrow.

Yicong-Huang commented 4 years ago

Not urgent. Let's move the discussion to Slack, and please schedule a meeting with me if possible.

Yicong-Huang commented 4 years ago

Any updates?