agrc / ugs-db

Database seeder for UGS data

Update wqp #37

Closed · steveoh closed this 9 years ago

steveoh commented 9 years ago

I'd like an early preview review, please.

stdavis commented 9 years ago

Looks like the query for new dates may be off a little. I ran the command three times in a row this morning and got this: [screenshot of the command output omitted]

stdavis commented 9 years ago

This looks like it's headed in the right direction to me, other than the possibly duplicate data I mention above. Although I couldn't find the duplicate data in my database, so maybe it's just the script's reporting that's incorrect?

steveoh commented 9 years ago

The first time you ran it, new dates were added, so the next run's query was updated with the most current date. The script should not be run twice in the same day since, as you noticed, you will get that day's results. What were you expecting? Consistent numbers for all three runs? Or an eventual 0?
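
Roughly, the date logic works like this sketch (the `Results` table, `SampleDate` column, and sqlite3 connection are placeholders for illustration, not the actual ugs-db code):

```python
# Illustrative only: table/column names and sqlite3 are assumptions,
# not the real ugs-db schema or driver.
import sqlite3
from datetime import datetime, date

def get_update_window(db_path):
    """Return (start, end) dates for the next WQP request.

    The start date is the most recent sample date already stored
    (inclusive), which is why a second run on the same day re-requests
    that day's results.
    """
    with sqlite3.connect(db_path) as conn:
        row = conn.execute('SELECT MAX(SampleDate) FROM Results').fetchone()

    start = (datetime.strptime(row[0], '%Y-%m-%d').date()
             if row and row[0] else date(1900, 1, 1))

    return start, date.today()
```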

steveoh commented 9 years ago

refs #2

steveoh commented 9 years ago

Give this a once-over, @stdavis.

stdavis commented 9 years ago

Looks like there are some failing tests.

steveoh commented 9 years ago

I think this is ready. Your call, @stdavis.

stdavis commented 9 years ago

Just a couple more comments/questions. I think that it's pretty close.

steveoh commented 9 years ago

Since that diff is outdated, here is the comment now.

I think we do need to do something here.

Two scenarios I mapped out.

1. You get 123 and 123_wqx where 123 is already in the database.

   I think in this situation we would insert a duplicate station.

2. You get 123 and 123_wqx where neither is in the database.

   I think we insert the correct one.

If we strip the _wqx beforehand, we fix scenario 1, but then we would insert the non-wqx station in scenario 2, since extract_stations_by_id would extract the non-wqx station from the response.

Can you confirm this so I feel better about it?
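
To make the scenarios concrete, here is a rough sketch of the normalization I'm describing (the helper names and the set-of-existing-ids input are made up for illustration; the real extract_stations_by_id may behave differently):

```python
# Rough sketch of the 123 vs 123_wqx scenarios; helper names and inputs
# are illustrative, not the actual ugs-db implementation.
def normalize(station_id):
    """Treat '123' and '123_WQX' as the same station."""
    return station_id[:-4] if station_id.upper().endswith('_WQX') else station_id

def stations_to_insert(response_ids, existing_ids):
    """Return the station ids from the WQP response that should be inserted."""
    existing = {normalize(sid) for sid in existing_ids}
    seen = set()
    to_insert = []

    for station_id in response_ids:
        key = normalize(station_id)
        if key in existing or key in seen:
            # Scenario 1: '123' is already in the database, so skip
            # '123_wqx' instead of inserting a duplicate station.
            continue
        # Scenario 2: neither variant is stored yet; the first variant we
        # encounter wins, which is the open question above about which one
        # extract_stations_by_id would pull out of the response.
        seen.add(key)
        to_insert.append(station_id)

    return to_insert
```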

steveoh commented 9 years ago

I think you should do a similar check for existing data in results (using the SampleId field?) to the one you do for stations above, to prevent duplicate data from polluting the database. I would expect that an update process should be able to run more than once per day without negative effects.

Promoting this as well.
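
Something along these lines is what I have in mind for results (the table and column names are placeholders, and it assumes SampleId identifies a row well enough to de-dupe, which is the open question):

```python
# Sketch of the SampleId check; schema names and sqlite3 are placeholders,
# and it assumes SampleId is unique enough to identify already-loaded rows.
import sqlite3

def filter_new_results(db_path, rows):
    """Drop result rows whose SampleId is already stored, so the update
    can run more than once per day without inserting duplicate data."""
    with sqlite3.connect(db_path) as conn:
        existing = {r[0] for r in conn.execute('SELECT SampleId FROM Results')}

    return [row for row in rows if row['SampleId'] not in existing]
```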

stdavis commented 9 years ago

> Since that diff is outdated, here is the comment now.
>
> I think we do need to do something here.
>
> Two scenarios I mapped out.
>
> 1. You get 123 and 123_wqx where 123 is already in the database.
>
>    I think in this situation we would insert a duplicate station.
>
> 2. You get 123 and 123_wqx where neither is in the database.
>
>    I think we insert the correct one.
>
> If we strip the _wqx beforehand, we fix scenario 1, but then we would insert the non-wqx station in scenario 2, since extract_stations_by_id would extract the non-wqx station from the response.
>
> Can you confirm this so I feel better about it?

I think that's correct, but I would like to talk through it with you.

steveoh commented 9 years ago

I think this is done. There is no unique field in results to check for uniqueness, so when this is ready let's :shipit:

stdavis commented 9 years ago

Couldn't you check for sampleID to see if the batch has already been inserted? I think that would be enough to prevent duplicate data.


steveoh commented 9 years ago

I would ask Paul first

inkenbrandt commented 9 years ago

SampleID should take care of that. It should be closely tied to the date of the sample.

steveoh commented 9 years ago

:confetti_ball: I think it should be ready, @stdavis :mag: :man: :mag_right: