UrbanCCD-UChicago / plenario

API for geospatial and time aggregation across multiple open datasets.
http://plenar.io
MIT License

hide datasets from API that failed to load #172

Closed derekeder closed 9 years ago

derekeder commented 9 years ago

Some contributed datasets fail part way through the ETL process for one reason or another. The table is created, but no data is loaded. Right now, the API does not filter out these failed datasets from being queried.

The /v1/api/datasets API should be updated to hide these empty datasets.

Example: http://plenar.io/explore#detail/dataset_name=farmers_markets_2013&obs_date__le=2015%2F01%2F01&obs_date__ge=2010%2F01%2F01&agg=week&resolution=500
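A minimal sketch of the proposed filter, not Plenario's actual code: before the datasets endpoint lists a dataset, check that its table exists and holds at least one row. The function name `visible_datasets` is hypothetical, and an in-memory SQLite database stands in for the real database purely to keep the example self-contained.

```python
import sqlite3


def visible_datasets(conn, dataset_names):
    """Return only the datasets whose table exists and holds at least one row."""
    visible = []
    for name in dataset_names:
        # Skip datasets whose table was never created.
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (name,),
        ).fetchone()
        if not exists:
            continue
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"{}"'.format(name.replace('"', '""'))
        has_rows = conn.execute(
            "SELECT 1 FROM {} LIMIT 1".format(quoted)
        ).fetchone()
        if has_rows:
            visible.append(name)
    return visible


# Assumed example data: one loaded dataset, one whose ETL failed after table creation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (id INTEGER, obs_date TEXT)")
conn.execute("INSERT INTO crimes VALUES (1, '2014-06-01')")
conn.execute("CREATE TABLE farmers_markets_2013 (id INTEGER, obs_date TEXT)")

print(visible_datasets(conn, ["crimes", "farmers_markets_2013"]))  # ['crimes']
```

`SELECT 1 ... LIMIT 1` is used instead of `COUNT(*)` so the check stays cheap on large tables.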

Pinkalicious commented 9 years ago

Hi Derek,

I uploaded a dataset and the ETL failed partway through. I was not given a reason; I only got a gateway error. For reference, I was adding: https://ckannet-storage.commondatastorage.googleapis.com/2015-01-17T23:58:27.729Z/2009-2013-crime-statistics-hampton.csv

After the failure, when I tried to upload it again, I was told that a dataset with this URL already exists, but I couldn't locate the dataset on Plenar.io.

I believe the ETL should be more fault-tolerant and provide error messages at every step. What is the recommended way to handle errors?

Tanu
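One possible shape for the error handling asked about above, as a sketch only: run each ETL step inside a try/except and record which step failed and why, so the contributor can be shown a message instead of a bare gateway error. The names `run_etl`, `status_store`, and the step names are hypothetical and are not taken from Plenario.

```python
import traceback


def run_etl(dataset_name, steps, status_store):
    """Run ETL steps in order, recording which step failed and why."""
    status_store[dataset_name] = {"status": "started", "error": None}
    for step_name, step in steps:
        try:
            step()
        except Exception as exc:
            # Keep a short message for the user and the full traceback for developers.
            status_store[dataset_name] = {
                "status": "failed",
                "failed_step": step_name,
                "error": "{}: {}".format(type(exc).__name__, exc),
                "traceback": traceback.format_exc(),
            }
            return False
    status_store[dataset_name] = {"status": "loaded", "error": None}
    return True


# Assumed example steps: the download succeeds, the load fails.
def download():
    pass


def load_rows():
    raise ValueError("bad date in row 17")


statuses = {}
ok = run_etl(
    "hampton_crime",
    [("download", download), ("load_rows", load_rows)],
    statuses,
)
print(ok, statuses["hampton_crime"]["failed_step"])
```

A stored status like this could also drive the dataset listing: anything not marked "loaded" would be hidden from the API, and a failed upload could be retried rather than rejected as a duplicate URL.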


derekeder commented 9 years ago

@Pinkalicious I'm unable to find the dataset on Plenar.io. Did you delete it after it failed?

When taking it through the first step in the contribute page, the dataset does not look to have been loaded: http://plenar.io/contribute?dataset_url=https%3A%2F%2Fckannet-storage.commondatastorage.googleapis.com%2F2015-01-17T23%3A58%3A27.729Z%2F2009-2013-crime-statistics-hampton.csv+

Could you provide a screenshot of the error next time you encounter it?

Pinkalicious commented 9 years ago

No, I did not delete it. It was never uploaded, because I got a gateway error.

Sure. I wish I had taken a screenshot.

Let's discuss a formal process for error tracking next week.

Tanu


derekeder commented 9 years ago

If it happened yesterday, I have a suspicion that the Gateway error was due to a huge query that @lucaluca was trying to execute. It was likely pegging the database pretty heavily, which could cause a Gateway timeout when trying to run other queries.

Mind trying it again?

Pinkalicious commented 9 years ago

I did not upload it the day before yesterday, but sometime early this week.

I wish I could reproduce those results for you.

So I tried uploading other datasets to Plenar.io. Let me say a few things:

  1. First of all, I cannot readily find spatio-temporal datasets in CSV format through Google. Most are compressed, and there is currently no way to upload compressed CSVs to Plenario. How hard would it be to relax this?
  2. I did find some spatial datasets and several time-series datasets. One of the spatial datasets is http://www.unitedstateszipcodes.org/zip_code_sample_commercial.csv

I believe one should be able to add this, but it has no observation date. Why is an observation date mandatory? Why can't we add this data with today's date as the observation date, since that is the date "I" observed it?

  3. Finally, I tried the smallest spatio-temporal dataset I could find, which is available here: http://www1.ncdc.noaa.gov/pub/data/cdo/samples/PRECIP_HLY_sample_csv.csv

I faced a gateway error again. Attached is the screenshot.

When I try to upload it again by fetching its details, it says that the dataset exists. Attached is the screenshot.

I look forward to discussing these with you.
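On the compressed-CSV question in point 1, a rough sketch of how gzip-compressed input could be accepted, using only the Python standard library. This is an illustration under assumptions, not how Plenario reads files: the helper name `open_csv` and the sample columns are made up, and only gzip is handled (not zip archives).

```python
import csv
import gzip
import io


def open_csv(raw_bytes):
    """Return a csv.DictReader over raw bytes, transparently handling gzip."""
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if raw_bytes[:2] == b"\x1f\x8b":
        raw_bytes = gzip.decompress(raw_bytes)
    return csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))


# Assumed sample data; the same rows are read from plain and gzipped input.
plain = b"station,obs_date,precip\nA1,2010-01-01,0.2\n"
compressed = gzip.compress(plain)

rows_plain = list(open_csv(plain))
rows_gz = list(open_csv(compressed))
print(rows_gz[0]["station"], rows_plain == rows_gz)
```

Detecting the format from the leading bytes rather than the file extension means a URL that does not end in `.gz` would still be handled.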


derekeder commented 9 years ago

Thanks for providing these details, Tanu. Let's discuss these issues on the call this morning.
