CityOfNewYork / CROL-Overview

City Record Online parsing libraries and supporting files

Document how to verify data source #28

Closed cds-amal closed 9 years ago

cds-amal commented 9 years ago

I don't understand how to verify that our data is correct. This should be documented. For example, Citywide Administrative Services public hearings has 17/20 malformed messages -- is this correct? How do we reconcile this?

For example, the following is one of the 17 malformed messages.

NOTICE IS HEREBY GIVEN THAT A REAL PROPERTY ACQUISITIONS AND DISPOSITIONS PUBLIC HEARING

mikaelmh commented 9 years ago

Checking the validity of a Data Entry

  1. Use the RequestID to find the start and due date of the ad.
  2. Go to http://www.nyc.gov/html/dcas/html/about/cityrecord_editions.shtml and select a city record with a publication date between the start and due date.
  3. Find the relevant ad using keyword search.
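
Step 2 above amounts to a date-window filter, which can be sketched in Python. This is only an illustration: the function name `editions_in_window` and the dates are hypothetical and are not taken from the DCAS data or the City Record site.

```python
from datetime import date


def editions_in_window(edition_dates, start, due):
    """Return the publication dates that fall within [start, due], sorted."""
    return sorted(d for d in edition_dates if start <= d <= due)


# Hypothetical publication dates and ad window, for illustration only.
editions = [
    date(2014, 10, 29),
    date(2014, 10, 30),
    date(2014, 11, 3),
    date(2014, 11, 10),
]
candidates = editions_in_window(
    editions, start=date(2014, 10, 30), due=date(2014, 11, 5)
)
print(candidates)
```

Any edition in `candidates` is a reasonable place to start the keyword search from step 3.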

I followed the above process for RequestID 20141030107, one of the DCAS bid ads that had a problem, and found an entry that had limited data in the data dump.

image

My theory is that this is linked to the fact that these ads are both created and published by DCAS themselves. Perhaps the final data is saved somewhere else? I'll contact them.

cds-amal commented 9 years ago

@mikaelmh1, please comment on #32.

  1. I would like to create a table with links that would take us to the original source for verification, or at least to a landing page where we can do further queries. This should make the process easier. Thoughts?
  2. Is it possible to get DCAS to add an HTML link to the original (PDF) data in the CSV data dumps?

cds-amal commented 9 years ago

@mikaelmh1, this issue is blocking my progress. Somewhere upstream, the data became corrupt. My understanding is that DCAS gave us data dumps from an internal process that may or may not be sourced from PDF scraping. Can we reach out to them for assistance?

mikaelmh commented 9 years ago

Hmm, this is indeed a problem. I'm wondering whether it is a conversion issue (the file was given to us as a CSV) or a problem with the export we are getting from the DB. Did you check out the other CSV, the old DB dump export for 1 month? If those are mostly working, the problem might be the export dump we got, and we could ask for the original .NET dump and convert it again ourselves. I'll check it out tonight. Let me know if you get to it before that.

As for the DB dump, the data and schema we have come from the actual DCAS database structure, not from PDFs (though the schema is used to generate new PDFs).

We will definitely follow up with DCAS on this. Do you have any other questions at the moment?

On Wed, Apr 1, 2015 at 10:33 PM, kiddle notifications@github.com wrote:

@mikaelmh1, this issue is blocking my progress. Somewhere upstream, the data became corrupt. My understanding is that DCAS gave us data dumps from an internal process that may or may not be sourced from PDF scraping. Can we reach out to them for assistance?

cds-amal commented 9 years ago

  1. Null entries are apparently valid, as this is part of DCAS' workflow. There is enough information to access the PDF document from DCAS' PDF repository.
  2. Incomplete entries remain a mystery. Examine the original DCAS dump to see whether the issue is upstream or downstream of us getting the data. See #34.
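
To compare the original DCAS dump against our converted copy, one option is to count null, incomplete, and complete rows in each file and see where the numbers diverge. The following is a minimal sketch under stated assumptions: the column names (`RequestID`, `StartDate`, `AdditionalDescription`) and the sample rows are made up for illustration and may not match the real dump.

```python
import csv
import io


def classify_row(row, required):
    """Label a row 'null', 'incomplete', or 'complete' based on the required fields."""
    filled = [bool((row.get(name) or "").strip()) for name in required]
    if not any(filled):
        return "null"
    if not all(filled):
        return "incomplete"
    return "complete"


def summarize(csv_text, required):
    """Count how many rows of a CSV fall into each category."""
    counts = {"null": 0, "incomplete": 0, "complete": 0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[classify_row(row, required)] += 1
    return counts


# Hypothetical column names and rows, not taken from the real DCAS dump.
REQUIRED = ["StartDate", "AdditionalDescription"]
sample = """RequestID,StartDate,AdditionalDescription
A1,2014-10-30,Full text of the notice
A2,,
A3,2014-10-30,
"""
print(summarize(sample, REQUIRED))
```

Running `summarize` on both the original dump and the converted CSV would show whether incomplete rows already exist upstream or only appear after conversion.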