WPRDC / wprdc-etl

MIT License

added first sample CSVs #12

Closed saylorsd closed 8 years ago

saylorsd commented 8 years ago

It looks like most of the old CSVs that we pulled have been cleaned out from the folder I pull them from. I'll have to look into why.

However, I was able to get sample files for three of our bigger CSV jobs.

bsmithgall commented 8 years ago

Looks like the PLI is in binary format for some reason so that should be fixed.

We don't need everything, just a few sample rows. You can pull those out with this awk script:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' your_file.txt

or this Perl script (if you have Perl):

perl -ne 'print if (rand() < .01)' your_file.txt
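Since the ETL code itself is Python, the same idea can be sketched there too. This is only a rough equivalent of the awk version (keep the first line as the header, skip empty lines, keep roughly 1% of the rest); the function name `sample_lines` and its `rate` and `seed` parameters are made up for illustration and are not part of wprdc-etl.

```python
import random


def sample_lines(lines, rate=0.01, seed=None):
    """Return the first line plus a random ~rate fraction of the other lines.

    Hypothetical helper: mirrors the awk one-liner, not any project API.
    """
    rng = random.Random(seed)  # pass a seed for reproducible samples
    sampled = []
    for index, line in enumerate(lines):
        if line.rstrip("\r\n") == "":
            continue  # skip empty lines, like !/^$/ in the awk version
        if index == 0 or rng.random() <= rate:
            sampled.append(line)  # index 0 is the header row (FNR==1)
    return sampled
```

For a file with a BOM or a non-UTF-8 encoding, the lines would need to be read with the right `encoding=` first, or in binary mode, so the sample stays byte-for-byte what the city provided.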

saylorsd commented 8 years ago

I pulled samples from the files - no problem.

However, it seems pli_violations is encoded in UTF-8-BOM. Could it be possible that the BOM is what's causing GitHub to read it as a binary file? Also, the file is tab-delimited. If it's the encoding, I could just manually change it and we can use that for test purposes. I'm only hesitant to do so because then it would vary from what the city provided.

bsmithgall commented 8 years ago

Nope, that's perfect. Leave it as close as possible to the provided stuff.

bsmithgall commented 8 years ago

@saylorsd are we confident that the city always produces files with BOM?

saylorsd commented 8 years ago

@bsmithgall, for the PLI violations data, yes. However, they may use different vendor products to produce other data so I can't be sure for anything else.

bsmithgall commented 8 years ago

ugh

bsmithgall commented 8 years ago

I think the file is actually utf-16 encoded...

saylorsd commented 8 years ago

That's possible. I took Notepad++'s listing as the gospel truth on it, but after a short search it seems that it's merely a suggestion.

saylorsd commented 8 years ago

I just opened it in Windows Notepad, and noticed it started with , which according to this post indicates it's ISO-8859-1.
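One way to take the editors out of the equation would be to look at the first few bytes of the file directly. The snippet below is a minimal sketch, not something already in the repo: `sniff_bom` is a hypothetical helper name, and it only recognizes the standard UTF-8, UTF-16, and UTF-32 byte order marks from Python's `codecs` module. A file with no BOM returns `None`, which says nothing about its actual encoding.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE mark begins with the same
# two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a Python codec name if raw starts with a known BOM, else None.

    Hypothetical helper for illustration only.
    """
    for bom, codec_name in BOMS:
        if raw.startswith(bom):
            return codec_name
    return None
```

Reading the first four bytes in binary mode (`open(path, "rb").read(4)`) and passing them to `sniff_bom` would be enough. The returned codec names consume the BOM when decoding, so the sample file on disk could stay exactly as the city provided it.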