saylorsd closed 8 years ago
Looks like the PLI is in binary format for some reason so that should be fixed.
We don't need everything, just a few sample rows. You can pull those out with this awk script:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' your_file.txt
or this Perl script (if you have Perl):
perl -ne 'print if (rand() < .01)' your_file.txt
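If neither awk nor Perl is handy, roughly the same sampling can be sketched in Python. This is only an illustration of the awk one-liner above: the function name `sample_lines` and its `rate` and `seed` parameters are made up for this example, and it assumes the first line is a header worth keeping.

```python
import random


def sample_lines(lines, rate=0.01, seed=None):
    """Yield the first line plus roughly `rate` of the other non-empty lines.

    Hypothetical helper mirroring the awk one-liner: empty lines are skipped,
    the first line (assumed to be a header) is kept when it is non-empty, and
    every other line is kept with probability `rate`.
    """
    rng = random.Random(seed)
    for index, line in enumerate(lines):
        # Skip lines that contain nothing but a line terminator.
        if line.rstrip("\r\n") == "":
            continue
        if index == 0 or rng.random() <= rate:
            yield line


# Example usage on an in-memory list; a file object opened in text mode
# with the right encoding would work the same way.
rows = ["id\tname\n", "1\ta\n", "\n", "2\tb\n"]
print(list(sample_lines(rows, rate=1.0)))
```

Passing a fixed `seed` makes the sample repeatable, which may be useful if the sample files end up checked in as test fixtures.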
I pulled samples from the files - no problem.
However, it seems pli_violations is encoded in UTF-8-BOM. Could the BOM be what's causing GitHub to read it as a binary file? Also, the file is tab-delimited. If it's the encoding, I could just manually change it and we can use that for test purposes. I'm only hesitant to do so because then it would vary from what the city provided.
Nope, that's perfect. Leave it as close as possible to the provided stuff.
@saylorsd are we confident that the city always produces files with BOM?
@bsmithgall, for the PLI violations data, yes. However, they may use different vendor products to produce other data so I can't be sure for anything else.
I think the file is actually utf-16 encoded...
That's possible. I took Notepad++'s listing as the gospel truth on it, but after a short search it seems that it's merely a suggestion.
I just opened it in Windows Notepad, and noticed it started with 
, which according to this post indicates it's ISO-8859-1.
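One way to settle the encoding question without relying on an editor's label is to look at the first few bytes of the file and compare them against the known byte order marks. The sketch below uses the BOM constants from Python's standard `codecs` module; the helper names `detect_bom` and `sniff_file` are invented for this example. A result of `None` only means no BOM was found, not that the encoding is known.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(data):
    """Return a Python codec name if `data` starts with a known BOM, else None."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None


def sniff_file(path):
    """Read the first four bytes of the file at `path` and check them for a BOM."""
    with open(path, "rb") as handle:
        return detect_bom(handle.read(4))


# Example usage with in-memory bytes instead of a real file.
print(detect_bom(codecs.BOM_UTF8 + b"col_a\tcol_b"))
print(detect_bom("col_a\tcol_b".encode("utf-16")))
```

The returned codec names (`utf-8-sig`, `utf-16`, `utf-32`) consume the BOM when decoding, so the file on disk can stay exactly as the city provided it while the reader handles the marker.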
It looks like most of the old CSVs that we pulled have been cleaned out from the folder I pull them from. I'll have to look into why.
However, I was able to get sample files for three of our bigger CSV jobs.