Nonprofit-Open-Data-Collective / irs-990-data-issue-tracker

A place to aggregate questions about IRS 990 data access, documentation, meta-data, and inconsistencies or errors. This is NOT a forum for questions on analyzing the data. Contributors are volunteer experts, not IRS personnel.
https://nonprofit-open-data-collective.github.io/irs-990-data-issue-tracker/

Files uploaded from 2021 on are zipped using proprietary compression algorithm Deflate64 #2

Open HFAwesomeCharts opened 1 year ago

HFAwesomeCharts commented 1 year ago

Files uploaded from 2021 on are zipped using a new proprietary compression algorithm. The IRS is now using Deflate64, which Python's built-in zipfile module doesn't support out of the box. (It's the default on the new Windows OS, so the IRS may not even be aware that it's proprietary.) If you're using Python, you have to install a package that adds Deflate64 support before you can automatically download and unzip the files.
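To check whether a given archive is affected, one option is to read the compression method recorded for each member. The sketch below is illustrative only: the helper name `deflate64_members` and the file name in the usage note are made up for this example. It relies on the standard library being able to list an archive's contents even when it cannot decompress them, and on method ID 9 being the value the ZIP format assigns to Deflate64.

```python
import zipfile

# Compression method ID that the ZIP format uses for Deflate64.
DEFLATE64_METHOD = 9


def deflate64_members(zip_path):
    """Return the names of archive members compressed with Deflate64.

    Only the archive's directory is read, so this works even when the
    installed Python cannot decompress the members themselves.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if info.compress_type == DEFLATE64_METHOD
        ]
```

Calling `deflate64_members("download.zip")` on a hypothetical downloaded file returns an empty list when the standard `zipfile` module should be able to extract everything on its own.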

lecy commented 1 year ago

Any tips on which Deflate64 program to use?

https://pypi.org/project/zipfile-deflate64/ ??

It worked fine in R, but I am on a Windows machine, so it might be the OS, not the program, that mattered.

HFAwesomeCharts commented 1 year ago

Yes, that is the one I installed. I used the command "pip install zipfile-deflate64" to install it. I am on a Windows machine, too, though...
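For anyone trying this, here is a minimal sketch of how the package can be used once installed. It assumes, going by the package's description, that `zipfile_deflate64` re-exports the standard `zipfile` API with Deflate64 support added. The helper name `extract_all` and the paths in the usage note are hypothetical, and the fallback import is only there so the script still runs (for ordinary archives) when the package is missing.

```python
from pathlib import Path

try:
    # Assumption: the package re-exports the zipfile API with Deflate64 support.
    import zipfile_deflate64 as zipfile
except ImportError:
    # Fallback: the standard library, which handles ordinary ZIPs only.
    import zipfile


def extract_all(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

A call such as `extract_all("download.zip", "unzipped")` would then unpack the archive into the `unzipped` folder. Without the package, a Deflate64 archive is expected to fail with `NotImplementedError` at the extraction step.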

HFAwesomeCharts commented 1 year ago

Tagging @tone711 to make sure I'm answering this correctly...

mfdgit2 commented 1 year ago

Microsoft's System.IO.Compression doesn't support Deflate64, so we are unzipping them from the command line. This is inconvenient but not a show-stopper. If the ZIPs were kept under 2 GB, they could use the normal ZIP compression.
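The comment above doesn't say which command-line tool is used. As one possible approach, the sketch below shells out to 7-Zip from Python. Everything specific here is an assumption: that 7-Zip is installed, that its executable is on the PATH as `7z`, and the helper names themselves. The `x`, `-o` and `-y` arguments are 7-Zip's extract-with-paths command, output-directory switch and assume-yes switch.

```python
import shutil
import subprocess


def build_7z_command(zip_path, dest_dir, exe="7z"):
    """Build the argument list for extracting zip_path into dest_dir with 7-Zip."""
    # "x" extracts with full paths, "-o<dir>" sets the output folder
    # (no space after -o), and "-y" answers yes to any prompts.
    return [exe, "x", str(zip_path), f"-o{dest_dir}", "-y"]


def extract_with_7z(zip_path, dest_dir, exe="7z"):
    """Run 7-Zip on the archive; raises if the tool is missing or fails."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} was not found on the PATH")
    subprocess.run(build_7z_command(zip_path, dest_dir, exe), check=True)
```

Keeping the command construction separate from the call makes it easy to swap in a different tool or executable name without touching the rest of a download script.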

tone711 commented 1 year ago

@HFAwesomeCharts this looks great. If neither Python nor .NET (System.IO.Compression) supports Deflate64 out of the box, that shows how far-reaching the issue is.