Some places, such as us-dc, report their data in Excel spreadsheets. Add crawl and parse support for that.
Work-in-progress branch add-excel-parsing on master repo
There are some libraries that parse xlsx. It seemed simple to add, but at the moment it breaks on crawl -- the crawled file is not a valid xlsx file. This branch contains a test (marked with `only`) that demonstrates the problem: all the test does is crawl a local "fake source" .xlsx file, and then check that the file in the crawler cache can be parsed in the same way that the "fake source" can be:
```
$ git fetch upstream
$ git checkout -b add-excel-parsing upstream/add-excel-parsing
$ npm run test
... etc
crawled Excel file has same parseable content as source
sanity check of src sheets
Sandbox Found Architect project manifest, starting up
Created test cache /Users/jeff/Documents/Projects/li/zz-testing-fake-cache
Created test report dir /Users/jeff/Documents/Projects/li/zz-reports-dir
Wrote to local cache: /Users/jeff/Documents/Projects/li/zz-testing-fake-cache/excel-source/2020-08-12/2020-08-12t20_27_54.266z-default-59988.xlsx.gz
...
x Error: End of data reached (data length = 10043, asked index = 347979759). Corrupted zip ? (fail at: undefined)
```
If we can get this crawl method to work, we can get Excel crawls and scrapes in general to work.
Things tried to get crawl to work
The "crawl" method (src/events/crawler/crawler) actually calls src/http/get-get-normal/index.js to get the file. I've tried:

- setting the `Content-Type` in get-get-normal
- setting the content type in events/crawler/crawler/index.js (the `got` call)
Some other people ran into this trouble as well -- e.g. see https://github.com/SheetJS/sheetjs/issues/337.

A minimal repo demonstrating this is at https://github.com/covidatlas/arc-excel-downloading-trouble.