covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Add Excel parsing (have broken work-in-progress branch) #564

Closed jzohrab closed 3 years ago

jzohrab commented 3 years ago

Description.

Some places, such as us-dc, report their data in Excel spreadsheets. Add crawl and parse support for that.

Work-in-progress branch add-excel-parsing on master repo

There are some libraries that parse xlsx. It seemed simple to add, but at the moment it breaks on crawl -- the crawled file is not a valid xlsx file. This branch contains a test (marked with only) that demonstrates the problem: the test crawls a local "fake source" .xlsx file, then checks that the file in the crawler-cache can be parsed in the same way as the "fake source":

$ git fetch upstream
$ git checkout -b add-excel-parsing upstream/add-excel-parsing
$ npm run test

... etc
  crawled Excel file has same parseable content as source

    sanity check of src sheets
    Sandbox Found Architect project manifest, starting up
    Created test cache /Users/jeff/Documents/Projects/li/zz-testing-fake-cache
    Created test report dir /Users/jeff/Documents/Projects/li/zz-reports-dir
    Wrote to local cache: /Users/jeff/Documents/Projects/li/zz-testing-fake-cache/excel-source/2020-08-12/2020-08-12t20_27_54.266z-default-59988.xlsx.gz
...

    x Error: End of data reached (data length = 10043, asked index = 347979759). Corrupted zip ? (fail at: undefined)

If we can get this crawl method to work, we can get Excel crawls and scrapes in general to work.
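
For anyone trying to narrow this down, here is a rough diagnostic sketch using only Node built-ins. It is not code from the branch. It checks whether the gunzipped cache entry still begins with the ZIP signature (an .xlsx file is a ZIP container) and whether its bytes match the source. The helper names looksLikeZip and compareWithCache are made up for illustration.

```javascript
const zlib = require('zlib')

// ZIP local file header signature "PK\x03\x04"; a normal .xlsx file begins with it.
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04])

// True if the buffer begins with the ZIP signature.
function looksLikeZip (buf) {
  return buf.length >= 4 && buf.subarray(0, 4).equals(ZIP_SIGNATURE)
}

// Hypothetical helper: gunzip a cache entry and compare it with the source bytes.
function compareWithCache (sourceBytes, gzippedCacheBytes) {
  const cached = zlib.gunzipSync(gzippedCacheBytes)
  return {
    sourceIsZip: looksLikeZip(sourceBytes),
    cachedIsZip: looksLikeZip(cached),
    sameLength: cached.length === sourceBytes.length,
    identical: cached.equals(sourceBytes)
  }
}
```

Both arguments would be Buffers, e.g. read with fs.readFileSync(path) and no encoding argument. If sameLength is false, the bytes were altered somewhere between the download and the cache write, before any Excel parsing happens.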

Things tried to get crawl to work

The "crawl" method (src/events/crawler/crawler) actually calls src/http/get-get-normal/index.js to get the file. I've tried:

Some other people ran into this trouble as well -- e.g. see https://github.com/SheetJS/sheetjs/issues/337.
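
One common way a binary download turns into an invalid zip is that the bytes get decoded as text somewhere along the way. This is a guess, not a confirmed diagnosis of this branch. The sketch below shows that a UTF-8 round trip changes binary bytes while a base64 round trip preserves them. The response shape uses the Lambda proxy-style isBase64Encoded field. Whether the HTTP function here should respond this way is an assumption, and binaryResponse and decodeBody are hypothetical names.

```javascript
// Sample bytes: ZIP signature followed by bytes that are not valid UTF-8.
const original = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe, 0x80, 0x00])

// Lossy: invalid UTF-8 sequences are replaced with U+FFFD when decoding.
function viaUtf8 (buf) {
  return Buffer.from(buf.toString('utf8'), 'utf8')
}

// Lossless: base64 can carry arbitrary bytes inside a string body.
function viaBase64 (buf) {
  return Buffer.from(buf.toString('base64'), 'base64')
}

// Hypothetical handler response that carries binary content as base64.
function binaryResponse (buf) {
  return {
    statusCode: 200,
    headers: {
      'content-type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    },
    isBase64Encoded: true,
    body: buf.toString('base64')
  }
}

// Hypothetical caller side: turn the response body back into bytes.
function decodeBody (response) {
  return response.isBase64Encoded
    ? Buffer.from(response.body, 'base64')
    : Buffer.from(response.body, 'utf8')
}
```

Here viaUtf8(original) no longer equals original, while viaBase64(original) and decodeBody(binaryResponse(original)) do. If the crawled file differs in length from the source, a text-decoding step like this is one place to look.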

A minimal repo

... demonstrating this is at https://github.com/covidatlas/arc-excel-downloading-trouble.

jzohrab commented 3 years ago

Done and merged!