DigitalCommons / mykomap

A web application for mapping initiatives in the Solidarity Economy

Problems parsing large CSV files #229

Closed wu-lee closed 4 months ago

wu-lee commented 5 months ago

Describe the bug

[Cooperative World Map] The recently updated CUK dataset (January 2024) has 7248 organisations.

The dataset was converted into a standard CSV file as usual and consumed by the map at https://dev.coopsuk.solidarityeconomy.coop/. The map seems to fail to load the CSV, complaining in the dev console of an error which appears to be the result of the CSV being truncated on download.

I've checked the file: it is fine, and parses correctly elsewhere.

Looking in the developer console, I can see that the string being parsed is truncated.

The size of the string being parsed, converted to bytes, matches the size of the first part of the Content-Range header:

Content-Range: bytes 0-5242879/5281573

Note that the latter number indicates that not all of the file was sent in this response - which is normal, and managed by the server. PapaParse, the CSV library, relies on this header to work correctly, which I know from testing it with mock HTTP responses - it doesn't work without it.

The truncated string typically chops a CSV row in half, so it manifests as a parse error. But the underlying problem seems to be that only the first chunk is being parsed. This explains why the smaller CSV files we've used so far don't exhibit this problem - they fit into one chunk.

The same CSV file works fine in a unit test which sends the data in a single chunk.
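
For reference, a minimal sketch of that kind of single-chunk check (not our actual unit test; the local file path and the expectations in the comments are just illustrative):

    import Papa from 'papaparse';
    import { readFileSync } from 'node:fs';

    // Parse the whole CSV from an in-memory string in one go - no HTTP, no chunking.
    const csv = readFileSync('cuk-standard.csv', 'utf8');  // hypothetical local copy of the CUK CSV
    const result = Papa.parse<Record<string, string>>(csv, { header: true, skipEmptyLines: true });

    console.log(result.errors.length);  // expect 0
    console.log(result.data.length);    // expect 7248 organisations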

To Reproduce

Object { dataset: "coops-uk2", error: {…} }
  dataset: "coops-uk2"
  error: Object { type: "FieldMismatch", code: "TooFewFields", message: "Too few fields: expected 32 fields but parsed 3", … }
    code: "TooFewFields"
    message: "Too few fields: expected 32 fields but parsed 3"
    row: 7178
    type: "FieldMismatch"
    <prototype>: Object { … }
  <prototype>: Object { … }
map.ts:439:14
    startLoading map.ts:439
    onInitiativeLoadMessage map.ts:95
    MapPresenter map.ts:36
    invokeSubscriber postal.js:172
    cacherFn postal.js:529
    getCacher postal.js:418
    Lodash 2
    publish postal.js:538
    Lodash 4
    publish postal.js:537
    pub eventbus.ts:11
    aggregator data-services.ts:649
    fail data-aggregator.ts:128
    load csv-data-loader.ts:63
    (Async: promise callback)
    load csv-data-loader.ts:62
    outstanding data-services.ts:311
    loadDatasets data-services.ts:311
    loadDatasets data-services.ts:654
    loadData data-services.ts:616
    webRun index.ts:186
    <anonymous> index.ts:4
    <anonymous> map-app.js:77525
    <anonymous> map-app.js:77527

Expected behavior

All the data should be loaded and the corresponding organisations visible (7248). There should be no errors.

Additional context

I think it may be the library at fault. It is meant to be able to stream-parse larger files than this, and I can do that using it in tests.
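
Roughly this shape, using PapaParse's documented download/chunk options - a sketch rather than the exact test code:

    import Papa from 'papaparse';

    // Stream-parse a remote CSV: PapaParse fetches it in chunks over HTTP and
    // hands each parsed chunk to the callback.
    Papa.parse('https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv', {
      download: true,       // fetch over HTTP rather than parsing a string
      header: true,
      chunk: (results) => {
        console.log(`parsed ${results.data.length} rows in this chunk`);
      },
      complete: () => console.log('done'),
      error: (err) => console.error(err),
    });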

At the time of writing, however, the demo page for PapaParse seems to have problems with our CUK CSV file. What's more, it has problems with its own test files.

I've filed a bug on GitHub outlining this; let's see what they say:

https://github.com/mholt/PapaParse/issues/1037

ColmDC commented 5 months ago

Okay. Well, let's park it and give time for responses to your post to emerge.

wu-lee commented 4 months ago

No replies at all so far. :/

ColmDC commented 4 months ago

How active is that project?

wu-lee commented 4 months ago

There seems to have been recent activity in the codebase. New issues are being posted, with about five comments in total on the most recent three. So perhaps active, but not super active.

ColmDC commented 4 months ago

From the Element thread: "...is to find or create a test which reproduces that issue without MM, then try and binary-search the PapaParse library for the most recent version which doesn't have it. This might provide a workaround: downgrade to that version. And it would help work out what the fix is (whether we or the PP devs do that). PP's test suite ought to have a test which covers the case."

rogup commented 4 months ago

I didn't end up doing a binary search for this. Think I've worked out what's happening by debugging.

Playing around with the PapaParse demo, I could see that the first chunk was getting processed fine, but then the next chunk (bytes 5242880-10485759) was hitting a net::ERR_CONTENT_DECODING_FAILED error.

This guide https://www.joshcanhelp.com/accept-encoding-content-encoding-err-content-decoding-failed/ helped me work out that it's because we seem to be receiving a content-encoding: gzip header, but the big.csv in the PapaParse demo was being served as a raw CSV, not a gzip. For some reason this only makes us hit an error after the first chunk.

We seem to have the same problem with our server: if we try to get 'https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv' whilst accepting gzip encoding, the server gives us back a content-encoding: gzip header but doesn't actually serve a gzip.
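
For the record, roughly how I'm checking this - a quick Node sketch, not a definitive test; the Range values just mirror the chunk boundaries above:

    import * as https from 'node:https';

    // Ask the data server for the second 5 MiB chunk while advertising gzip support,
    // then inspect what it claims to be sending back.
    https.get(
      'https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv',
      { headers: { 'Accept-Encoding': 'gzip', 'Range': 'bytes=5242880-10485759' } },
      (res) => {
        console.log(res.statusCode);                   // expect 206 Partial Content
        console.log(res.headers['content-encoding']);  // reportedly 'gzip'
        console.log(res.headers['content-range']);     // e.g. 'bytes 5242880-.../...'
        res.resume();                                  // discard the body
      }
    );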

PapaParse uses an XMLHttpRequest when requesting a remote CSV, and this doesn't seem to have an option to disable accepting gzip. It assumes the server will send the correct header: https://stackoverflow.com/questions/22362169/why-does-the-xmlhttprequest-spec-prevent-setting-the-accept-encoding-header

So I don't think this is something that should be fixed in PapaParse, since it's not really broken (apart from their demo). Instead, we should fix our dev.data.solidarityeconomy.coop server to not send a gzip header. @wu-lee how would I go about doing this? I think maybe in the .htaccess file? I don't think I have ssh access to the server.

And then I suppose that whoever is hosting papaparse.com will have to do the same to get the demo site working?

rogup commented 4 months ago

Ok, so I think the problem is not actually about the server gzip encoding negotiation, which I didn't fully understand.

Our data server is successfully gzipping the CSV, but it looks like Content-Encoding: gzip just doesn't work when chunking a response with Content-Range, since the entire CSV gets encoded before sending in chunks. So, because of the way gzip decompression works, the client can successfully decode the first chunk but not any chunks afterwards, since the start is needed for decompression.

It's explained here: https://stackoverflow.com/questions/33947562/is-it-possible-to-send-http-response-using-gzip-and-byte-ranges-at-the-same-time

So I think PapaParse needs to use a different method of chunking, or disable gzip compression when requesting a chunked response.
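
For illustration, here's a small Node sketch of that failure mode, using made-up data and a much smaller chunk size than the real 5 MiB chunks:

    import * as zlib from 'node:zlib';

    // Stand-in CSV, compressed in one go as Apache does for the whole file.
    const csv = 'id,name\n' +
      Array.from({ length: 50000 }, (_, i) => `${i},${Math.random()}`).join('\n');
    const gz = zlib.gzipSync(Buffer.from(csv));
    const CHUNK = 16 * 1024;   // pretend chunk size

    // Chunk 0 starts with the gzip header, so a streaming decoder can emit a partial
    // prefix of the CSV before failing at the truncation point
    // (typically "unexpected end of file").
    const gunzip = zlib.createGunzip();
    let decoded = 0;
    gunzip.on('data', (d) => { decoded += d.length; });
    gunzip.on('error', (e) => console.log(`chunk 0: decoded ${decoded} bytes, then: ${e.message}`));
    gunzip.end(gz.subarray(0, CHUNK));

    // Chunk 1 is the middle of the gzip stream: no gzip header, so decompression
    // fails immediately - the same class of error as the demo's "incorrect header check".
    try {
      zlib.gunzipSync(gz.subarray(CHUNK, 2 * CHUNK));
    } catch (e) {
      console.log(`chunk 1: ${(e as Error).message}`);
    }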

rogup commented 4 months ago

I don't think there's any easy solution to this, without completely changing how PapaParse does streaming. It seems to be a limitation of HTTP right now, if we want to use Content-Range. The encoding settings in the header cannot be set programmatically, since they're set automatically by user agents; e.g. requests via Cloudflare will have their encoding headers changed: https://developers.cloudflare.com/fundamentals/reference/http-request-headers/

I think we just need to disable streaming CSVs and get PapaParse to download the whole file in one chunk. In fact, I think this is more efficient anyway since it will reduce the number of HTTP requests and save time. I don't think streaming is actually leading to many memory efficiencies, since we're basically saving the whole CSV into memory anyway. It would only bring efficiencies if we were simply processing the CSV then throwing away lots of the data.
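
Something like this sketch, i.e. fetch the whole file ourselves and hand PapaParse the complete string so it never issues Range requests - illustrative only, not the actual Mykomap loader:

    import Papa from 'papaparse';

    // Download the entire CSV in one response, then parse it from memory.
    async function loadWholeCsv(url: string) {
      const response = await fetch(url);
      const text = await response.text();   // whole CSV held in memory
      return Papa.parse<Record<string, string>>(text, { header: true, skipEmptyLines: true });
    }

    loadWholeCsv('https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv')
      .then(result => console.log(result.data.length, result.errors.length));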

@wu-lee what do you think about disabling streaming?

wu-lee commented 4 months ago

Interesting observations.

So in my investigations, it was clear that (some) uncompressed data was arriving, because I could see it - plus the parser wouldn't be able to decode compressed data anyway.

I would guess everything at the level of the HTTP server and the browser is working normally. So I think we agree it's something to do with PapaParse's streaming. It would seem like they ought to have had something working once, at least, since they claim to have large file support, and gzip encoding is not at all unusual (in fact, as in our case, it seems to be standard out-of-the-box behaviour with Apache).

My hunch is that some recent work broke the streaming in a way that their test suite missed.

We could disable streaming and see if it works - this is just an option to PP, and I suspect it could be flipped in the debugger, even. This would confirm your theory.

It isn't true that the CSV is just downloaded into memory verbatim and used - we are indeed processing and throwing away that CSV data. The plain-text CSV is parsed and transformed into another data structure a row at a time. So I think streaming helps it cope with large datasets, and I'd need to feel a bit desperate to give it up in order to handle larger ones. I'm also not totally certain that disabling it would speed up the download significantly: HTTP is likely well optimised these days, so I suspect the chunks may well be mediated over a single TCP connection, and the overhead would be a few extra header packets.
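
For illustration, the row-at-a-time shape I mean - a sketch, not the actual Mykomap code; the column names and target structure are made up:

    import Papa from 'papaparse';

    // Hypothetical target structure built up one row at a time; the raw CSV text
    // for each row can be discarded as soon as it has been transformed.
    const initiatives: { id: string; name: string }[] = [];

    Papa.parse('https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv', {
      download: true,
      header: true,
      step: (row) => {                                   // called once per parsed row
        const r = row.data as Record<string, string>;
        initiatives.push({ id: r['Identifier'], name: r['Name'] });  // hypothetical column names
      },
      complete: () => console.log(`loaded ${initiatives.length} initiatives`),
    });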

But in order to show it's a streaming problem, sure, disable it.

rogup commented 4 months ago

I think the first chunk is being successfully transferred and decompressed because, although it's incomplete, it starts with the gzip header. The error that the PapaParse code is hitting for the second chunk is (61) Error while processing content unencoding: incorrect header check, which indicates that decompression is failing (because that chunk doesn't start with the gzip header).

I see what you mean that PapaParse was probably working with streaming and gzip in the past, since it's had lots of testers. I'll check against the first version of PapaParse.

And yes, I suppose we're row-transforming the data, so maybe this causes some duplication and inefficiency in memory. I wonder if there's a way to make this efficient in our processing code (e.g. throwing away rows from the original data after processing them), which would be just as efficient as downloading and processing chunks. I think this would be a simpler fix than forking PapaParse, since the issue seems to be quite baked into their current implementation.

rogup commented 4 months ago

Ok, some new intel.....

I just tested the PapaParse demo again (after >1 day of not testing it), and now it's working with streamed data.

But... if I now:

  1. empty cache and do a hard reload
  2. re-run the demo without streaming ---> success
  3. refresh (keeping the browser cache) and re-run the demo with streaming ---> failure after the first chunk, with the same error that I was picking up before, and for the reasons I described, i.e. decompression failing for a chunk of a gzip when the whole file was compressed in one.

This is repeatable for me, and if I reset the browser cache, streaming works again.

@wu-lee please can you test that this is repeatable for you?

This indicates that the caching functionality isn't working with the Content-Range header when gzip is used, since it's just serving up a chunk of the cache post-compression, rather than chunking on the server side pre-compression.

This isn't exactly the same as the Mykomap bug, since that never works, even after resetting the browser cache. But maybe it has something to do with the caching on our server.

rogup commented 4 months ago

I think all of this analysis above was a red herring, and we were just hitting the bug in the demo due to browser caching, which is different to what's happening in Mykomap.

In the Mykomap console logs, I'm seeing this error in PapaParse, which I'd missed before:

[Screenshot: PapaParse error in the Mykomap console, 2024-03-08 20:30:23]

This part of the PapaParse code is failing:

function getFileSize(xhr)
{
    var contentRange = xhr.getResponseHeader('Content-Range');
    if (contentRange === null) { // no content range, then finish!
        return -1;
    }
    return parseInt(contentRange.substring(contentRange.lastIndexOf('/') + 1));
}

In particular, xhr.getResponseHeader('Content-Range') returns null, since the CORS policy means that this header isn't readable for cross-origin requests to our data server: https://stackoverflow.com/a/7463297
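
To spell that out - an illustrative snippet, not PapaParse's own code; the Range value just mirrors the chunk size above:

    // In a cross-origin XHR the browser only exposes CORS-safelisted response headers
    // to scripts unless the server opts in with Access-Control-Expose-Headers.
    const xhr = new XMLHttpRequest();
    xhr.open('GET', 'https://dev.data.solidarityeconomy.coop/coops-uk/standard.csv');
    xhr.setRequestHeader('Range', 'bytes=0-5242879');   // first chunk, much as PapaParse requests it
    xhr.onload = () => {
      // Without Access-Control-Expose-Headers: Content-Range on the data server, this is
      // null even though the header is on the wire - so getFileSize() returns -1 and
      // PapaParse stops after the first chunk.
      console.log(xhr.getResponseHeader('Content-Range'));
    };
    xhr.send();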

I've just tried on a browser with CORS checks disabled and Mykomap is successfully downloading all the data!

rogup commented 4 months ago

@wu-lee We need to add the Access-Control-Expose-Headers: Content-Range header to the CORS policy of responses from the data server, dev.data.solidarityeconomy.coop

See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Access-Control-Expose-Headers

Please can you give me access to this server (dev-1 I think, judging by the ansible config?)

wu-lee commented 4 months ago

I had a search for PapaParse and the Access-Control-Expose-Headers header, expecting to find someone with a similar problem, but didn't really find much. This seems to be the closest match I could find:

https://github.com/mholt/PapaParse/issues/835

rogup commented 4 months ago

Yes I think that person is hitting the same problem, and I've mentioned this solution on that ticket.

I added Header add Access-Control-Expose-Headers "Content-Range" to the Apache custom.conf for dev.data.solidarityeconomy.coop and it now appears to be working fine.

Think this can be closed now @wu-lee

ColmDC commented 4 months ago

Nice. I can confirm that https://dev.coopsuk.solidarityeconomy.coop/ is loading again and it seems faster than before. Might that be the case?

rogup commented 4 months ago

I'm not sure why it would load faster. Maybe it's because your browser wasn't caching the failed partial response before but now it is? If you empty the cache and hard reload (on Chrome, open the inspector window and then right-click the refresh button), maybe it will take longer to reload again. Or maybe there is some improvement in the caching server side.

ColmDC commented 4 months ago

As you say, several things could have influenced previous actual and perceived/remembered load times. Not worth further investigation.

wu-lee commented 3 months ago

Note, the fix is to update the file /var/www/vhosts/data.solidarityeconomy.coop/custom.conf on dev-1 to add the Access-Control-Expose-Headers line (line 6 here):

<Files ~ "\.(csv|ttl|rdf|json)$">
  #  allow browser scripts to access the data here directly
  Header add Access-Control-Allow-Origin "*"
  Header add Access-Control-Allow-Headers "origin, x-requested-with, content-type, range"
  Header add Access-Control-Allow-Methods "GET, POST, OPTIONS"
  Header add Access-Control-Expose-Headers "Content-Range"
</Files>

This needs to be done on the other servers which can be used for CSV downloads by maps. So I've updated the same file on prod-1, and the equivalent file on dev-2 and prod-2: dev-2:/var/www/vhosts/data.digitalcommons.coop/custom.conf.

(There's not really a home for this file in a repo currently. It's on my mental list for some consideration.)

rogup commented 3 months ago

@wu-lee Ah thanks, I didn't realise there was more than 1 server. Maybe it could be added to Ansible?

wu-lee commented 3 months ago

Possibly it should be - the problem is that the Ansible stuff's boundary of knowledge stops short of the details of the applications that are installed (i.e. what's in custom.conf in this case). So I would say it's more within the domain of the applications themselves to keep track of these things. We could create a git repo for it - but a very small one. The data-publishing virtual host is a degenerate case of an application.