Unstructured-IO / unstructured-api

Apache License 2.0

fix: receive csv when `output_format=text/csv` #373

Closed ds-filipknefel closed 5 months ago

ds-filipknefel commented 5 months ago

The problem of not receiving a CSV-formatted response stems from reading the `Accept` header to decide which format the response is sent in.

I suggest that `output_format` be wholly in charge of this, and that we stop raising 406 errors when the `Accept` header and the response type don't match, since the user explicitly knows what format they'll get.

The only exception is `multipart/mixed`, but this solution behaves exactly as before when `Accept: multipart/mixed` is provided (which I'm not sure works at all right now; I didn't manage to send a successful request through curl with this header included).
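
To make the proposal concrete, here is a minimal sketch of the selection rule described above. It is not the code from this PR: the function name `pick_media_type`, its parameters, and the `application/json` fallback are assumptions made for illustration.

```python
from typing import Optional


def pick_media_type(output_format: str, accept: Optional[str] = None) -> str:
    """Hypothetical helper: choose the response media type.

    `output_format` decides the format; the Accept header is consulted
    only for the multipart/mixed special case, and a mismatch never
    produces a 406.
    """
    # Special case kept from the existing behaviour: multipart/mixed
    # is still requested through the Accept header.
    if accept is not None and "multipart/mixed" in accept.lower():
        return "multipart/mixed"

    # Otherwise output_format alone decides.
    if output_format == "text/csv":
        return "text/csv"

    # Assumed default for this sketch.
    return "application/json"


# A client that sends Accept: application/json but asks for CSV gets CSV.
print(pick_media_type("text/csv", accept="application/json"))
```

With this rule, a request with `output_format=text/csv` gets CSV back regardless of what the `Accept` header says, which is the behaviour the PR title asks for.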

ds-filipknefel commented 5 months ago

> Please update tests with csv conversion (pd.read_csv) for example here (another one is in smoke test). I don't think we need this if csv is supported. Otherwise LGTM!

I actually think using pd.read_csv is good here, and I would rewrite my tests to use it as well, given that our CSV does describe a table. If we operate on the CSV string directly, e.g. to check the number of elements, we can run into trouble; I ran into exactly that just now when trying to remove pd.read_csv. A single value in a CSV-encoded table can span multiple lines, so we can't rely on counting lines to validate the number of elements.

Letting pandas parse this csv into a table makes it easier.
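
A small sketch of the failure mode, with made-up sample data (the column names and values are illustrative, not real API output). To stay dependency-free it uses the standard-library `csv` module rather than pandas; `pd.read_csv` treats quoted newlines the same way, so the same reasoning applies to the tests.

```python
import csv
import io

# Two elements; the second has a text value that contains a newline.
csv_text = (
    "type,text\n"
    'Title,"Hello"\n'
    'NarrativeText,"line one\nline two"\n'
)

# Naive approach: count physical lines and subtract the header.
naive_count = len(csv_text.splitlines()) - 1

# CSV-aware approach: let a parser handle quoted multi-line fields.
rows = list(csv.reader(io.StringIO(csv_text)))
parsed_count = len(rows) - 1  # subtract the header row

print(naive_count)   # 3, which overcounts
print(parsed_count)  # 2, the actual number of elements
```

The line count is off by one for every embedded newline, while the parsed row count matches the number of elements.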