NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

Support HTTP Range Requests in MNRead.get #1709

Open robyngit opened 10 months ago

robyngit commented 10 months ago

Detect and handle HTTP range requests to enable clients to retrieve a portion of a file without the need to download the entire content. This feature would allow MetacatUI and other clients to preview data files before downloading them. It would also allow clients to resume downloads in the event of a network interruption.

Note: https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A24b85258-3e86-40cb-accc-28153513dea8 returns a 100,000-line CSV file that could be useful for testing.
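
For illustration only, here is a minimal Python sketch of what the two client use cases (preview and resume) could look like on the wire. The helper name `build_range_request` is hypothetical and not part of Metacat or MetacatUI; the URL is the test object mentioned above. The sketch only builds the requests and does not send them.

```python
import urllib.request

# Test object mentioned in the issue; any object URL would do.
URL = ("https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/"
       "urn%3Auuid%3A24b85258-3e86-40cb-accc-28153513dea8")


def build_range_request(url, start, end=None):
    """Build a GET request for bytes start..end (inclusive).

    Hypothetical helper. If end is None the range is open-ended,
    meaning "from start to the end of the object".
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    last = "" if end is None else str(end)
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={start}-{last}")
    return req


# Preview: ask for only the first 4 KiB of the file.
preview = build_range_request(URL, 0, 4095)

# Resume: 1,000,000 bytes are already on disk, so ask for the rest.
resume = build_range_request(URL, 1_000_000)
```

Passing either request to `urllib.request.urlopen` would send it. A server that honors the header replies with `206 Partial Content` and a `Content-Range` header, while a server that ignores it replies with `200 OK` and the full body, so a client should check the status code before treating the body as a partial result.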

taojing2002 commented 10 months ago

@robyngit Two questions:

  1. Does this feature support only text data files (e.g., CSV)? What about Excel files?
  2. What are the units of the range: lines, bytes, or both?

mbjones commented 10 months ago

@taojing2002 good questions. Range requests are byte-based: the client specifies the byte range it wants. They are application-agnostic and assume that the client knows what to do with the bytes. Tools like curl use range requests to resume downloads after a network connection is interrupted. Data systems use range requests to retrieve chunks of data from inside a data file, but that is of course only useful if the data files are organized in such a way that contiguous byte ranges produce meaningful chunks. So, for text files, getting the first few KB is a good way to get a preview, but the client would need to be aware that the end of the byte range is unlikely to coincide with the end-of-line delimiter used in that format. In contrast, netCDF, HDF5, and Zarr are binary formats designed so that byte range requests can retrieve specific segments of data that correspond to scientifically meaningful chunks (e.g., a single image out of a time series, or a specific spatial window out of a larger extent). Hope that's all helpful.
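
To make the text-preview caveat concrete, here is a small Python sketch of how a client might discard the partial last line of a byte-range preview. The function name `trim_to_complete_lines` is hypothetical, and the sketch assumes the chunk starts at offset 0 and that lines end in `\n`.

```python
def trim_to_complete_lines(chunk: bytes, newline: bytes = b"\n") -> bytes:
    """Drop any trailing partial line from a byte-range preview.

    Hypothetical client-side helper: keeps everything up to and
    including the last newline, and returns b"" if the chunk does
    not contain a single complete line.
    """
    cut = chunk.rfind(newline)
    if cut == -1:
        return b""
    return chunk[:cut + len(newline)]


# A range that happens to end in the middle of the third row.
chunk = b"id,temp\n1,20.5\n2,21"
complete = trim_to_complete_lines(chunk)
```

Here `complete` is `b"id,temp\n1,20.5\n"`. For UTF-8 text, cutting at a newline also avoids ending in the middle of a multi-byte character, because the byte `0x0A` never occurs inside a multi-byte sequence.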

taojing2002 commented 10 months ago

@mbjones Thanks! So I think we will use bytes as the range unit for all formats. Clients are responsible for parsing the bytes they receive.
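
As a rough sketch of the server-side half of this decision, the Python below parses a single-range `Range` header in `bytes` units against a known object size. It is not Metacat code (Metacat is written in Java), and `parse_byte_range` and `RangeNotSatisfiable` are hypothetical names. It assumes that multi-range and malformed headers are simply ignored, which falls back to a normal full response.

```python
import re

# Matches "bytes=first-last", "bytes=first-" and "bytes=-suffix".
_RANGE_RE = re.compile(r"bytes=(\d*)-(\d*)")


class RangeNotSatisfiable(Exception):
    """Hypothetical signal that the server should answer 416."""


def parse_byte_range(header, size):
    """Return inclusive (start, end) offsets, or None to serve the full object."""
    m = _RANGE_RE.fullmatch(header.strip()) if header else None
    if m is None or (m.group(1) == "" and m.group(2) == ""):
        return None  # absent, malformed, or multi-range header: ignore it
    first, last = m.group(1), m.group(2)
    if first == "":
        # Suffix form "bytes=-N": the last N bytes of the object.
        length = int(last)
        if length == 0 or size == 0:
            raise RangeNotSatisfiable
        return max(size - length, 0), size - 1
    start = int(first)
    if last and int(last) < start:
        return None  # last < first is invalid, so ignore the header
    if start >= size:
        raise RangeNotSatisfiable
    # Clamp the end to the final byte of the object.
    end = size - 1 if not last else min(int(last), size - 1)
    return start, end
```

With a 5000-byte object, `bytes=0-1023` yields `(0, 1023)`, `bytes=4000-` yields `(4000, 4999)`, and `bytes=-500` yields `(4500, 4999)`. A handler built on this would answer `206 Partial Content` with `Content-Range: bytes start-end/size` for a tuple, `200 OK` for `None`, and `416 Range Not Satisfiable` when the exception is raised.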