SuperDuperDB / superduperdb

🔮 SuperDuperDB: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. Including streaming inference, scalable model training and vector search.
https://superduperdb.com
Apache License 2.0

Delete all logic related to downloads. #2158

Open jieguangzhou opened 3 weeks ago

jieguangzhou commented 3 weeks ago

To replace this functionality, we define a new DataType to handle it. Here is an example with S3:

We create an S3 DataType with a parameter, pre_download.

The logic when pre_download is True and the data is encodable as a File:

During encoding:

Download the file from S3 and create a FileEncodable with the specified download path. After saving the data, the file/folder will be stored in the artifact store.

During decoding:

Retrieve the file/folder from the artifact store.

This logic is similar to the file encodable logic that retrieves files from artifacts.
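The pre_download=True path above can be sketched roughly as follows. This is a minimal illustration, not the actual superduperdb API: the names encode_pre_download, decode_pre_download, and the dict-based artifact store are assumptions, and the S3 download is stubbed out so the sketch is self-contained.

```python
import os
import tempfile


def _download_from_s3(uri, dest_dir):
    """Stand-in for a real S3 download (e.g. via boto3); writes a placeholder file."""
    path = os.path.join(dest_dir, os.path.basename(uri))
    with open(path, "w") as f:
        f.write(f"contents of {uri}")
    return path


class FileEncodable:
    """Minimal stand-in: wraps a local file path as `x`."""

    def __init__(self, x):
        self.x = x

    def unpack(self):
        return self.x


def encode_pre_download(uri, artifact_store):
    """Encoding: download from S3, save into the artifact store, return a FileEncodable."""
    local_path = _download_from_s3(uri, tempfile.mkdtemp())
    with open(local_path) as f:
        artifact_store[uri] = f.read()  # the file is now stored in the artifact store
    return FileEncodable(local_path)


def decode_pre_download(uri, artifact_store, dest_dir):
    """Decoding: retrieve the file from the artifact store."""
    path = os.path.join(dest_dir, os.path.basename(uri))
    with open(path, "w") as f:
        f.write(artifact_store[uri])
    return path
```

The key point of the design is that decoding never touches S3 again: once encoded, the data lives entirely in the artifact store.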

The logic when pre_download is False:

During encoding:

Return the original S3 path.

During decoding:

Download the data from S3. RemoteData calls a download module, which provides the logic for loading remote files/URIs.

RemoteData

class RemoteData(_BaseEncodable):
    type: str  # one of "s3", "http"
    x: xxxxx

download module

def load_from_s3(url, **kwargs):
    ...

def load_html(): ...

def load_file(): ...
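One way the RemoteData sketch above could dispatch to the download module is via a mapping from `type` to loader. This is a hypothetical elaboration of the proposal: the loader names follow the sketch, but their bodies are stubs for illustration.

```python
from dataclasses import dataclass


# Stubbed loaders standing in for the real download module.
def load_from_s3(uri, **kwargs):
    return f"downloaded from s3: {uri}"


def load_html(uri, **kwargs):
    return f"downloaded over http: {uri}"


def load_file(uri, **kwargs):
    return f"read local file: {uri}"


_LOADERS = {"s3": load_from_s3, "http": load_html, "file": load_file}


@dataclass
class RemoteData:
    type: str  # one of "s3", "http", "file"
    x: str     # the remote URI

    def unpack(self):
        # Decode path for pre_download=False: download only on access.
        return _LOADERS[self.type](self.x)
```

A dispatch table keeps the encodable itself trivial: adding a new remote scheme means registering one loader function, with no change to RemoteData.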

Example

Pre Download

from superduperdb.components.datatype import HttpPredownload

data = HttpPredownload("https://superduperdb.com/xxx")

# 1. download the data to /tmp/xxx
# 2. Save the data to artifact store
# 3. Create a file encodable.
db['documents'].insert_one({"data": data})

# 1. Load the encodable
# 2. Init the encodable and pull the file from the artifact store
# 3. Return file.x via file.unpack()
db['documents'].find_one() # This inits the file encodable and we can get the real file (xxxx.html) via file.x

No Pre Download

from superduperdb.components.datatype import Http

data = Http("https://superduperdb.com/xxx")

# save the "https://superduperdb.com/xxx" as x
db['documents'].insert_one({"data": data})

# 1. download the data to /tmp/xxx
# 2. return the path /tmp/xxx
db['documents'].find_one() # That will download the HTML to a file and return the path
blythed commented 3 weeks ago

How does the download task get handled? Currently we have this job which looks for URIs and downloads them, before any other jobs are triggered. https://github.com/SuperDuperDB/superduperdb/blob/fc9184c41920a04e4370333d710eb6c10bc866ae/superduperdb/base/datalayer.py#L717

blythed commented 3 weeks ago

Another thing - this won't work if the datatypes are Encodable or Artifact. This seems to be in the paradigm of File.

jieguangzhou commented 3 weeks ago

How does the download task get handled? Currently we have this job which looks for URIs and downloads them, before any other jobs are triggered.

https://github.com/SuperDuperDB/superduperdb/blob/fc9184c41920a04e4370333d710eb6c10bc866ae/superduperdb/base/datalayer.py#L717

It is no longer treated as a task. For example, when you use data = HttpPredownload("https://superduperdb.com/xxx").dict(), the data has already been downloaded.

Then we get the real file path by calling data['x'], or by calling HttpPredownload.encode_data("https://superduperdb.com/xxx").

jieguangzhou commented 3 weeks ago

Another thing - this won't work if the datatypes are Encodable or Artifact. This seems to be in the paradigm of File.

We can also use this logic here; it depends on how we want to save the specific data, whether as a file or as binary data.
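The point about the File vs. Encodable/Artifact paradigms could be captured by parameterising the storage step on the payload form. This is a hypothetical sketch, not superduperdb code: encode_payload is an invented name used to show the fork between storing a path and storing raw bytes.

```python
import tempfile


def encode_payload(raw, as_file):
    """Store downloaded data either as a file on disk (File paradigm)
    or as raw bytes (Encodable/Artifact paradigm)."""
    if as_file:
        f = tempfile.NamedTemporaryFile(delete=False)
        f.write(raw)
        f.close()
        return f.name  # a path, as a FileEncodable would hold
    return raw  # raw bytes, as an Encodable/Artifact would hold
```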

blythed commented 3 weeks ago

How does the download task get handled? Currently we have this job which looks for URIs and downloads them, before any other jobs are triggered. https://github.com/SuperDuperDB/superduperdb/blob/fc9184c41920a04e4370333d710eb6c10bc866ae/superduperdb/base/datalayer.py#L717

It is no longer treated as a task. For example, when you use data = HttpPredownload("https://superduperdb.com/xxx").dict(), the data has already been downloaded.

Then we get the real file path by calling data['x'], or by calling HttpPredownload.encode_data("https://superduperdb.com/xxx").

But then we miss all of the benefits of the current downloading task: the multi-threading and the multi-processing.

blythed commented 3 weeks ago

How does the download task get handled? Currently we have this job which looks for URIs and downloads them, before any other jobs are triggered. https://github.com/SuperDuperDB/superduperdb/blob/fc9184c41920a04e4370333d710eb6c10bc866ae/superduperdb/base/datalayer.py#L717

It is no longer treated as a task. For example, when you use data = HttpPredownload("https://superduperdb.com/xxx").dict(), the data has already been downloaded. Then we get the real file path by calling data['x'], or by calling HttpPredownload.encode_data("https://superduperdb.com/xxx").

But then we miss all of the benefits of the current downloading task: the multi-threading and the multi-processing.

Would we need to incorporate this into .execute() or into the cursor?
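The parallelism blythed is worried about losing could in principle be recovered at batch level, e.g. inside .execute() or the cursor. A minimal sketch using a thread pool, where `download` stands in for whatever single-URI loader the datatype provides (threads suit these I/O-bound HTTP/S3 transfers):

```python
from concurrent.futures import ThreadPoolExecutor


def download_batch(uris, download, max_workers=8):
    """Download many URIs concurrently, preserving input order.

    `download` is any single-URI loader (e.g. load_from_s3);
    pool.map applies it across threads and keeps results in order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download, uris))
```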