ballerina-platform / ballerina-library

The Ballerina Library
https://ballerina.io/learn/api-docs/ballerina/
Apache License 2.0
136 stars 64 forks source link

Proposal: Add data binding support for file retrieval in FTP client #6529

Open NipunaRanasinghe opened 6 months ago

NipunaRanasinghe commented 6 months ago

Summary

The current implementation of the get() method in the Ballerina FTP connector allows fetching file contents only as a stream of bytes. This proposal suggests enhancing the get() method to support direct data binding to JSON, XML, and byte arrays, reducing the need for manual conversions and stream handling.

Goals

  1. Enhance the get() method to support direct data binding, allowing it to return file contents directly as JSON, XML, or byte arrays.

Non-Goals

  1. This proposal does not aim to remove the existing return type of the get() method, which returns file contents as a stream of bytes. All the newly introduced return types will be added as subtypes of a union type, to be backward compatible.

  2. This proposal does not aim to support additional data formats beyond JSON, XML, and byte arrays (e.g., CSV).

Motivation

Currently, fetching and converting file contents using the FTP connector involves multiple steps and explicit stream handling, which can be cumbersome and error-prone. By enabling direct data binding, we aim to provide a more streamlined and efficient way to handle file contents.

Description

The proposed changes will enhance the get() method in the FTP connector to automatically infer and bind the data type based on the expected return type. This will be accomplished by overloading the get() method to accept a type descriptor as an optional parameter that dictates the expected format of the data.

Proposed Changes to the get() Method

# Fetches file content from the FTP server.
# + path - The path to the file on the FTP server
# + returnType - The expected return type (JSON, XML, byte[], or stream<byte[], Error?>)
# + return - The file content in the specified format or, an `ftp:Error`
remote isolated function get(string path, typedesc<json|xml|byte[]|stream<byte[], Error?>> returnType = <>) returns returnType|Error;

Examples

  1. Fetching content directly as JSON:

    json jsonContent = check ftpClient->get("/path/to/file.json");
  2. Fetching content directly as XML:

    xml xmlContent = check ftpClient->get("/path/to/file.xml");
  3. Fetching content directly as byte array:

    byte[] byteContent = check ftpClient->get("/path/to/file");
  4. Fetching content as a stream of byte arrays:

    stream<byte[], ftp:Error?> byteStream = check ftpClient->get("/path/to/file");

Compatibility and Migration

  1. Existing code using the get() method with the default behavior (stream of bytes) will continue to work without modifications.
  2. New code can take advantage of the enhanced capabilities by specifying the desired return type.

Risks and Assumptions

  1. The correct format of the file content is assumed based on the provided returnType. Incorrect assumptions may lead to runtime errors.
  2. Adequate error handling should be implemented to manage data conversion errors effectively.
NipunaRanasinghe commented 6 months ago

Related to https://github.com/ballerina-platform/ballerina-library/issues/4234

ayeshLK commented 6 months ago

@NipunaRanasinghe did we think about the performance aspect of this. Lets say we have a large file (may be a JSON or a CSV) and we are trying to get it and data bind it to JSON (or CSV) what would be the behaviour ?

Better to check the behaviour of Ballerina IO package when we try to read a large file.

NipunaRanasinghe commented 6 months ago

@ayeshLK Your concerns regarding large files is certainly valid. However, it's important to note that the proposed approach includes both byte array streams and, direct data types such as JSON, XML, etc as the expected type. Therefore this approach gives users the flexibility to make a conscious decision based on their application needs. (If they are dealing with relatively small files, direct data binding would be convenient and appropriate. Conversely, for larger files, they could opt for streaming to manage resource usage more effectively).

Nevertheless, we can do a performance comparison between the two approaches after implementing the suggested changes. This will help us ensure that the new functionality can be used efficiently across different scenarios.

daneshk commented 6 months ago

@NipunaRanasinghe Shall we have a separate API for streaming like in IO API[1]? Ask the user to specify the block size and read it as a stream.

public isolated function getBlockAsStream(string path, int blockSize) returns stream<Block, Error?>;

# The read-only byte array that is used to read the byte content from the streams.
public type Block readonly & byte[];
  1. https://github.com/ballerina-platform/module-ballerina-io/blob/master/ballerina/file_byte_io.bal#L34
daneshk commented 6 months ago

What would the usage look like to read CSV files from the FTP server? CSV file read/write is a common usecase.

NipunaRanasinghe commented 6 months ago

@NipunaRanasinghe Shall we have a separate API for streaming like in IO API[1]? Ask the user to specify the block size and read it as a stream.

@daneshk here the stream capability is kept under the same API, mainly to the keep the backward compatibility (so that the existing code will not break and users can optionally switch to infer the types, only if they want).

If the backward compatibility is not a priority, I'm +1 to move the streaming capability into a new API.

NipunaRanasinghe commented 6 months ago

What would the usage look like to read CSV files from the FTP server? CSV file read/write is a common usecase.

Here the CSV support is not prioratised (as mentioned in non-goals), as Ballerina still lack of first-class support for CSV data types (unlike json and xml).

Therefore one option would be to go for a cleaner implementation once we we have a better support for CSV data type via the data.csvdata library(ETA - Update10).

But if this should be prioritized, I'm +1 to consider the CSV support (with the representation of string[][]) with the same effort and, to revisit once after we have proper CSV support.

dilanSachi commented 6 months ago

When considering about the return types json|xml|byte[], we will support subtypes of json as well right? i.e a file containing a json array will map to Person[].

NipunaRanasinghe commented 6 months ago

As per the design discussion with @shafreenAnfar, @daneshk and @dilanSachi, we've decided to,

NipunaRanasinghe commented 5 months ago

Please find the new proposal below, which includes the additional requirements mentioned above.

Summary

The current implementation of the get() method in the Ballerina FTP client allows fetching and uploading file contents only as streams of bytes. This proposal suggests introducing a set of new APIs to support direct data binding to common data formats such as text, JSON, XML, byte arrays and CSV, reducing the need for manual conversions and stream handling.

Goals

  1. Introduce new APIs to support direct data binding, allowing it to return and upload file contents directly as text, JSON, XML, byte arrays, CSV, and streams.
  2. Introduce new APIs that support direct data binding from and to Ballerina record types.

Non-Goals

  1. This proposal does not aim to remove the existing get() and put() methods that return and upload file contents as streams of bytes. The existing methods will remain for backward compatibility and, expected to be removed in a future release.

Motivation

Currently, fetching, converting, and uploading file contents using the FTP connector involves multiple steps and explicit stream handling, which can be cumbersome. By enabling direct data binding through specific APIs, we aim to provide a more streamlined and efficient way to handle file contents.

Description

The proposed changes will introduce new methods in the FTP connector, each tailored to fetch and bind file contents to specific data types directly. Similarly, new methods for uploading contents directly as various data types will be introduced.

Proposed New APIs to replace get() method

  1. Fetch content as plain text

    # Fetches file content from the FTP server as plain text.
    # + path - The path to the file on the FTP server
    # + return - The file content as JSON or, an `ftp:Error`
    remote isolated function getText(string path) returns string|Error;
  2. Fetch content as JSON

    # Fetches file content from the FTP server as JSON.
    # + path - The path to the file on the FTP server
    # + return - The file content as JSON or, an `ftp:Error`
    remote isolated function getJson(string path) returns json|Error;
  3. Fetch content as XML

    # Fetches file content from the FTP server as XML.
    # + path - The path to the file on the FTP server
    # + return - The file content as XML or, an `ftp:Error`
    remote isolated function getXml(string path) returns xml|Error;
  4. Fetch content as a byte array

    # Fetches file content from the FTP server as a byte array.
    # + path - The path to the file on the FTP server
    # + return - The file content as a byte array or, an `ftp:Error`
    remote isolated function getBytes(string path) returns byte[]|Error;
  5. Fetch content as a stream of byte arrays

    # Fetches file content from the FTP server as a stream of byte arrays.
    # + path - The path to the file on the FTP server
    # + blockSize - size of a single block to be fetched
    # + return - The file content as a stream of byte arrays or, an `ftp:Error`
    remote isolated function getBytesAsStream(string path, int blockSize) returns stream<readonly & byte[], Error?>|Error;
  6. Fetch content as CSV

    # Fetches file content from the FTP server as CSV.
    # + path - The path to the file on the FTP server
    # + return - The file content as a 2D string array (CSV) or, an `ftp:Error`
    remote isolated function getCsv(string path) returns string[][]|Error;
  7. Fetch content as a CSV stream

    # Fetches file content from the FTP server as a stream of CSV entries(i.e., rows).
    # + path - The path to the file on the FTP server
    # + return - The file content as a stream of string arrays or, an `ftp:Error`
    remote isolated function getCsvAsStream(string path) returns stream<readonly & string[], Error?>|Error;
  8. Fetch JSON content as a Ballerina record

    # Fetches JSON file content from the FTP server as a Ballerina record.
    # + path - The path to the file on the FTP server
    # + returnType - The expected Ballerina record type
    # + return - The file content as the specified Ballerina record type or, an `ftp:Error`
    remote isolated function getJsonWithType(string path, typedesc<T> returnType = <>) returns returnType|Error;
  9. Fetch XML content as a Ballerina record

    # Fetches XML file content from the FTP server as a Ballerina record.
    # + path - The path to the file on the FTP server
    # + returnType - The expected Ballerina record type
    # + return - The file content as the specified Ballerina record type or, an `ftp:Error`
    remote isolated function getXmlWithType(string path, typedesc<T> returnType = <>) returns returnType|Error;
  10. Fetch CSV content as a Ballerina record

    # Fetches CSV file content from the FTP server as a Ballerina record.
    # + path - The path to the file on the FTP server
    # + returnType - The expected Ballerina record type
    # + return - The file content as the specified Ballerina record type or, an `ftp:Error`
    remote isolated function getCsvWithType(string path, typedesc<T> returnType = <>) returns returnType[]|Error;

Proposed New APIs to replace put() method

  1. Upload content as text

    # Uploads file content to the FTP server directly as plain text.
    # + path - The path to the file on the FTP server
    # + content - The file content as a string
    # + return - An `ftp:Error` if an error occurs, otherwise nil
    remote isolated function putText(string path, string content) returns Error?;
  2. Upload content as JSON or Ballerina record

    # Uploads file content to the FTP server directly as JSON or, the counterpart Ballerina record representation.
    # + path - The path to the file on the FTP server
    # + content - The file content as JSON or a Ballerina record
    # + return - An `ftp:Error` if an error occurs, otherwise nil
    remote isolated function putJson(string path, json|record {|anydata...;|}  content) returns Error?;
  3. Upload content as XML or Ballerina record

    # Uploads file content to the FTP server directly as XML or, the counterpart Ballerina record representation.
    # + path - The path to the file on the FTP server
    # + content - The file content as XML or a Ballerina record
    # + return - An `ftp:Error` if an error occurs, otherwise nil
    remote isolated function putXml(string path, xml|record {|anydata...;|}  content) returns Error?;
  4. Upload content as CSV

    # Uploads file content to the FTP server as CSV or, the counterpart Ballerina record representation.
    # + path - The path to the file on the FTP server
    # + content - The file content as a 2D string array (CSV)
    # + return - An `ftp:Error` if an error occurs, otherwise nil
    remote isolated function putCsv(string path, string[][]|record {|anydata...;|}  content) returns Error?;
  5. Upload content as a byte array

    # Uploads file content to the FTP server as a byte array.
    # + path - The path to the file on the FTP server
    # + content - The file content as a byte array
    # + return - An `ftp:Error` if an error occurs, otherwise nil
    remote isolated function putBytes(string path, byte[] content) returns Error?;
  6. Upload content as a stream of byte arrays

    # Uploads file content to the FTP server as a stream of byte arrays.
    # + path - The path to the file on the FTP server
    # + content - The file content as a stream of byte arrays
    # + return - An `ftp:Error` if an error occurs, otherwise nil
    remote isolated function putBytesFromStream(string path, stream<byte[], Error?> content) returns Error?;
  7. Upload content as a CSV stream

    # Uploads file content to the FTP server as a stream of string[].
    # + path - The path to the file on the FTP server
    # + content - The file content as a stream of string arrays
    # + return - An `ftp:Error` if an error occurs, otherwise nil
    remote isolated function putCsvFromStream(string path, stream<string[], Error?> content) returns Error?;

Examples

  1. Fetching content

    • Fetching content directly as plain text:
      string stringContent = check ftpClient->getText("/path/to/file.txt");
    • Fetching content directly as JSON:
      json jsonContent = check ftpClient->getJson("/path/to/file.json");
    • Fetching content directly as XML:
      xml xmlContent = check ftpClient->getXml("/path/to/file.xml");
    • Fetching content directly as a byte array:
      byte[] byteContent = check ftpClient->getBytes("/path/to/file");
    • Fetching content as a stream of byte arrays:
      stream<readonly & byte[], ftp:Error?> byteStream = check ftpClient->getBytesAsStream("/path/to/file", 1024);
    • Fetching content directly as CSV:
      string[][] csvContent = check ftpClient->getCsv("/path/to/file.csv");
    • Fetching content as a stream of CSV rows:
      stream<string[], ftp:Error?> csvStream = check ftpClient->getCsvAsStream("/path/to/file.csv");
    • Fetching JSON content as a Ballerina record:
      Person person = check ftpClient->getJsonWithType("/path/to/file.json");
    • Fetching XML content as a Ballerina record:
      Person person = check ftpClient->getXmlWithType("/path/to/file.xml");
    • Fetching CSV content as a Ballerina record:
      Person[] person = check ftpClient->getCsvWithType("/path/to/file.csv");
  2. Uploading content

    • Uploading content directly as a byte array:
      check ftpClient->putText("/path/to/file", "{ \"name\": \"John\", \"age: 30 }");
    • Uploading content directly as JSON:

      // 1. direct JSON content
      check ftpClient->putJson("/path/to/file.json", { "name": "John", "age: 30 });
      
      // 2. counterpart Ballerina record
      Person person = { name: "John", age: 30 };
      check ftpClient->putJson("/path/to/file.json", person);
    • Uploading content directly as XML:

      // 1. direct XML content
      check ftpClient->putXml("/path/to/file.xml", xml `<person><name>John</name><age>30</age></person>`);
      
      // 2. counterpart Ballerina record
      Person person = { name: "John", age: 30 };
      check ftpClient->putXml("/path/to/file.json", person);
    • Uploading content directly as a byte array:
      check ftpClient->putBytes("/path/to/file", [104, 101, 108, 108, 111]);
    • Uploading content as a stream of byte arrays:
      stream<Block, ftp:Error?> byteStream = ... // define the stream
      check ftpClient->putBytesFromStream("/path/to/file", byteStream);
    • Uploading content directly as CSV:

      // 1. direct CSV content
      check ftpClient->putCsv("/path/to/file.csv", [["name", "age"], ["John", "30"]]);
      
      // 2. counterpart Ballerina record
      Person person = { name: "John", age: 30 };
      check ftpClient->putCsv("/path/to/file.csv", person);
    • Uploading content as a stream of CSV rows:
      stream<string[], ftp:Error?> csvStream = ... // define the stream
      check ftpClient->putCsvFromStream("/path/to/file.csv", csvStream);

Compatibility and Migration

  1. Existing code using the get() and put() methods with the default behavior (stream of bytes) will continue to work without modifications.
  2. New code can take advantage of the new APIs for enhanced capabilities.

Risks and Assumptions

  1. The correct format of the file content is assumed based on the API used. Incorrect assumptions may lead to runtime errors.
  2. Adequate error handling should be implemented to manage data conversion errors effectively.
NipunaRanasinghe commented 5 months ago

@daneshk @shafreenAnfar appreciate your reviews/feedback on the revised proposal added in https://github.com/ballerina-platform/ballerina-library/issues/6529?notification_referrer_id=NT_kwDOAbsAmLQxMDYzNTYyMjQxMzoyOTAzMjYwMA#issuecomment-2132589005.

daneshk commented 5 months ago

There are a few minor queries. Overall LGTM

remote isolated function getJsonWithType(string path, typedesc<T> returnType = <>) returns returnType|Error;
shafreenAnfar commented 5 months ago

I wonder if we should we suffix file to all APIs as follows.

remote isolated function fileGetCsv(string path) returns string[][]|Error;
NipunaRanasinghe commented 5 months ago

Are we supporting inferring the return type in get*WithType() functions?

Yes we are. There was a mistakenly added syntax issue in the proposed *withType() APIs and, I have updated them accordingly. Thanks for pointing that out.

getCSVWithType needs to return returnType[] array instead of single value.

Yes, updated the proposal with the suggestion.

Shall we support reading text files as string content?

If we require to support that scenario (in addition to retrieving the string content as a byte array), we have the option to add a new API called getString() or, getText(). There I prefer to proceed with the name getText(), as a API named getString() might sound ambiguous (as JSON, CSV and XML datatypes can also be parsed into strings) compared to the other option.

@daneshk Please let me know your thoughts on this and, I'll update the proposal once we finalize the API name.

NipunaRanasinghe commented 5 months ago

I wonder if we should we suffix file to all APIs as follows.

remote isolated function fileGetCsv(string path) returns string[][]|Error;

@shafreenAnfar Yes we have that option too. My only concern is whether this naming convention might deviate from the existing naming conventions we have for our other FTP APIs (get, append, put, mkdir, rmdir, rename, size, list, isDirectory, delete). It seems like we have considered that having a file prefix/suffix is redundant since the FTP client itself is related to file transfer protocol (hence the DRY rule).

So if we are to do this, shouldn't we apply the same convention to the other APIs as well (e.g. rename -> fileRename, delete->fileDelete)?

WDYT?

shafreenAnfar commented 5 months ago

Interesting! I was thinking two types of APIs; one set as low level APIs (what we have now) and another set as high level APIs.

daneshk commented 5 months ago

Are we supporting inferring the return type in get*WithType() functions?

Yes we are. There was a mistakenly added syntax issue in the proposed *withType() APIs and, I have updated them accordingly. Thanks for pointing that out.

getCSVWithType needs to return returnType[] array instead of single value.

Yes, updated the proposal with the suggestion.

Shall we support reading text files as string content?

If we require to support that scenario (in addition to retrieving the string content as a byte array), we have the option to add a new API called getString() or, getText(). There I prefer to proceed with the name getText(), as a API named getString() might sound ambiguous (as JSON, CSV and XML datatypes can also be parsed into strings) compared to the other option.

@daneshk Please let me know your thoughts on this and, I'll update the proposal once we finalize the API name.

Yes, the getText() is preferred as we are going to use this API to read .txt files. I am +1

shafreenAnfar commented 5 months ago

On second thoughts, I think we can skip the file and go ahead with what @NipunaRanasinghe suggested.

NipunaRanasinghe commented 5 months ago

On second thoughts, I think we can skip the file and go ahead with what @NipunaRanasinghe suggested.

Ack @shafreenAnfar. Will proceed with the exiting design. Thanks for your valuable feedback so far on this.

NipunaRanasinghe commented 5 months ago

The implementation is currently on hold due to #42830 and #42848 release blockers.