Avaiga / taipy-core

A Python library to build powerful and customized data-driven back-end applications.
Apache License 2.0

Added aws s3 data node #828

Closed Forchapeatl closed 11 months ago

Forchapeatl commented 11 months ago

Added AWS S3 DataNode

[screenshot]

Fixes Avaiga/taipy#536

Todo

Forchapeatl commented 11 months ago

Hello @toan-quach and @jrobinAV, please permit me to tag you directly. For some reason, I can't seem to request a review. Please have a look at my PR.

toan-quach commented 11 months ago

@Forchapeatl awesome!! Thanks for tagging us! We will review as soon as we can :D Thanks again for your help!

jrobinAV commented 11 months ago

Hello @Forchapeatl.

Thank you very much for your contribution. For your information, we are currently working on a significant repository refactoring: we are merging sub-repositories into a single repository. For instance, the taipy-core repository is about to be merged into the taipy repository.

This should not impact your work.

But just so you are not surprised, instead of merging your PR, we will move your code from this branch (in taipy-core) into a new branch of another repository (taipy). We will let you know and let you review the code.

Thank you for your understanding.

Forchapeatl commented 11 months ago

Hello @jrobinAV, thank you for your kind review. The requested changes have been made. The S3ObjectDataNode has been implemented and tested with the _read and _write functions in their expected signatures.

The issue now is the _append method. I had a chat with @trgiangdo about it. I saw how appending was possible when we were adding files to a single bucket, but our DataNode references a single object, not a bucket. Handling this method should be possible by copying all the data to a new object, appending the new content, and writing it back to S3, but I don't think it's worth the effort when the object data (file) is very large (I may be wrong). What is your opinion?

Reference: Append Data to S3 Object
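
For illustration, the copy-and-rewrite approach would look roughly like this with boto3 (a sketch only, not part of this PR; append_to_s3_object is a hypothetical helper, and it assumes configured credentials and an object small enough to fit in memory):

import boto3

def append_to_s3_object(bucket: str, key: str, new_content: bytes) -> None:
    client = boto3.client("s3")
    # S3 objects are immutable, so download the existing body first...
    existing = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    # ...then overwrite the object with the combined content.
    client.put_object(Bucket=bucket, Key=key, Body=existing + new_content)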

trgiangdo commented 11 months ago

Thank you for the information. _append() is an optional method, so if S3 does not support appending, then it's totally fine.

trgiangdo commented 11 months ago

About the DataNodeConfig, you probably want to take a look at this documentation page.

Then, you can take a look at the src/taipy/core/config/data_node_config.py file, where we define different configuration methods for the different data node types, and implement a new configure_s3_data_node() method.
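
As a rough sketch of the shape such a method could take (hypothetical code, not the actual data_node_config.py source; the _registry dict and the "s3_object" storage_type string are assumptions for illustration):

from typing import Any, Dict

class ConfigSketch:
    _registry: Dict[str, Dict[str, Any]] = {}

    @classmethod
    def configure_s3_object_data_node(
        cls,
        id: str,
        aws_access_key: str,
        aws_secret_access_key: str,
        aws_s3_bucket_name: str,
        aws_s3_object_key: str,
        **properties: Any,
    ) -> Dict[str, Any]:
        # Bundle the S3-specific attributes with any extra properties and
        # register the node under the new "s3_object" storage type.
        properties.update(
            storage_type="s3_object",
            aws_access_key=aws_access_key,
            aws_secret_access_key=aws_secret_access_key,
            aws_s3_bucket_name=aws_s3_bucket_name,
            aws_s3_object_key=aws_s3_object_key,
        )
        cls._registry[id] = properties
        return properties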

For now, the todo list:

jrobinAV commented 11 months ago

> The issue now is the _append method. [...] Handling this method should be possible by copying all the data to a new object, appending the new content, and writing it back to S3, but I don't think it's worth the effort when the object data (file) is very large (I may be wrong). What is your opinion?

Hello,

Yes, I agree with you. Let's forget about the append method. It is an optional method anyway.

Thank you so much.

Forchapeatl commented 11 months ago

Hello everyone, the requested changes have been made. This includes:

[screenshot]

Please let me know if more changes are required.

Forchapeatl commented 11 months ago

[screenshots]

# tests3.py
import taipy as tp
from taipy import Config, Core
# Note: boto3 must be installed for the S3 object data node to work,
# but the script itself does not need to import it.

def build_message(s3_object, name):
    return f"Hello {name}!"

s3_object_cfg = Config.configure_s3_object_data_node(
    id="test_s3_object",
    aws_access_key="ENTER YOUR AWS ACCESS_KEY",
    aws_secret_access_key="ENTER YOUR AWS ACCESS_SECRET_KEY",
    aws_s3_bucket_name="ENTER YOUR BUCKET NAME",  # must be an already existing bucket
    aws_s3_object_key="taipy_object",
)

input_name_data_node_cfg = Config.configure_data_node(id="input_name")
message_data_node_cfg = Config.configure_data_node(id="message")
build_msg_task_cfg = Config.configure_task(
    "build_msg",
    build_message,
    input=[s3_object_cfg, input_name_data_node_cfg],
    output=[message_data_node_cfg],
)
scenario_cfg = Config.configure_scenario("scenario", task_configs=[build_msg_task_cfg])

if __name__ == "__main__":
    Core().run()
    hello_scenario = tp.create_scenario(scenario_cfg)

    # Write to, then read back from, the S3 object data node.
    myS3ObjectDataNode = hello_scenario.test_s3_object
    myS3ObjectDataNode.write("This is nice , Hello , HI , WHAT? ,WHEN! Who?")
    ans = myS3ObjectDataNode.read()
    print(ans)

    hello_scenario.input_name.write("Taipy")
    hello_scenario.submit()

Hello everyone, things seem to be working fine now. The S3ObjectDataNode does get created. However, it seems to be failing the storage_type tests.

[screenshot]

Is caplog.text a file or a variable? Please give me more information on it. My tests fail on these lines:

https://github.com/Forchapeatl/taipy-core/blob/57b29061626186aa4839528f8eaf81baac15cf5e/tests/core/test_core.py#L35
https://github.com/Forchapeatl/taipy-core/blob/57b29061626186aa4839528f8eaf81baac15cf5e/tests/core/config/test_data_node_config.py#L132
https://github.com/Forchapeatl/taipy-core/blob/57b29061626186aa4839528f8eaf81baac15cf5e/tests/core/config/checkers/test_data_node_config_checker.py#L94

Please bear with me on the tests3.py code sample. The docs seem advanced for me.

trgiangdo commented 11 months ago

The failed tests are because you updated the error message with the new storage_type, but the expected_error_message in those tests was not updated.

A small update should fix the problem.
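
(To answer the earlier question: caplog is pytest's built-in log-capture fixture, and caplog.text is a string property holding the captured log output, not a file.) The kind of update needed looks roughly like this; the wording below is approximated for illustration, the exact text lives in the checker:

# Approximated, not the exact checker message: the list of valid storage
# types asserted in the tests must now also include the new "s3_object" entry.
expected_error_message = (
    "`storage_type` field of DataNodeConfig `new` must be either csv, sql_table, sql,"
    " mongo_collection, pickle, excel, generic, json, parquet, in_memory, or s3_object."
)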

Forchapeatl commented 11 months ago

> The failed tests are because you updated the error message with the new storage_type, but the expected_error_message in those tests was not updated.
>
> A small update should fix the problem.

Wow, thank you @trgiangdo, I managed to pass 2 more tests. Any hints on this file?

trgiangdo commented 11 months ago

I believe it has the same problem. A small update on the expected_error_message should fix it.

Forchapeatl commented 11 months ago

[screenshot]

Passed all 1108 tests. I believe the S3ObjectDataNode has been successfully added and tested.

trgiangdo commented 11 months ago

The tests are failing:

Can you check again?

trgiangdo commented 11 months ago

The tests are up and running. That's great 😁

Can you take a look at the linter error in https://github.com/Avaiga/taipy-core/actions/runs/7102562575/job/19333215991?pr=828? This is so that we have consistent code style in our project.

Forchapeatl commented 11 months ago

> The tests are up and running. That's great 😁

Thank you. The code should be working fine now.

Forchapeatl commented 11 months ago

Well :) this is unexpected, @trgiangdo. I did run the pylint test as you said. I am a bit reluctant to fix this error because I am unable to reproduce it in my environment; let me know what you think.

[screenshot]

trgiangdo commented 11 months ago

About the error:

Running: pycodestyle --max-line-length=120 --ignore=E121,E123,E126,E226,E24,E704,W503,W504,E203 .
./tests/core/data/test_aws_s3_data_node.py:56:62: E231 missing whitespace after ','
./tests/core/data/test_aws_s3_data_node.py:76:61: E231 missing whitespace after ','

You just need to add a space after the comma on line 56 and line 76, and it should work.
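
For illustration (the real offending lines are in tests/core/data/test_aws_s3_data_node.py; the snippet below is made up to show the rule):

data = {"a":1,"b":2}     # before: E231, no space after ','
data = {"a": 1, "b": 2}  # after: passes pycodestyle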

trgiangdo commented 11 months ago

Hello @Forchapeatl.

FYI, this taipy-core repository is merged into the taipy repository and is going to be archived soon.

So, instead of merging your PR, can you create a new PR in the main taipy repository, so that you are counted as an official Taipy contributor by GitHub? The content of the PR should be very similar.

trgiangdo commented 11 months ago

Migrated to https://github.com/Avaiga/taipy/pull/585