Closed leifulstrup closed 3 years ago
```python
df_test_lines = pd.read_csv(s3_location, nrows=10)
df_test_lines.head()
```
This works fine, but calling ddf.head() on the Dask dataframe that points to the same S3 CSV file throws the error above.
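For context, a minimal sketch of the Dask side of that comparison (the path and blocksize here are assumptions for illustration, and reading from S3 requires s3fs):

```python
import dask.dataframe as dd

# Hypothetical S3 path standing in for the real s3_location.
s3_location = "s3://my-bucket/my-file.csv"

# Lazily split the ~1.8 GB CSV into modest blocks; nothing is read yet.
ddf = dd.read_csv(s3_location, blocksize="64MB")

# head() actually reads the first partition, which is where
# the error in this report surfaced.
print(ddf.head())
```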
@leifulstrup thank you for the report! It looks from your initial VersionMismatchWarning that you don't have blosc and lz4 installed locally; can you install those into your environment and try again? We're working on a better error message!
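For anyone hitting the same warning, a quick sketch for checking whether those packages are importable in the local (client) environment; the package list is just the one from this report:

```python
# Check the client environment for the packages the scheduler reported.
from importlib import util

for pkg in ("blosc", "lz4", "requests"):
    print(pkg, "installed" if util.find_spec(pkg) else "MISSING")

# If any are missing, install them into the same env, e.g.:
#   conda install -c conda-forge blosc lz4 requests
```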
@necaris that worked. I had to restart the Jupyter notebook on my local machine. After seeing your note, I realize the warning says blosc and lz4 are missing. I had created a new conda env to test coiled. Also, I tried accessing the file using https instead of s3, and it failed due to the lack of the requests package.
One thing that was not clear to me before your note was that the warning was alerting me to a mismatch between what was on my local machine (the client) and the server (the scheduler in your table's columns). That makes sense now, but the architecture of how the local machine interacts with the cluster was unclear to me. I have only set up clusters on my local network of machines, launching the scheduler and workers on one of those machines.
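For reference, that client/scheduler/worker comparison can also be pulled out explicitly with dask.distributed's Client.get_versions; a sketch, assuming client is the connected Client shown later in this thread:

```python
# check=True raises on critical mismatches; check=False returns the raw data.
versions = client.get_versions(check=False)

# Compare the package versions the warning table was built from.
print(versions["client"]["packages"])
print(versions["scheduler"]["packages"])
```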
coiled seems like a huge advance in abstracting computing in the cloud. Doing the same for storage would be great too.
Looking forward to experimenting with this more.
I am testing this with data stored in S3. Are there ways to estimate what charges I may incur, given S3 as my data source and the way that Dask reads data? I am concerned about racking up a large S3 data-out (egress) bill.
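As a back-of-envelope sketch for that egress concern (the $0.09/GB figure is an assumed ballpark for AWS internet data-transfer-out, not a quoted price):

```python
# Rough S3 data-transfer-out estimate for repeated full scans.
file_size_gb = 1.8          # the CSV from this issue
full_scans = 20             # hypothetical number of experiments
egress_usd_per_gb = 0.09    # assumed ballpark internet egress rate

print(f"~${file_size_gb * full_scans * egress_usd_per_gb:.2f}")  # ~$3.24
# Note: transfers that stay within the same AWS region are typically free.
```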
Kudos to you and the team working on this.
Thank you for the feedback -- we'll have to spend some more time on the documentation there! In the meantime, if you'd like to be sure you have the same environment locally that's running on the cluster, please check out the detailed documentation for software environments at https://docs.coiled.io/user_guide/software_environment_local.html :smile:
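A sketch of what that might look like with the Coiled client (the environment name and pins are hypothetical; see the linked docs for the actual workflow):

```python
import coiled

# Build a named software environment on Coiled from a conda spec,
# then reuse the same spec locally so client and cluster match.
coiled.create_software_environment(
    name="coiled-test-env",  # hypothetical name
    conda=["python=3.8", "dask=2.28.0", "distributed=2.28.0",
           "blosc", "lz4", "requests"],
)
```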
Looking forward to more feedback!
Are these S3 buckets that you own? I am not an S3 expert, but my understanding is that unless they are buckets you own and you have them configured so that you pay for transfers, the costs fall on the requester (i.e. us). As Coiled is currently free for beta users, you should have nothing to worry about!
However, this would be a great thing to add to our complete cost estimation feature post-beta, so thank you for that reminder!
AWS does not charge for data access if you are within AWS, so this is likely free for you.
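If you want to confirm where a bucket lives before relying on that, a sketch using boto3 (the bucket name is hypothetical, and AWS credentials are assumed to be configured):

```python
import boto3

s3 = boto3.client("s3")
resp = s3.get_bucket_location(Bucket="my-coiled-test-bucket")
# LocationConstraint is None for us-east-1, a region string otherwise.
print(resp["LocationConstraint"] or "us-east-1")
```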
@mrocklin thank you. The coiled servers are running on AWS? Good to know. I uploaded the CSV files to S3 to test coiled. Very impressed. I'll do some more testing over the next week and write a Medium post featuring it. I have been promoting Dask in these:
Correct. Currently we're launching machines on AWS, and we try to make interacting with data there a pleasant experience.
I'm glad to hear about the post. Let us know if we can help or amplify.
Hello, I'm going through the open issues on this repository, and I'm closing some of them. It seems that this issue might be solved. We have a "Why do I get a Version Mismatch warning?" entry in our FAQ; perhaps that clarifies why this is happening? 🤔
I'm closing this issue, but please feel free to re-open or create a new issue if you encounter any problems (or if this issue is still happening to you) 😄
Created a Dask dataframe (ddf) with dd.read_csv pointing to a ~1.8 GB CSV file on S3.
BTW: I also saw this warning after:

```python
from dask.distributed import Client

client = Client(cluster)
print('Dashboard:', client.dashboard_link)
```
```
/Users/leifulstrup/opt/anaconda3/envs/coiled_env/lib/python3.8/site-packages/distributed/client.py:1130: VersionMismatchWarning: Mismatched versions found

+-------------+--------+-----------+---------+
| Package     | client | scheduler | workers |
+-------------+--------+-----------+---------+
| blosc       | None   | 1.9.2     | None    |
| dask        | 2.28.0 | 2.23.0    | None    |
| distributed | 2.28.0 | 2.25.0    | None    |
| lz4         | None   | 3.1.0     | None    |
| toolz       | 0.11.1 | 0.10.0    | None    |
+-------------+--------+-----------+---------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
```
ddf.head() results in: