Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

Upload files from web (Python web scraping) to Azure Blob Storage #27678

Closed jairamjidgekar closed 1 year ago

jairamjidgekar commented 1 year ago

Hi,

I have a use case where I would like to upload files from the web (.zip files obtained by scraping a website). These zip files are huge (>2 GB), and after unzipping the size increases drastically (>40 GB in some cases).

I would like to leverage Azure Blob Storage for this, using Python and Azure connectivity.

My plan is to scrape the content with Python, process the files using Apache NiFi, and load them into Azure.

Please do let me know if this approach is feasible and can be accomplished.

Thanks

xiangyan99 commented 1 year ago

Thanks for the feedback, we’ll investigate asap.

jalauzon-msft commented 1 year ago

Hi @jairamjidgekar, thanks for reaching out. We generally use GitHub issues to report and discuss bugs or specific questions about the SDK, so I'm not sure how much I can help with a high-level design question like this.

In general, the approach sounds fine, but I am not at all familiar with Apache NiFi so I can't really provide any information about that. This SDK provides a few different APIs that allow you to upload data either all at once or in pieces and Azure Storage should support your larger files without issue. All data uploaded is just binary so Azure Storage will handle your data zipped or unzipped, however you choose to upload it.
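To illustrate the "zipped or unzipped" choice, here is a minimal sketch of uploading each entry of a local zip archive as its own uncompressed blob, without extracting the archive to disk first. It assumes `container_client` is an `azure.storage.blob.ContainerClient`; the function name `upload_zip_members` and the `prefix` parameter are hypothetical, and passing `length` is an assumption meant to spare the SDK from having to work out the stream size on its own. Treat it as a starting point rather than a tested recipe for 40 GB entries.

```python
import zipfile


def upload_zip_members(container_client, zip_path, prefix=""):
    """Upload every file inside a zip archive as a separate, uncompressed blob.

    `container_client` is assumed to be an azure.storage.blob.ContainerClient.
    Returns the list of blob names that were uploaded.
    """
    uploaded = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            blob_name = prefix + info.filename
            # archive.open() returns a file-like object that decompresses
            # lazily, so the entry is never held in memory as a whole.
            with archive.open(info) as member:
                container_client.upload_blob(
                    name=blob_name,
                    data=member,
                    length=info.file_size,  # uncompressed size from the zip index
                    overwrite=True,
                )
            uploaded.append(blob_name)
    return uploaded
```

Uploading the original .zip unchanged is the simpler alternative when nothing downstream needs the unzipped contents in Azure.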

Here is a sample on how to use the SDK to upload a file to a block blob. The upload_blob method will automatically split up a large upload into smaller pieces (4 MiB by default) to optimize performance and make sure the network can handle the upload.
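For reference, a minimal sketch of that kind of upload could look like the following. It assumes the `azure-storage-blob` package is installed and that you authenticate with a connection string; the helper names `upload_file_to_blob` and `make_container_client` are hypothetical, and `max_concurrency=4` is just an illustrative value.

```python
def upload_file_to_blob(container_client, local_path, blob_name, max_concurrency=4):
    """Upload a local file as a block blob and return the resulting blob client.

    `container_client` is assumed to be an azure.storage.blob.ContainerClient.
    """
    # Passing the open file object (not its bytes) lets the SDK read and send
    # the data in chunks instead of loading the whole file into memory.
    with open(local_path, "rb") as stream:
        return container_client.upload_blob(
            name=blob_name,
            data=stream,
            overwrite=True,                   # replace an existing blob of the same name
            max_concurrency=max_concurrency,  # parallel chunk uploads
        )


def make_container_client(connection_string, container_name):
    """Build a ContainerClient (requires `pip install azure-storage-blob`)."""
    from azure.storage.blob import BlobServiceClient  # imported lazily

    service = BlobServiceClient.from_connection_string(connection_string)
    return service.get_container_client(container_name)
```

A call would then be something like `upload_file_to_blob(make_container_client(conn_str, "scraped-zips"), "data.zip", "data.zip")`, where the container and file names are placeholders.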

Another option that may or may not be useful in your case is copying a file from another web location directly into Azure Storage. Here is a sample of that. There are some nuances around authentication here if your file is not public, but it may still be possible if the file is accessible via OAuth.
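As a rough sketch of the copy approach, assuming `blob_client` is an `azure.storage.blob.BlobClient` for the destination blob and that the source URL is readable by the storage service (public or carrying its own authorization, for example a SAS token): the function name `copy_url_to_blob` and the polling interval and timeout are hypothetical choices.

```python
import time


def copy_url_to_blob(blob_client, source_url, poll_seconds=5.0, timeout_seconds=3600.0):
    """Start a server-side copy from a URL and wait for it to finish.

    `blob_client` is assumed to be an azure.storage.blob.BlobClient.
    Returns the final copy status string, for example "success".
    """
    # The service pulls the data itself, so nothing passes through this machine.
    blob_client.start_copy_from_url(source_url)

    deadline = time.monotonic() + timeout_seconds
    while True:
        status = blob_client.get_blob_properties().copy.status
        if status != "pending":
            return status  # "success", "failed", or "aborted"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"copy still pending after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

Because the transfer happens inside Azure, this avoids downloading a multi-gigabyte file locally just to upload it again, at the cost of the source having to be reachable from the storage service.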

Hopefully that helps some. Thanks.

jairamjidgekar commented 1 year ago

Thank you, Jacob (@jalauzon-msft).

I tried uploading files from my local machine to Azure using Python. It uploaded huge files (>2 GB) without any issues.

upload_blob did the trick. I will work on it further and let you know if there are any roadblocks.

Thanks again, Jairam P.