Open hosseinfani opened 1 year ago
@MarcoKurepa
The desired steps for the system to load the dataset are as follows:
The necessary code should be inserted here.
@MarcoKurepa Hey Marco!
Just wanted to check in and see how things are going for you. Is everything going smoothly? Give me an update whenever you can.
Hey Reza, things are going well so far. I am still working on the Kaggle problem. I haven't had much time to work as I have driver's ed this week, but I plan on beginning work on this issue this weekend; it should be done by Monday.
I was also wondering if I could come in to work on Monday around 9?
Great to hear that. Good luck with your test :) Yes, I'll be at the lab from 9.
Hello Reza, I have configured the workspace on my computer and am beginning to work on this issue now. What API are we downloading the datasets from?
@MarcoKurepa Hi, the datasets are uploaded in Microsoft Teams; I made the whole directory public so everybody can use them. Here is the link (ignore output directory). Every single file must be downloadable. If it doesn’t work out, tell me to create a shareable link for every single file in that directory.
@hosseinfani I am trying to access the SharePoint site, but I get an error saying that a client ID and token are required. This link details the process of creating a client ID and token, and Reza also mentioned that we could use Google Drive. How would you like to proceed?
Here is the code I used:
# AuthenticationContext was used below but never imported in the original snippet
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext


def download_files_from_sharepoint(site_url, local_download_path):
    folder_path = "Shared Documents/Team Formation/OpeNTF0.2.0.0/data/preprocessed"

    # Connect to the SharePoint site
    auth_context = AuthenticationContext(site_url)
    client_context = ClientContext(site_url, auth_context)
    auth_context.acquire_token_for_app(client_id="<client_id>", client_secret="<client_secret>")
    client_context.execute_query()

    # Load the target folder, then the list of files it contains
    target_folder = client_context.web.get_folder_by_server_relative_url(folder_path)
    client_context.load(target_folder)
    client_context.execute_query()
    files = target_folder.files
    client_context.load(files)
    client_context.execute_query()

    # Iterate over the files in the folder and download them
    for file in files:
        local_file_path = "{}/{}".format(local_download_path, file.properties["Name"])
        try:
            with open(local_file_path, "wb") as local_file:
                file.download(local_file)
            print("Downloaded file: {}".format(file.properties["Name"]))
        except Exception as e:
            print("Error downloading file: {}. {}".format(file.properties["Name"], str(e)))
    print("All files have been downloaded successfully.")


site_url = "https://uwin365.sharepoint.com/:f:/s/cshfrg-TeamFormation/Eo_dbQ5f4mJLqYSVn3YCPu4BD4m4k26E6dtN3nu-Uv2_Ww?e=N9EcsU"
local_download_path = "./test"
download_files_from_sharepoint(site_url, local_download_path)
An alternative method could be web scraping, but that is likely just overcomplicating the issue.
@hosseinfani As Marco mentioned, downloading from SharePoint needs authentication; however, I found a way to download from SharePoint, and he is trying it now. If it doesn't work, I think we need to upload the datasets to other storage such as Google Drive.
@rezaBarzgar and @MarcoKurepa We created a ticket with the university's IT service to obtain the necessary credentials to access the SharePoint API: https://uwindsor.teamdynamix.com/TDClient/1975/Portal/Requests/TicketRequests/TicketDet.aspx?TicketID=bM2rC7bfUSodIPTdGScjHw__
I'm unable to access this link, have they gotten back to us yet?
@MarcoKurepa Not yet, but it has moved to the "In Progress" state :)
@rezaBarzgar @hosseinfani From the verbose logs, we can gather a few insights:
The initial attempt to connect to uwin365.sharepoint.com resulted in a 401 Unauthorized error. This indicates that the initial request was unauthenticated.
The code then reached out to accounts.accesscontrol.windows.net, which returned a 200 OK status. This suggests that the token was successfully retrieved using the client credentials.
With the token, the code made another request to uwin365.sharepoint.com, specifically to the _api/contextInfo endpoint, but it received a 403 Forbidden response. This indicates that the token doesn't have the right permissions or the SharePoint site is restricting access.
The 403 Forbidden error from SharePoint suggests that:
The Azure AD application might not have the necessary permissions to access this SharePoint site or specific resource.
The SharePoint site might have custom permissions or restrictions in place.
import os
import tempfile
import logging
from office365.sharepoint.client_context import ClientContext
from office365.runtime.auth.client_credential import ClientCredential
# Enable detailed logging for 'requests' library
logging.basicConfig(level=logging.DEBUG)
CLIENT_ID = "e89ea504-0eac-4733-a430-1d8320165f73"
CLIENT_SECRET = "rUx8Q~GF5Y6bEwfftwslzjP~4qkIghQYiEAJ8bJf"
TENANT_ID = "12f933b3-3d61-4b19-9a4d-689021de8cc9"
SITE_URL = "https://uwin365.sharepoint.com/s/cshfrg-TeamFormation"
client_credentials = ClientCredential(CLIENT_ID, CLIENT_SECRET)
client = ClientContext(SITE_URL).with_credentials(client_credentials)
sharing_link_url = "https://uwin365.sharepoint.com/:f:/s/cshfrg-TeamFormation/Eo_dbQ5f4mJLqYSVn3YCPu4BD4m4k26E6dtN3nu-Uv2_Ww?e=N9EcsU"
download_path = os.path.join(tempfile.mkdtemp(), "teams.pkl")
with open(download_path, "wb") as local_file:
    file = client.web.get_file_by_guest_url(sharing_link_url).download(local_file).execute_query()
print("[Ok] file has been downloaded into: {0}".format(download_path))
It looks like we'll need to ask IT to expand the permissions granted to our Azure AD application.
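The three responses in the log follow standard HTTP semantics, which is what separates "no credentials sent" (401) from "credentials accepted but not permitted" (403). A minimal sketch of that reading is below; `explain_status` is a hypothetical helper written for illustration, not part of the office365 library, and it only encodes generic HTTP meaning.

```python
from http import HTTPStatus


def explain_status(code: int) -> str:
    """Return a short reading of an HTTP status code seen while calling SharePoint.

    Hypothetical helper for illustration; it only encodes generic HTTP semantics.
    """
    status = HTTPStatus(code)
    if status == HTTPStatus.UNAUTHORIZED:  # 401
        return "no valid credentials were sent with the request"
    if status == HTTPStatus.FORBIDDEN:  # 403
        return "credentials were accepted, but they lack permission for this resource"
    if status == HTTPStatus.OK:  # 200
        return "request succeeded"
    return f"unhandled status: {status.value} {status.phrase}"


# The sequence reported in the verbose logs above.
for step, code in [("initial request", 401), ("token request", 200), ("contextInfo", 403)]:
    print(f"{step}: {code} -> {explain_status(code)}")
```

Read this way, the 403 on the last step points at the permissions granted to the application rather than at the client ID or secret.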
@MarcoKurepa I will post this to the IT ticket and will get back to you soon.
@MarcoKurepa @rezaBarzgar I spent the whole night until now figuring out how the SharePoint API works, debugging the Python library, etc. Here is my understanding:
Authorization: Bearer <accessToken>
(3) receive the result. Finally, here is the code, which I think shows that the problem is on the server side. So what @MarcoKurepa found is essentially correct.
https://colab.research.google.com/drive/1LupHn1_7tQ-6K3vNSzqtBdRdJouHKPw-?usp=sharing
from office365.runtime.auth.client_credential import ClientCredential
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
# TENANT_ID = "12f933b3-3d61-4b19-9a4d-689021de8cc9"
# site_url = f"https://{TENANT_ID}.sharepoint.com" >> not correct use for the lib
site_url = "https://uwin365.sharepoint.com"
# file_url = "/:u:/r/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/indexes.pkl"
full_url = "https://uwin365.sharepoint.com/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/indexes.pkl"
file_url = "/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/indexes.pkl"
folder_url = '/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/'
client_id = "e89ea504-0eac-4733-a430-1d8320165f73"
client_secret = "rUx8Q~GF5Y6bEwfftwslzjP~4qkIghQYiEAJ8bJf"
client_credentials = ClientCredential(client_id,client_secret)
ctx = ClientContext(site_url).with_credentials(client_credentials)
with open('bk.jpg', "wb") as local_file:
    file = ctx.web.get_file_by_server_relative_url(file_url).download(local_file).execute_query()
    # file = File.from_url(file_url).with_credentials(client_credentials).download(local_file).execute_query()
    # file = ctx.web.get_file_by_guest_url(full_url).download(local_file).execute_query()
    # file = ctx.web.get_file_by_url(full_url).download(local_file).execute_query()
In a browser that is already logged in:
https://uwin365.sharepoint.com/_api/Web/getFileByServerRelativeUrl('/sites/cshfrg-TeamFormation/indexes.pkl')?$select=ServerRelativePath,Id
with correct url for api
correct url directly
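To make the token-based flow described above concrete, here is a rough sketch of the second step only: attaching the `Authorization: Bearer <accessToken>` header to a request against the same `_api/Web/getFileByServerRelativeUrl(...)` endpoint as the browser URL above. It uses only the standard library and builds the request without sending it. The `build_file_request` helper and the `Accept` header value are assumptions for illustration, and the token is a placeholder.

```python
from urllib.parse import quote
from urllib.request import Request


def build_file_request(site_url: str, server_relative_path: str, access_token: str) -> Request:
    """Build (but do not send) a GET request for a file's metadata.

    The endpoint shape follows the browser URL quoted above; the access token
    is assumed to have been obtained beforehand with the client credentials.
    """
    # Percent-encode spaces and similar characters, but keep the path slashes.
    encoded_path = quote(server_relative_path, safe="/")
    url = (
        f"{site_url}/_api/Web/getFileByServerRelativeUrl('{encoded_path}')"
        "?$select=ServerRelativePath,Id"
    )
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json;odata=verbose",  # assumed response format
    }
    return Request(url, headers=headers, method="GET")


req = build_file_request(
    "https://uwin365.sharepoint.com",
    "/sites/cshfrg-TeamFormation/indexes.pkl",
    access_token="<accessToken>",  # placeholder, not a real token
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending `req` with `urllib.request.urlopen` would be the third step; with the current application permissions it would presumably return the same 403 discussed earlier.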
@3ripleM I need your help in this please
@3ripleM please update your finding here. tnx.
We need to provide an API that fetches the datasets by a unique ID, the way gensim or the bars lib do when they read standard datasets with their splits.
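One possible shape for such an API, as a minimal sketch only: a registry maps each dataset ID to its downloadable files, and `load` fetches them once into a local cache and returns the paths. The names `DATASETS` and `load`, the `./data` cache directory, and the registry layout are all hypothetical; the registry is left empty because the final public download links have not been settled in this thread.

```python
import shutil
from pathlib import Path, PurePosixPath
from typing import Dict, Optional
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical registry: dataset id -> {split or file name -> download URL}.
# Left empty here; the real, publicly downloadable links are still to be decided.
DATASETS: Dict[str, Dict[str, str]] = {}


def load(dataset_id: str, cache_dir: str = "./data",
         registry: Optional[Dict[str, Dict[str, str]]] = None) -> Dict[str, Path]:
    """Download every file of `dataset_id` once and return {split: local path}."""
    registry = DATASETS if registry is None else registry
    if dataset_id not in registry:
        raise KeyError(f"unknown dataset id: {dataset_id!r}")

    target_dir = Path(cache_dir) / dataset_id
    target_dir.mkdir(parents=True, exist_ok=True)

    paths = {}
    for split, url in registry[dataset_id].items():
        # Use the last path segment of the URL as the local file name.
        file_name = PurePosixPath(urlparse(url).path).name
        local_path = target_dir / file_name
        if not local_path.exists():  # reuse the cached copy on later calls
            with urlopen(url) as response, open(local_path, "wb") as out:
                shutil.copyfileobj(response, out)
        paths[split] = local_path
    return paths
```

Once the registry is filled in, a call such as `load("dblp.v12.json")` (the ID is only an example) would return the local paths of that dataset's files, similar in spirit to how `gensim.downloader.load` resolves a dataset name.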