Dataset Retrieval - Githubissues

hosseinfani commented 1 year ago

We need to provide an API to fetch the datasets by a unique id, the way gensim or the bars lib do when they read standard datasets with splits.
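
As a rough illustration of the request above, fetching by unique id could start from a small registry lookup, in the spirit of `gensim.downloader.load(name)`. Everything in this sketch is an assumption made for illustration: the `DatasetInfo` class, the `describe` function, the `dblp` id, the split names, and the `example.org` URL are placeholders, not part of OpeNTF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    url: str       # placeholder location of the preprocessed files
    splits: tuple  # names of the standard splits shipped with the dataset


# Hypothetical registry: ids, URLs, and split names are illustrative only.
REGISTRY = {
    "dblp": DatasetInfo(url="https://example.org/datasets/dblp", splits=("train", "test")),
}


def describe(dataset_id):
    """Return the registry entry for a dataset id, or raise a helpful KeyError."""
    try:
        return REGISTRY[dataset_id]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise KeyError(f"unknown dataset id {dataset_id!r}; known ids: {known}") from None
```

A caller would then resolve an id with `describe("dblp")` and hand the resulting URL to whatever download routine is eventually chosen.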

rezaBarzgar commented 1 year ago

@MarcoKurepa

Loading Dataset Process

The desired steps for the system to load the dataset are as follows:

Initially, the system attempts to load the pre-processed datasets (This step has already been completed).
If the system encounters an error while loading the datasets, it proceeds to download them from the API (This issue page is about this particular step).
In the event that the system fails to download the datasets, it proceeds to generate the preprocessed datasets (This step has already been completed).

The necessary code should be inserted here.
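
The three steps above amount to a fallback chain. A minimal sketch, assuming each step is passed in as a zero-argument callable that either returns the dataset or raises; the function name `load_dataset` and its parameters are hypothetical and do not reflect the actual OpeNTF code:

```python
def load_dataset(load_local, download, generate):
    """Try each strategy in order and return the first result that succeeds."""
    errors = []
    steps = (("load", load_local), ("download", download), ("generate", generate))
    for name, step in steps:
        try:
            return step()
        except Exception as exc:  # any failure moves on to the next strategy
            errors.append(f"{name}: {exc}")
    # Every strategy failed: surface all collected errors at once.
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Collecting the errors rather than discarding them keeps the reason for each fallback visible when the last step also fails.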

rezaBarzgar commented 1 year ago

@MarcoKurepa Hey Marco!

Just wanted to check in and see how things are going for you. Is everything going smoothly? Give me an update whenever you can.

MarcoKurepa commented 1 year ago

Hey Reza, things are going well so far. I am still working on the Kaggle problem. I haven't had much time to work as I have driver's ed this week, but I plan to start on this issue this weekend; it should be done by Monday.

I was also wondering if I could come in to work on Monday around 9?

rezaBarzgar commented 1 year ago

Great to hear that. Good luck with your test :) Yes, I'll be at the lab from 9.

MarcoKurepa commented 1 year ago

Hello Reza, I have configured the workspace on my computer and am beginning to work on this issue now. What API are we downloading the datasets from?

rezaBarzgar commented 1 year ago

@MarcoKurepa Hi, the datasets are uploaded in Microsoft Teams; I made the whole directory public so everybody can use them. Here is the link (ignore output directory). Every single file must be downloadable. If it doesn’t work out, tell me to create a shareable link for every single file in that directory.

MarcoKurepa commented 1 year ago

@hosseinfani I am trying to access the SharePoint site, but I get an error requiring a client id and token. This link details the process of creating a client id and token, and Reza also mentioned that we could use Google Drive. How would you like to proceed?

Here is the code I used:

from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext

def download_files_from_sharepoint(site_url, local_download_path):

    folder_path = "Shared Documents/Team Formation/OpeNTF0.2.0.0/data/preprocessed"

    # Connect to the SharePoint site
    auth_context = AuthenticationContext(site_url)
    client_context = ClientContext(site_url, auth_context)
    auth_context.acquire_token_for_app(client_id="<client_id>", client_secret="<client_secret>")
    client_context.execute_query()

    target_folder = client_context.web.get_folder_by_server_relative_url(folder_path)
    client_context.load(target_folder)
    client_context.execute_query()

    files = target_folder.files
    client_context.load(files)
    client_context.execute_query()

    # Iterate over the files in the folder and download them
    for file in files:
        local_file_path = "{}/{}".format(local_download_path, file.properties["Name"])
        try:
            with open(local_file_path, "wb") as local_file:
                # download() only queues the request; execute_query() sends it
                file.download(local_file).execute_query()
            print("Downloaded file: {}".format(file.properties["Name"]))
        except Exception as e:
            print("Error downloading file: {}. {}".format(file.properties["Name"], str(e)))

    print("All files have been downloaded successfully.")

site_url = "https://uwin365.sharepoint.com/:f:/s/cshfrg-TeamFormation/Eo_dbQ5f4mJLqYSVn3YCPu4BD4m4k26E6dtN3nu-Uv2_Ww?e=N9EcsU"
local_download_path = "./test"

download_files_from_sharepoint(site_url, local_download_path)

An alternative method could be web scraping, but that is likely just overcomplicating the issue.

rezaBarzgar commented 1 year ago

@hosseinfani As Marco mentioned, downloading from SharePoint needs authentication; however, I found a way that we can download from SharePoint, and he is trying to use it. If it doesn't work, I think we need to upload the datasets to other storage such as Google Drive.

hosseinfani commented 1 year ago

@rezaBarzgar and @MarcoKurepa We created a ticket with the university's IT service to obtain the credentials needed to access the SharePoint API:

https://uwindsor.teamdynamix.com/TDClient/1975/Portal/Requests/TicketRequests/TicketDet.aspx?TicketID=bM2rC7bfUSodIPTdGScjHw__

MarcoKurepa commented 1 year ago

Preliminary Architecture Diagram (UML): https://lucid.app/lucidchart/f75885be-0012-4a99-9026-1c0486b9ec5d/edit?viewport_loc=110%2C-322%2C3328%2C1582%2C0_0&invitationId=inv_1676815a-49d0-4696-8e6d-e8bcf22f8be3

MarcoKurepa commented 1 year ago

I'm unable to access the IT ticket link. Have they gotten back to us yet?

MarcoKurepa commented 1 year ago

Reza Approved 👍 Generate Sparse Matrices Architecture Diagram

hosseinfani commented 1 year ago

@MarcoKurepa not yet. but it went to "In Progress" state :)

MarcoKurepa commented 1 year ago

@rezaBarzgar @hosseinfani From the verbose logs, we can gather a few insights:

The initial attempt to connect to uwin365.sharepoint.com resulted in a 401 Unauthorized error. This indicates that the initial request was unauthenticated.
The code then reached out to accounts.accesscontrol.windows.net, which returned a 200 OK status. This suggests that the token was successfully retrieved using the client credentials.
With the token, the code made another request to uwin365.sharepoint.com, specifically to the _api/contextInfo endpoint, but it received a 403 Forbidden response. This indicates that the token doesn't have the right permissions or the SharePoint site is restricting access.

The 403 Forbidden error from SharePoint suggests that:

The Azure AD application might not have the necessary permissions to access this SharePoint site or specific resource.
The SharePoint site might have custom permissions or restrictions in place.
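
The reading of the logs above can be condensed into a small helper that maps a status code to the likely cause. This is only a sketch of the reasoning in this comment; the function name `diagnose` and the message wording are assumptions, and real responses may carry more detail in their bodies.

```python
def diagnose(status_code):
    """Map an HTTP status from SharePoint to the likely cause discussed above."""
    if status_code == 401:
        return "unauthenticated: the request carried no valid token"
    if status_code == 403:
        return "forbidden: the token was accepted but lacks permission on this site or resource"
    if 200 <= status_code < 300:
        return "ok"
    return "unexpected status"
```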


import os
import tempfile
import logging
from office365.sharepoint.client_context import ClientContext
from office365.runtime.auth.client_credential import ClientCredential

# Enable detailed logging for 'requests' library
logging.basicConfig(level=logging.DEBUG)

# Credentials redacted; never post real secrets in a public issue
CLIENT_ID = "<client_id>"
CLIENT_SECRET = "<client_secret>"
TENANT_ID = "12f933b3-3d61-4b19-9a4d-689021de8cc9"
SITE_URL = "https://uwin365.sharepoint.com/s/cshfrg-TeamFormation"

client_credentials = ClientCredential(CLIENT_ID, CLIENT_SECRET)

client = ClientContext(SITE_URL).with_credentials(client_credentials)

sharing_link_url = "https://uwin365.sharepoint.com/:f:/s/cshfrg-TeamFormation/Eo_dbQ5f4mJLqYSVn3YCPu4BD4m4k26E6dtN3nu-Uv2_Ww?e=N9EcsU"

download_path = os.path.join(tempfile.mkdtemp(), "teams.pkl")
with open(download_path, "wb") as local_file:
    file = client.web.get_file_by_guest_url(sharing_link_url).download(local_file).execute_query()
print("[Ok] file has been downloaded into: {0}".format(download_path))

MarcoKurepa commented 1 year ago

It looks like we'll need to ask IT to expand the permissions granted to our Azure AD application.

hosseinfani commented 1 year ago

@MarcoKurepa I will post this to the IT ticket and will get back to you soon.

hosseinfani commented 1 year ago

@MarcoKurepa @rezaBarzgar I spent the whole night until now figuring out how the SharePoint API works, debugging the Python library, etc. Here is my understanding:

The Python lib is working fine, with no bug, and there is no need for tenant_id. The lib is just a wrapper that calls the REST APIs (it builds the URLs).
tenant_id seems to be another name for a company's SharePoint site, so uwin365.sharepoint.com =?= <tenant_id>.sharepoint.com
To work with files, we need to use the /_api/web collection of APIs or methods: https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/working-with-folders-and-files-with-rest
The site URL and file URL should be set correctly, as in the code below.
Basically, the flow is: (1) the SharePoint API needs an accessToken, so the caller first requests one by providing client_id and client_secret; (2) once it has the accessToken, the caller calls an API with the authorization header set to it, i.e., Authorization: Bearer <accessToken>; (3) the caller receives the result.
If no login attempt has been made or the cookies have been removed, any browser needs the same authentication.
Once logged in, we can test the API from the browser.
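
Steps (2) and (3) of the flow above can be sketched without the wrapper library by building the header and the REST URL directly. This is a minimal sketch under assumptions: the helper names `auth_headers` and `file_api_url` are hypothetical, the token is assumed to have been obtained already in step (1), and the path is assumed to be passed unencoded (an already-encoded `%20` would be encoded a second time).

```python
from urllib.parse import quote


def auth_headers(access_token):
    """Header carrying the bearer token on every API call (step 2)."""
    return {"Authorization": f"Bearer {access_token}"}


def file_api_url(site_url, server_relative_path):
    """REST URL that returns the raw content of a file via /_api/web."""
    # Percent-encode spaces and other special characters, but keep the slashes.
    encoded = quote(server_relative_path, safe="/")
    return f"{site_url.rstrip('/')}/_api/web/GetFileByServerRelativeUrl('{encoded}')/$value"
```

For example, `file_api_url("https://example.sharepoint.com", "/sites/demo/Shared Documents/a.pkl")` yields a URL that an HTTP client can request with `auth_headers(token)`; `example.sharepoint.com` and the path are placeholders.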

Finally, here is the code, which I think shows that the problem is on the server side. So what @MarcoKurepa found is basically correct.

https://colab.research.google.com/drive/1LupHn1_7tQ-6K3vNSzqtBdRdJouHKPw-?usp=sharing

from office365.runtime.auth.client_credential import ClientCredential
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File

# TENANT_ID = "12f933b3-3d61-4b19-9a4d-689021de8cc9"
# site_url = f"https://{TENANT_ID}.sharepoint.com" >> not correct use for the lib
site_url = "https://uwin365.sharepoint.com"

# file_url = "/:u:/r/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/indexes.pkl"
full_url = "https://uwin365.sharepoint.com/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/indexes.pkl"
file_url = "/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/indexes.pkl"
folder_url = '/sites/cshfrg-TeamFormation/Shared%20Documents/Team%20Formation/OpeNTF0.2.0.0/data/preprocessed/dblp/dblp.v12.json/'

# Credentials redacted; never post real secrets in a public issue
client_id = "<client_id>"
client_secret = "<client_secret>"
client_credentials = ClientCredential(client_id,client_secret)

ctx = ClientContext(site_url).with_credentials(client_credentials)

with open('indexes.pkl', "wb") as local_file:
     file = ctx.web.get_file_by_server_relative_url(file_url).download(local_file).execute_query()
     # file = File.from_url(file_url).with_credentials(client_credentials).download(local_file).execute_query()
     # file = ctx.web.get_file_by_guest_url(full_url).download(local_file).execute_query()
     # file = ctx.web.get_file_by_url(full_url).download(local_file).execute_query()

In the browser with logged in history:

https://uwin365.sharepoint.com/_api/Web/getFileByServerRelativeUrl('/sites/cshfrg-TeamFormation/indexes.pkl')?$select=ServerRelativePath,Id

[Screenshot: with correct url for api]

[Screenshot: correct url directly]

hosseinfani commented 1 year ago

@3ripleM I need your help in this please

hosseinfani commented 1 year ago

@3ripleM please update your finding here. tnx.

fani-lab / OpeNTF

Dataset Retrieval #191
