SSHOC / marketplace-curation

Project to manage scripts and auxiliary data, via Python library and Jupyter notebooks, for the curation of the SSH Open Marketplace

Curating URLs #6

Closed dpancic closed 1 year ago

dpancic commented 1 year ago

In GitLab by @laureD19 on Nov 23, 2022, 10:24

This is an umbrella issue to discuss URL curation, especially the methods developed in the Python library to flag broken URLs and the examples provided in the related notebook.
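For readers unfamiliar with the approach, here is a minimal sketch of what flagging broken URLs can look like. It is not the library's actual code: `http_status`, `flag_broken_urls`, and the `url` / `http_status` item keys are hypothetical names chosen for illustration, and treating any status of 400 or above (or a failed request) as "broken" is an assumption.

```python
from typing import Callable, Iterable, List
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was obtained."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 404).
        return err.code
    except (urllib.error.URLError, ValueError, OSError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return 0


def flag_broken_urls(
    items: Iterable[dict],
    fetch: Callable[[str], int] = http_status,
) -> List[dict]:
    """Return a copy of every item whose URL looks broken, with its status attached."""
    flagged = []
    for item in items:
        status = fetch(item["url"])
        if status == 0 or status >= 400:  # assumed definition of "broken"
            flagged.append({**item, "http_status": status})
    return flagged
```

Passing the status lookup in as `fetch` makes it possible to try the flagging logic against a fixed table of statuses without touching the network. Some servers reject `HEAD` requests, so a real check may need to fall back to `GET`.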

notify @KlausIllmayer @cesareconcordia @aureon249 @kreetrapper

dpancic commented 1 year ago

In GitLab by @laureD19 on Nov 23, 2022, 10:29

Together with Martin, we've finalised the manual curation of the 69 items that were flagged with a URL issue the last time we ran the notebook.

We wanted to run it again, both to reflect the changes we made manually to the flags and to identify any new items with URL problems, but we ran into an error with the URLCheck() and checkURLValues() functions that we are not able to solve on our own.

Here is a screenshot of the error: Screenshot_2022-11-22_at_09.42.29

@cesareconcordia do you think that is something you could look into to help us, please?

dpancic commented 1 year ago

In GitLab by @cesareconcordia on Nov 23, 2022, 10:48

I'll look at this in the afternoon, will let you know asap

dpancic commented 1 year ago

In GitLab by @cesareconcordia on Nov 30, 2022, 11:12

Unfortunately I cannot reproduce the error... I've added a new control to the CheckURLValues() that should give more information about the error. When you have time:

let me know.

laureD19 commented 1 year ago

It works partially. After a few more tests, it seems that the datasets category causes the error, but not the other categories. Could you also try it out on your side for datasets, @cesareconcordia?

I wrote the results back to the MP for the other item categories. 11 new URL flags were raised and need manual curation.
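One way to narrow an error like this down to a single category is to run the check one category at a time and collect the failures instead of stopping at the first one. The sketch below assumes a hypothetical `check` callable that takes a category name; it does not reproduce the signature of the library's functions.

```python
import traceback
from typing import Callable, Dict, Iterable, Tuple


def run_per_category(
    categories: Iterable[str],
    check: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run check on each category; return (results, failures keyed by category)."""
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for category in categories:
        try:
            results[category] = check(category)
        except Exception:
            # Keep the full traceback so it can be pasted into the issue.
            failures[category] = traceback.format_exc()
    return results, failures
```

The keys of `failures` then show which categories raise, and each stored traceback can be shared as text rather than as a screenshot.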

An additional question relates to the un-flagging of manually corrected items. From what you were explaining, @cesareconcordia, I thought the setHTTPStatusFlags method would also unraise the URL flag of items that have been manually corrected in the meantime, but that doesn't seem to be the case. Do I need to use another function?
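To make the expected behaviour concrete, here is a rough sketch of flag handling that both raises and clears flags by recomputing them from a fresh status check. It does not describe what setHTTPStatusFlags actually does: `refresh_url_flags`, the `fetch` callable, and the `id` / `url` / `url_flag` keys are all hypothetical.

```python
from typing import Callable, List, Tuple


def refresh_url_flags(
    items: List[dict],
    fetch: Callable[[str], int],
) -> Tuple[list, list]:
    """Recompute each item's URL flag; return (newly raised ids, newly cleared ids)."""
    raised, cleared = [], []
    for item in items:
        status = fetch(item["url"])
        broken = status == 0 or status >= 400  # assumed definition of "broken"
        was_flagged = bool(item.get("url_flag"))
        if broken and not was_flagged:
            raised.append(item["id"])
        elif not broken and was_flagged:
            # The URL was corrected since the last run, so the flag is dropped.
            cleared.append(item["id"])
        item["url_flag"] = broken
    return raised, cleared
```

Because the flag is overwritten on every run rather than only ever set, an item whose URL has been fixed manually loses its flag the next time the check runs.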

notify also @aureon249

cesareconcordia commented 1 year ago

Hi @laureD19: the problem you're having when checking URLs for dataset items should be fixed now. Please get the code from the repository and run the notebook, and let me know if it still persists.

I'll check the behaviour of setHTTPStatusFlags to understand why it does not unraise URL-flags.

aureon249 commented 1 year ago

Hi, I fetched the code from the dev branch and ran the URL validity check (1.1) in the notebook for all five categories on stage. The error mentioned above did not reappear.

laureD19 commented 1 year ago

Hi @cesareconcordia and @aureon249 !

I've tried out the URL checks and it now works with all categories. Thx!

Regarding the unflagging of items, my 2cts: