MassIVE download links are still FTP

fabianegli commented 2 years ago

It seems that MassIVE does not provide download links for its resources over HTTPS, but it is still possible to download resources over HTTPS from the website (e.g. MSV000080374)

It is not clear to me if the content of the MassIVE archive can be downloaded with non-query URLs. Does someone know if they exist and if so how to compose such links, or should we contact them directly to ask for https URI download link support?

A current example download link from :

https://massive.ucsd.edu/ProteoSAFe/DownloadResultFile?file=f.MSV000080374/ccms_parameters/params.xml&forceDownload=true

and the corresponding FTP link

ftp://massive.ucsd.edu/MSV000080374/ccms_parameters/params.xml

This file concerns the following annotation file: annotated-projects/PMID28625833/PMID28625833.sdrf.tsv

ypriverol commented 2 years ago

As I mentioned before, we should make the guidelines as simple as possible. I think, the name of the field is URI which means that can be FTP or https. This remove the problem of knowing what protocol, if is supported by the provider etc. The client knows that is a URI, it can be http, https, ftp, etc.

fabianegli commented 2 years ago

On one side, I agree that the SDRF standard is and should be permissive to any URI format and on the other I think we should try and nudge the community to provide safe download resources directly from these URIs.

What do I mean by this? I think there should be, if not a red light at leas an orange one if some data provider does not use safe protocols to distribute data and if this is not the case, we should inform the SDRF users and ideally approach the data provider to ask for an upgrade of their service.

This could for example be done by adding a check in the CI, but not blocking the merge if it fails. It would still raise a flag so that the authors can react. This would be desirable because someone might use the ftp links for ebi archive resources even though https links could be used instead.

ypriverol commented 2 years ago

@fabianegli a file format tries to be agnostic as much as possible from the implementation. It is a compromise between implementation/modeling-design. If a provider wants to distribute the data in http they should be able to do it in the format. The users and clients of the service should be in charge of requesting more secure protocols, not de format.

In the CI/CD we are now validating a lot of things including, the SDRF, the MAGE-TAB and inside the SDRF even the ontology terms. I think validating that the URLs are available, or they exist, will be a lot for the CI/CD.

However, I do think in the sdrf-pipelines we can check if is a proper URI or valid. We don't validate the URIs, if the user add an string the parser will not complain. This should be validated in the tool.

fabianegli commented 2 years ago

If a provider wants to distribute the data in http they should be able to do it in the format. The users and clients of the service should be in charge of requesting more secure protocols, not de format.

In the CI/CD we are now validating a lot of things including, the SDRF, the MAGE-TAB and inside the SDRF even the ontology terms. I think validating that the URLs are available, or they exist, will be a lot for the CI/CD.

I totally agree. What I propose is not enforcement of a protocol but informing a person and the repo maintainers (us) about the presence of URIs with insecure protocols. I think of it as a service to submitters and maintainers. I am not even sure if I would want to put that into the sdrf-pipelines - until now I was thinking of a very simple bash script checking if the strings http:// or ftp:// are present in the SDRF file. That would add add a minuscule amount of computational burden to the CI.

However, I do think in the sdrf-pipelines we can check if is a proper URI or valid. We don't validate the URIs, if the user add an string the parser will not complain. This should be validated in the tool.

I think this can be easily done within the sdrf-pipelines validate command with the following urllib utilities:

from urllib.request import urlopen
from urllib.parse import urlparse

uri = "https://ftp.ebi.ac.uk/pride-archive/2019/11/PXD012986/QExHF04048.raw"

# check for uri validity
parsed_url = urlparse(uri)
if all([parsed_url.scheme, parsed_url.netloc, parsed_url.path]):
    print(f"This is a valid URI: {uri}")

# check for availability
if urlopen(uri).getcode() == 200:
    print(f"This resource is reachable on the internet: {uri}")

But I am not sure if our CI should send requests to all resources URIs. This might lead to the violation of TOS of data providers in some cases. At least the urlopen() doesn't download the resource and thus limits the traffic but sees if the server returns a status code 200 indicating that the resource is actually available at the given URI.

ypriverol commented 2 years ago

@fabianegli I think it should be in the sdrf-pipelines as you proposed. But not in the CI. Currently, we have more than 10K urls. Checking that can block our CI/CD from GitHub.

ypriverol commented 2 years ago

Should we close this issue @fabianegli

fabianegli commented 2 years ago

I think the uri validation and availability checks should have a separate issue anyways.

bigbio / proteomics-sample-metadata

MassIVE download links are still FTP #651