Unable to download data using the sample manifest given in this repo

diyabasu97 commented 3 months ago

Hi @AhmedElsherbini. I came across your repo while looking for a way to download datasets from the HMP database. However, I get an error when I try to download the data in the example_manifest.tsv that you provided. Here is the command I ran:

$ python3 download_urls.py -i example_manifest.tsv
File number 1 HMP2_J09154_1_NS_T0_B0_0120_ZYHHR4Z-1023_ANBEN.biom
Erorr in file HMP2_J09154_1_NS_T0_B0_0120_ZYHHR4Z-1023_ANBEN.biom
File number 2 HMP2_J09154_1_ST_T0_B0_0120_ZYHHR4Z-1023_ADM3N.biom
Erorr in file HMP2_J09154_1_ST_T0_B0_0120_ZYHHR4Z-1023_ADM3N.biom
Finally finished!

I have been trying to download a dataset from HMP, but nothing seems to work for me. Could you please help me with this? Thanks in advance.

Regards, Diya.

AhmedElsherbini commented 3 months ago

Hello Diya.

Thanks for contacting me.

Indeed, I now see the error as well, which is strange because it was working well recently. I am working on a solution and will let you know.

Best, Ahmed

AhmedElsherbini commented 3 months ago

Hi again,

Indeed, the problem stems from the portal itself. Some files cannot be downloaded at all, whether manually or through the browser (example1), while others can still be downloaded normally (example2).

My explanation: as more data comes in, they are modifying their strategy (compressing some files, using Amazon S3, ...). Maybe it is just a temporary problem. Just a guess!

I updated the script and provided a new manifest file; hopefully this works for you.

To check a current manifest, randomly pick a few of its files. You then have two options:

1- Manually, on the website itself, as in example2: try the manual download button for an individual file. If it works, that is a good sign.

2- Copy and paste the link (https://.............bz2) into your browser. If the file can be downloaded, that is a good sign.

Let me know how it goes.

Best, Ahmed
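
The spot check described above can also be scripted. The following is a minimal Python sketch and not part of download_urls.py. It assumes the manifest is tab-separated with a link column named `urls` (an assumed name; adjust it to your file), and the helper names `https_urls`, `probe`, and `probe_sample` are hypothetical. It sends a HEAD request, so nothing is downloaded; some servers may answer HEAD differently from GET, so treat the result as a hint only.

```python
import csv
import io
import random
import urllib.error
import urllib.request


def https_urls(manifest_text, url_column="urls"):
    """Return the first https:// link of every row in a tab-separated manifest.

    The column name "urls" is an assumption; change it to match your file.
    """
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    found = []
    for row in reader:
        links = [u.strip() for u in (row.get(url_column) or "").split(",")]
        https = [u for u in links if u.startswith("https://")]
        if https:
            found.append(https[0])
    return found


def probe(url, timeout=30):
    """Send a HEAD request and return the HTTP status code (e.g. 200 or 403).

    Network-level failures (DNS, timeouts) are not caught here and will raise.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def probe_sample(urls, k=5, seed=0, probe_fn=probe):
    """Probe up to k randomly chosen URLs and return {url: status}."""
    rng = random.Random(seed)
    chosen = rng.sample(urls, min(k, len(urls)))
    return {url: probe_fn(url) for url in chosen}
```

For example, calling `probe_sample(https_urls(open("example_manifest.tsv").read()))` would report a status code for up to five random links. A 200 suggests the file is downloadable, while a 403 matches the failure seen in this thread.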

diyabasu97 commented 3 months ago

Hi Ahmed,

Thank you for the quick response. As you mentioned, for some files I am able to download manually from the browser and can also use the wget command. For other files, however, the download button does not work, and both pasting the link into the browser and running wget give a 403 Forbidden error.

$ wget "https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmmrc/v1/SRS011271_vs_KEGG_v54.tar.bz2"
--2024-04-15 17:57:22-- https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmmrc/v1/SRS011271_vs_KEGG_v54.tar.bz2
Resolving downloads.hmpdacc.org (downloads.hmpdacc.org)... 134.192.156.26, 64:ff9b::86c0:9c1a
Connecting to downloads.hmpdacc.org (downloads.hmpdacc.org)|134.192.156.26|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-04-15 17:57:23 ERROR 403: Forbidden.

So do you think that in this case the problem is with the file itself and there is no way to get it? Also, how do you suggest I look for data, given that manually opening every file would be tedious? For example, if I want the abundance matrix data for the HHS study, there are over 2000 files, so manually opening and checking each link is not feasible. Do you have any suggestion on how I should proceed to get the data?

Any help would be appreciated. Thanks a lot.

Regards, Diya

diyabasu97 commented 3 months ago

Hi again, I have one question. Are we unable to access those links because they are broken, or because of a permission issue? When I copy and paste the URL into the browser, it says "403 Forbidden - You don't have permission to access". If it is a permission issue, I am not sure how to get that permission. Also, their website states that "All HMP data and resources are freely available for browsing and download", so I am confused about what the issue is with some of those links. Can you please let me know if you have any idea about this? Thanks again.
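
On the broken-link versus permission question: in HTTP, a 403 means the server understood the request and refuses to serve it, whereas a 404 means it found nothing at that address. The Python sketch below only illustrates that distinction; `describe_status` is a hypothetical helper, and the wording of each message is an interpretation. Note that some servers also answer 403 for paths that do not exist, so a 403 alone does not prove the file is still there.

```python
from http import HTTPStatus


def describe_status(code):
    """Map an HTTP status code to a rough diagnosis (illustrative only)."""
    if 200 <= code < 300:
        return "ok: the file can be downloaded"
    if code in (HTTPStatus.UNAUTHORIZED, HTTPStatus.FORBIDDEN):
        return "permission: the server understood the request but refuses access"
    if code in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        return "missing: the server has no file at this address"
    if 500 <= code < 600:
        return "server error: possibly temporary, retry later"
    return "other: HTTP %d" % code
```

Applied to the wget output in this thread, `describe_status(403)` falls in the permission branch, which points at the server's access rules and not at a mistyped link.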

AhmedElsherbini commented 3 months ago

I re-updated my script a few seconds ago. With your input manifest, it downloads the files that are available. It then writes two more TSV files: a manifest of the successfully downloaded files and a separate manifest of the failed files. You can use the failed manifest to download your data once these issues are addressed; I suppose it could be something temporary.

Just let me know if everything goes well for you :)

diyabasu97 commented 3 months ago

Hi Ahmed, I really appreciate your patience and the time you have spent helping me. Thanks for updating the script. This will be useful for keeping track of which files downloaded successfully and which failed, and it seems like a quick workaround for now. However, I noticed that all the files are listed in both successful_manifest.tsv and failed_manifest.tsv, whereas ideally only the successful files should appear in successful_manifest.tsv and only the failed ones in failed_manifest.tsv. Thanks for your help again.

Regards, Diya
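
The intended behaviour described above, where each file ends up in exactly one of the two manifests, could look like the following Python sketch. It is not the code of download_urls.py: `split_manifest` and `write_manifest` are hypothetical helpers, the `urls` column name is an assumption, and `download` stands for any function that returns a true value on success and a false value (or raises) on failure.

```python
import csv


def split_manifest(rows, download, url_column="urls"):
    """Partition manifest rows into (succeeded, failed).

    Each row is appended to exactly one list, so a file can never be
    listed in both output manifests.
    """
    succeeded, failed = [], []
    for row in rows:
        try:
            ok = bool(download(row[url_column]))
        except Exception:
            # Any error during download counts as a failure for this row.
            ok = False
        (succeeded if ok else failed).append(row)
    return succeeded, failed


def write_manifest(path, rows, fieldnames):
    """Write rows as a tab-separated manifest with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

The key design choice is the single `if/else` append per row: the result of one download attempt decides which list the row goes to, and no row is written before its outcome is known.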

diyabasu97 commented 3 months ago

Hi again, I noticed that in the example_manifest file, the URL column also contains the s3 bucket link and the ftp link. Do you know how I can access the files using ftp or the s3 bucket? When I try to access the s3 bucket, it asks me for an access key. Let me know if you have any idea about this. Thank you.

Regards, Diya

AhmedElsherbini commented 3 months ago

Hi Diya,

To be honest, AWS S3 is new to me and I am not familiar with it. I will see if I can add it to my code, as HTTPS was the main focus.

Out of the 2000 files, how many can you download with HTTPS? I see you also tried HMP_client, which offers s3/ftps in addition to HTTPS. Did you download the same number of files with it?

diyabasu97 commented 3 months ago

Hi Ahmed,

I used the s3 links given in your manifest file and was able to download the files successfully using the portal client. I did not run all 2000 files through the https links and your Python script, because the first 400 or so files failed, so I did not proceed further. I was actually trying to get all the abundance matrix files for the HHS study from the HMP database.

I have one question: where did you get the s3 links for your files? I am unable to find them. For example, how do I get the s3 link for this file? https://portal.hmpdacc.org/files/1670203039de370df9a35a04373371e1

If I can get the s3 links for my files, I plan to download them that way, since https is not working.

Regards, Diya

AhmedElsherbini commented 3 months ago

Hi Diya,

Makes sense.

I picked my example files randomly from the website.

To get the s3 link: in the normal manifest file, the URL column contains a long string of links separated by commas (https, ftp, s3).

Since the portal client should pick the s3 link from that column automatically, I suppose you do not need to change anything in the manifest file.

Best, Ahmed
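
To make the comma-separated URL column concrete, here is a small Python sketch that splits such a value by scheme. It is only an illustration: `links_by_scheme` and `pick_link` are hypothetical helpers, the preference order is an arbitrary choice, and the URLs in the usage note are placeholders. How the portal client itself selects a link is not shown in this thread.

```python
from urllib.parse import urlparse


def links_by_scheme(url_field):
    """Split a comma-separated URL field into {scheme: first link with that scheme}."""
    links = {}
    for link in url_field.split(","):
        link = link.strip()
        if link:
            links.setdefault(urlparse(link).scheme, link)
    return links


def pick_link(url_field, preferred=("s3", "https", "ftp")):
    """Return the first available link following the preferred scheme order."""
    links = links_by_scheme(url_field)
    for scheme in preferred:
        if scheme in links:
            return links[scheme]
    return None
```

With a field such as `"https://example.org/f.bz2,ftp://example.org/f.bz2,s3://example-bucket/f.bz2"`, `pick_link` returns the s3 entry; with an https-only field, as in the sample manifest discussed here, it falls back to the https link.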

diyabasu97 commented 3 months ago

Hi Ahmed,

Oh, I see. So for your files, the s3 links were already in the manifest file. For mine, only the https links were there and no s3 links, so I thought you might have taken the s3 links from somewhere manually and added them. Here is a sample manifest file that I was using, for your reference: sample_manifest.txt

So if I can locate the s3 links for the different file types, I should be able to access the files. Thanks for your help and support.

Regards, Diya

AhmedElsherbini commented 3 months ago

That seems good, good luck then :)

AhmedElsherbini / download_hmp_data
