Merck / bgc-pipeline

MIT License

Error: Failed to pull data from the cloud: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied #2

Open mkoohim opened 3 years ago

mkoohim commented 3 years ago

Hello, I tried to pull the DVC files. I configured AWS with my own access key ID and secret access key, and I set an S3 policy that allows s3:ListBucket, but I still get the following error when I run dvc pull:

Error: Failed to pull data from the cloud: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

I think you need to change some settings on your S3 bucket to make it publicly accessible. Please see the following video for more information: https://www.youtube.com/watch?v=_dOBPpeBAxs

Could you please check this and advise how we can solve the problem?

Regards, Mohamad

prihoda commented 3 years ago

Hi @mkoohim, the DVC files cannot be used to download the actual files; they are only here to document the commands used. If you want to retrain DeepBGC on new BGC data, please refer to the DeepBGC repository: https://github.com/Merck/deepbgc#train-deepbgc-on-your-own-data

Training and validation data can be downloaded from release 0.1.0 and release 0.1.5.

mkoohim commented 3 years ago

Hello. Thanks for your reply. May I ask if there is any way to access the corpus data? https://github.com/Merck/bgc-pipeline/tree/main/data/bacteria/corpus

I couldn't find it in the previous versions.

Thanks,

prihoda commented 3 years ago

Hi @mkoohim, I added the missing file to the 0.1.0 release: https://github.com/Merck/deepbgc/releases/tag/v0.1.0

Please keep in mind that the domains in it were detected using Pfam 31.0, whereas the current Pfam version is 34.0. To generate an updated corpus, you would have to run HMMSCAN using Pfam 34.0 on thousands of genomes, which would be very computationally intensive. We might do it at some point in the future, but not in the near term.
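
For anyone who wants to attempt this themselves, here is a minimal sketch of how the HMMSCAN step might be scripted over many genomes. It only builds the command lines and does not run them. The file names (`Pfam-A.hmm`, the `.faa` protein FASTA files, the `domtbl` output directory) and the helper names are assumptions for illustration, not the pipeline's actual layout or options. The Pfam database has to be prepared with `hmmpress` once before `hmmscan` can use it.

```python
import os
import shlex
from pathlib import Path


def build_hmmscan_command(pfam_hmm, proteins_fasta, out_domtbl, cpus=4):
    """Return the hmmscan argument list for one protein FASTA file.

    Hypothetical helper: the paths are placeholders, not the real pipeline layout.
    """
    return [
        "hmmscan",
        "--domtblout", str(out_domtbl),  # per-domain hits as a parseable table
        "--cpu", str(cpus),              # worker threads for this run
        "-o", os.devnull,                # discard the verbose human-readable report
        str(pfam_hmm),                   # pressed Pfam HMM database
        str(proteins_fasta),             # protein sequences of one genome
    ]


def plan_hmmscan_runs(pfam_hmm, fasta_paths, out_dir, cpus=4):
    """Build one hmmscan command per genome, writing <genome>.domtbl into out_dir."""
    out_dir = Path(out_dir)
    commands = []
    for fasta in fasta_paths:
        out_domtbl = out_dir / (Path(fasta).stem + ".domtbl")
        commands.append(build_hmmscan_command(pfam_hmm, fasta, out_domtbl, cpus))
    return commands


if __name__ == "__main__":
    # Assumed inputs: two placeholder genomes with predicted proteins in FASTA format.
    for cmd in plan_hmmscan_runs("Pfam-A.hmm", ["genome1.faa", "genome2.faa"], "domtbl"):
        print(shlex.join(cmd))
```

The printed commands can be passed to a job scheduler or to `subprocess.run`, which is useful because each genome is independent and the runs parallelize trivially.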