S3-based data distribution

eparejatobes commented 9 years ago

We need to have this in a minimally organized way.

eparejatobes commented 9 years ago

Just so I don't forget later: to avoid paying a lot because people download TBs of data from who knows where, we have two options:

Requester-pays buckets: simple to set up, not so simple to use (for everyone else; OK for us).
A funny policy based on EC2 CIDR blocks: straightforward to use, kind of involved to set up and test. See also AWS IP address ranges.
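
A minimal sketch of the second option, assuming the policy is generated from data shaped like AWS's published `ip-ranges.json`. The bucket name `example-bucket`, the helper names, and the sample CIDR blocks (documentation placeholders, not real EC2 ranges) are all hypothetical; the thread does not say how the policy would actually be built.

```python
import json

# Hypothetical excerpt in the shape of AWS's ip-ranges.json.
# The CIDR blocks are documentation placeholders, not real EC2 ranges.
SAMPLE_IP_RANGES = {
    "prefixes": [
        {"ip_prefix": "192.0.2.0/24", "region": "eu-west-1", "service": "EC2"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "EC2"},
        {"ip_prefix": "203.0.113.0/24", "region": "eu-west-1", "service": "AMAZON"},
    ]
}


def ec2_cidr_blocks(ip_ranges, region):
    """Return the sorted EC2 CIDR blocks listed for one region."""
    return sorted(
        p["ip_prefix"]
        for p in ip_ranges["prefixes"]
        if p["service"] == "EC2" and p["region"] == region
    )


def read_only_policy(bucket, cidr_blocks):
    """Build a bucket policy allowing s3:GetObject only from the given CIDR blocks."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadFromEC2",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::%s/*" % bucket,
                "Condition": {"IpAddress": {"aws:SourceIp": cidr_blocks}},
            }
        ],
    }


# "example-bucket" is a placeholder name used only for illustration.
policy = read_only_policy(
    "example-bucket", ec2_cidr_blocks(SAMPLE_IP_RANGES, "eu-west-1")
)
print(json.dumps(policy, indent=2))
```

The published ranges change over time, so a policy like this would have to be regenerated periodically, which is part of why it is more involved to set up and test.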

eparejatobes commented 9 years ago

I set the releases.bio4j.com bucket to requester-pays, with public read-only permissions. Right now I'm copying files from that bucket to another one in the EU, releases.eu.bio4j.com. This should be solved as part of the import process.

eparejatobes commented 9 years ago

OK, those buckets were in the wrong account. Right now we have

eu-west-1.releases.bio4j.com
eu-west-1.raw.bio4j.com

cc @pablopareja

eparejatobes commented 9 years ago

We should expand this and move it to the docs after it stabilizes.

Data import process overview

get raw stuff

retrieve raw data from the data sources (UniProt FTP, etc.) from an EC2 instance with good networking; a hi1.4xlarge spot instance in the EU looks good.
extract if needed, store it in eu-west-1.raw.bio4j.com scoped by data source name and version/date

This bucket will have convenient lifecycle settings so that old raw data gets pushed to Glacier; we will also use reduced redundancy, etc. With respect to access, it is public-read with requester pays.
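
As a rough illustration of the lifecycle settings mentioned above, here is a sketch that builds a rule moving objects to Glacier. The dictionary follows the general shape of an S3 lifecycle configuration; the 90-day threshold, the rule ID, and the function name are assumptions, since no concrete values are given here.

```python
def glacier_lifecycle(days_before_archive, prefix=""):
    """Build a lifecycle configuration that transitions objects to Glacier.

    An empty prefix applies the rule to every object in the bucket.
    """
    if days_before_archive < 0:
        raise ValueError("days_before_archive must be non-negative")
    return {
        "Rules": [
            {
                "ID": "archive-old-raw-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_before_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# 90 days is an illustrative value, not one taken from this thread.
lifecycle = glacier_lifecycle(90)
```

With boto3, a dictionary like this could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Reduced redundancy is a storage class chosen when an object is uploaded, so it would be set separately from this rule.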

run import for module xxx

get the binaries for the database from eu-west-1.releases.bio4j.com
get whatever raw data you need from eu-west-1.raw.bio4j.com
run the import process, push the generated binaries to eu-west-1.releases.bio4j.com

Same as for raw: eu-west-1.releases.bio4j.com has archiving, public-read with requester pays, etc.
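
For the "not so simple to use" side of requester pays, here is a hedged sketch of what a download from one of these buckets could look like. The function name is hypothetical, and `s3_client` is assumed to behave like a boto3 S3 client; the caller needs AWS credentials, because requester-pays requests cannot be anonymous.

```python
def download_requester_pays(s3_client, bucket, key, destination):
    """Download one object from a requester-pays bucket.

    s3_client is assumed to behave like boto3.client("s3"). The
    RequestPayer argument acknowledges that the caller pays for the
    request and the data transfer.
    """
    s3_client.download_file(
        bucket, key, destination, ExtraArgs={"RequestPayer": "requester"}
    )
    return destination
```

A call would look like `download_requester_pays(boto3.client("s3"), "eu-west-1.raw.bio4j.com", key, local_path)`, where `key` and `local_path` are whatever the import for a given module needs.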

eparejatobes commented 9 years ago

ping @pablopareja

pablopareja commented 9 years ago

What do you mean by

scoped by data source name and version/date

:question:

eparejatobes commented 9 years ago

that the name of the S3 object (including the prefix) should contain that information

pablopareja commented 9 years ago

something like:

bio4j_0_12_enzyme_12_03_2014.dat

:question:

eparejatobes commented 9 years ago

Well, the first part should not be there; the file itself has nothing to do with Bio4j. I'd

put first a prefix which uniquely identifies the resource (like enzymedb in this case, I guess)
put the date, etc. in the prefix
keep the same file name as the source uses

pablopareja commented 9 years ago

enzymedb_12_03_2014_enzyme.dat

under the folder

s3://eu-west-1.raw.bio4j.com/bio4j_0_12/

:question:

laughedelic commented 9 years ago

I think bio4j_0_12/ shouldn't be there (and by the way, what about semantic versioning?). So

s3://eu-west-1.raw.bio4j.com/enzymedb/12_03_2014/<the_original_file_name>.dat

pablopareja commented 9 years ago

I couldn't care less about it... :smiley: Please let me know the best option so that I can start uploading them :space_invader:

eparejatobes commented 9 years ago

s3://eu-west-1.raw.bio4j.com/<resource-scope>/yyyy-mm-dd/<the_original_file_name>

pablopareja commented 9 years ago

but what's <resource-scope> :question: Could you please write what it would exactly be for this case?

eparejatobes commented 9 years ago

enzymedb
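
To make the naming scheme concrete, here is a small sketch that builds the object URI from its three parts. The function name is hypothetical, and the date assumes that 12_03_2014 in the earlier example means 12 March 2014.

```python
from datetime import date

RAW_BUCKET = "eu-west-1.raw.bio4j.com"


def raw_data_uri(resource_scope, release_date, original_file_name):
    """Build s3://<bucket>/<resource-scope>/yyyy-mm-dd/<the_original_file_name>."""
    if not resource_scope or "/" in resource_scope:
        raise ValueError("resource_scope must be a single non-empty path segment")
    return "s3://%s/%s/%s/%s" % (
        RAW_BUCKET,
        resource_scope,
        release_date.isoformat(),  # date.isoformat() yields yyyy-mm-dd
        original_file_name,
    )


uri = raw_data_uri("enzymedb", date(2014, 3, 12), "enzyme.dat")
print(uri)  # s3://eu-west-1.raw.bio4j.com/enzymedb/2014-03-12/enzyme.dat
```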

pablopareja commented 9 years ago

So in the end there's no trace of the Bio4j version where it's used, right?

eparejatobes commented 9 years ago

of course not! :hamburger:

laughedelic commented 8 years ago

This is done. Closing
