bio4j / bio4j-titan

Titan-specific bio4j implementation
https://github.com/bio4j/bio4j

S3 based data distribution #45

Closed eparejatobes closed 8 years ago

eparejatobes commented 9 years ago

We need to have this in a minimally organized way.

eparejatobes commented 9 years ago

Just so I don't forget later, to avoid paying a lot due to people downloading TBs of data from who knows where, we have two options

eparejatobes commented 9 years ago

I set the releases.bio4j.com bucket to requester-pays, with public read-only permissions. Right now I'm copying files from that bucket to another one in the EU, releases.eu.bio4j.com. This should be solved as part of the import process.

eparejatobes commented 9 years ago

OK, those buckets were in the wrong account. Right now we have

  1. eu-west-1.releases.bio4j.com
  2. eu-west-1.raw.bio4j.com

cc @pablopareja

eparejatobes commented 9 years ago

we should expand this, move to docs after it stabilizes

Data import process overview

get raw stuff
  1. retrieve raw data from data sources (UniProt FTP etc.) from an EC2 instance with good networking; a hi1.4xlarge spot instance in the EU looks good.
  2. extract if needed, then store it in eu-west-1.raw.bio4j.com scoped by data source name and version/date

This bucket will have lifecycle settings so that old raw data gets pushed to Glacier, objects are stored with reduced redundancy, etc. With respect to access, it is public-read with requester pays.
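
A rough sketch of what the Glacier part of those lifecycle settings could look like, written as a plain Python dict. The rule name, the 90-day threshold and the helper name `glacier_lifecycle` are assumptions for illustration, nothing here is decided in this thread.

```python
def glacier_lifecycle(days_before_archive, prefix=""):
    """Build an S3 lifecycle configuration that moves objects to Glacier.

    days_before_archive: object age (in days) after which it is archived.
    prefix: only objects under this key prefix are affected;
            the empty string matches every object in the bucket.
    """
    return {
        "Rules": [
            {
                "ID": "archive-old-raw-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_before_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# 90 days is an assumed threshold, not a value agreed on here.
config = glacier_lifecycle(90)
```

A dict of this shape is what boto3's `put_bucket_lifecycle_configuration` takes as its `LifecycleConfiguration` argument, so it could be applied to eu-west-1.raw.bio4j.com once the actual thresholds are settled.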

run import for module xxx
  1. get the binaries for the database from eu-west-1.releases.bio4j.com
  2. get the raw data you need from eu-west-1.raw.bio4j.com
  3. run the import process, push the generated binaries to eu-west-1.releases.bio4j.com

Same as for raw, eu-west-1.releases.bio4j.com has archiving, public-read requester-pays etc.
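
Since both buckets are requester-pays, every download has to say explicitly that the caller accepts the charges. A minimal sketch of the request parameters, assuming boto3 is the client; the helper name `requester_pays_get` and the example object key are hypothetical.

```python
RAW_BUCKET = "eu-west-1.raw.bio4j.com"
RELEASES_BUCKET = "eu-west-1.releases.bio4j.com"


def requester_pays_get(bucket, key):
    """Keyword arguments for an S3 GetObject request billed to the caller."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


# The object key below is only an illustration.
args = requester_pays_get(RAW_BUCKET, "enzymedb/12_03_2014/enzyme.dat")

# With boto3 installed and AWS credentials configured, this could be used as:
#   import boto3
#   response = boto3.client("s3").get_object(**args)
```

Requests to a requester-pays bucket have to be authenticated, so "public-read" here still means the downloader needs an AWS account.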

eparejatobes commented 9 years ago

ping @pablopareja

pablopareja commented 9 years ago

What do you mean by

scoped by data source name and version/date

:question:

eparejatobes commented 9 years ago

that the name of the S3 object (including the prefix) should contain that information

pablopareja commented 9 years ago

something like:

bio4j_0_12_enzyme_12_03_2014.dat

:question:

eparejatobes commented 9 years ago

well the first part should not be there, this has nothing to do with Bio4j (I mean the file itself). I'd

pablopareja commented 9 years ago

enzymedb_12_03_2014_enzyme.dat

under the folder

s3://eu-west-1.raw.bio4j.com/bio4j_0_12/

:question:

laughedelic commented 9 years ago

I think, bio4j_0_12/ shouldn't be there (and by the way, what about semantic versioning?). So

s3://eu-west-1.raw.bio4j.com/enzymedb/12_03_2014/<the_original_file_name>.dat
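
To make that layout concrete, here is a small sketch that builds the key from its three parts. The function names are hypothetical, and `enzyme.dat` is just the file name used in the examples above.

```python
def raw_object_key(source, version, filename):
    """Build an S3 key of the form <source>/<version>/<original file name>."""
    parts = [source, version, filename]
    for part in parts:
        # Each part is a single path segment, so it must be non-empty
        # and must not contain the key delimiter itself.
        if not part or "/" in part:
            raise ValueError("invalid key segment: %r" % (part,))
    return "/".join(parts)


def raw_object_uri(source, version, filename, bucket="eu-west-1.raw.bio4j.com"):
    """Full s3:// URI for a raw data file in the given bucket."""
    return "s3://%s/%s" % (bucket, raw_object_key(source, version, filename))


uri = raw_object_uri("enzymedb", "12_03_2014", "enzyme.dat")
```

Nothing in the key refers to a Bio4j version; only the data source and its version/date appear.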

pablopareja commented 9 years ago

I couldn't care less about it... :smiley: Please let me know the best option so that I can start uploading them :space_invader:

eparejatobes commented 9 years ago

pablopareja commented 9 years ago

but what's <resource-scope> :question: Could you please write what it would exactly be for this case?

eparejatobes commented 9 years ago

enzymedb

pablopareja commented 9 years ago

so no trace in the end about the bio4j version where it's used right?

eparejatobes commented 9 years ago

of course not! :hamburger:

laughedelic commented 8 years ago

This is done. Closing