DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Replace SDB with S3 in AWS job store #964

Open · hannes-ucsc opened this issue 8 years ago

hannes-ucsc commented 8 years ago

Quoting an AWS support professional in case 1767267511:

I would recommend seeing if you would consider DynamoDB to replace your SimpleDB solution. DynamoDB is essentially the successor of SimpleDB, which is slowly being pulled out from active development. In fact, we're no longer offering that to new customers at this point.

If Toil should run on newly opened AWS accounts, we need to phase out SimpleDB.

I propose that we create a new, second implementation of the AWS job store that uses DynamoDB. The new implementation should be accessible under the aws job store locator, while the old one becomes aws_old.

The reason I didn't use DynamoDB in the first place was the payment model, which is based on a flat rate as a function of a configurable ("provisioned" in Amazon lingo) request volume. Toil would have to set that request volume to a user-specified value (with a sensible default) before a workflow starts and make sure to configure it back down to the lowest possible value on exit.
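For illustration, a minimal sketch of that provisioning dance with boto3, assuming a hypothetical table name and made-up capacity numbers:

```python
# Minimal sketch (not Toil code): scale provisioned throughput up before a
# workflow and back down afterwards. Table name and numbers are hypothetical.
import boto3

dynamodb = boto3.client("dynamodb")

def set_throughput(table_name, read_units, write_units):
    """Set the table's provisioned read/write capacity."""
    dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )

# Before the workflow starts: scale up to the user-specified request volume.
set_throughput("toil-jobstore", read_units=100, write_units=100)
try:
    pass  # ... run the workflow ...
finally:
    # On exit: configure it back to the lowest possible value.
    set_throughput("toil-jobstore", read_units=1, write_units=1)
```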

Issue is synchronized with this Jira Story (friendlyId: TOIL-350)

cket commented 7 years ago

We should keep an eye on this; if they start deprecating SDB, we need to start on a Dynamo job store replacement.

abatilo commented 3 years ago

Any chance that this can be revisited?

DailyDreaming commented 3 years ago

@abatilo S3 is now strongly consistent, so this issue is now about replacing SDB with S3. This will probably be worked on relatively soon, actually (sometime in the next few months).

abatilo commented 3 years ago

Would it be a big lift? I would be curious to know if I could help.

DailyDreaming commented 3 years ago

@abatilo Medium-sized, I would guess? It still needs to be explored.

Most of the work would involve removing the current SDB functionality, identifying everything it's shuttling back and forth (primarily items with job attributes, representing jobs to be processed), and then remapping those to fetch/put files in S3 instead. Jobs would map to job files in S3: the presence of a file signifies a job yet to be run, and a job that has finished should no longer have a file. Most of the work will be in https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py.
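Concretely, the mapping could look something like this sketch (bucket name, key layout, and pickle serialization are assumptions for illustration, not the actual jobStore.py design):

```python
# Sketch of the proposed job <-> S3 object mapping. Bucket name, key layout,
# and use of pickle are illustrative assumptions.
import pickle
import boto3

s3 = boto3.client("s3")
BUCKET = "my-toil-jobstore"  # hypothetical: derived from the jobstore name

def job_key(job_store_id):
    return f"jobs/{job_store_id}"

def create_job(job_store_id, job):
    # The presence of this object signifies a job yet to be run.
    s3.put_object(Bucket=BUCKET, Key=job_key(job_store_id),
                  Body=pickle.dumps(job))

def complete_job(job_store_id):
    # A finished job should no longer have a file.
    s3.delete_object(Bucket=BUCKET, Key=job_key(job_store_id))
```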

Some examples:

Loading a job currently uses a jobstore ID to look up the job's attributes in SDB: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py

This would need to change to using the jobstore ID to fetch a bucket file by bucket name (the AWS jobstore name) and key.

The same applies to deleting a job: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py or listing jobs: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py#L322
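In boto3 terms those operations might translate to something like the following sketch (again with a hypothetical bucket name and key layout, not the real jobStore.py signatures):

```python
import pickle
import boto3

s3 = boto3.client("s3")
BUCKET = "my-toil-jobstore"  # hypothetical

def load_job(job_store_id):
    # Fetch the job's attributes by bucket name and key instead of an SDB query.
    response = s3.get_object(Bucket=BUCKET, Key=f"jobs/{job_store_id}")
    return pickle.loads(response["Body"].read())

def delete_job(job_store_id):
    s3.delete_object(Bucket=BUCKET, Key=f"jobs/{job_store_id}")

def list_jobs():
    # Replace the SDB select with a paginated listing under the jobs/ prefix.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="jobs/"):
        for obj in page.get("Contents", []):
            yield obj["Key"].rsplit("/", 1)[-1]
```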

There are also some odd spots where not finding a job needs to be handled in an SDB-specific way, for example: https://github.com/DataBiosphere/toil/blob/98dbf33147ed029800dd43a73ee5c64e83feda7b/src/toil/leader.py#L973
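With S3, the analogue would be translating S3's not-found error instead of SDB's. A sketch (NoSuchJobException is Toil's existing exception; the rest is illustrative):

```python
import boto3
from botocore.exceptions import ClientError
from toil.jobStores.abstractJobStore import NoSuchJobException

s3 = boto3.client("s3")

def load_job_bytes(bucket, key):
    try:
        return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    except ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            # Translate the S3-specific miss into Toil's own exception,
            # mirroring how leader.py special-cases the SDB miss today.
            raise NoSuchJobException(key)
        raise
```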

If you want to tackle this, or a portion of it, we'd be happy to have the help and I'd be glad to review code progress on this as well.

DailyDreaming commented 3 years ago

@abatilo We have sprint planning tomorrow and I'm going to propose putting this into the upcoming sprint.

abatilo commented 3 years ago

That's awesome. Thank you

abatilo commented 3 years ago

@DailyDreaming Could we still consider DynamoDB? S3 has throughput limits that might become problematic.

DailyDreaming commented 3 years ago

@abatilo Yes, that's certainly still a possibility. What kind of limits concern you? A quick search indicates S3 supports 3,500 PUT requests per second and 5,500 GET requests per second, per prefix. I'm not sure we're going to be hitting those limits, though it does look like DynamoDB has higher limits.

abatilo commented 3 years ago

Members of my informatics team have told me that, with our current usage of S3, we've had pipelines fail due to hitting S3 limits. I haven't had time to dig in yet, but that's why I wanted to bring it up here.

DailyDreaming commented 3 years ago

I see. The database is there more to enforce strong consistency, so I'd have to investigate how much the request rate would increase (which I suspect would mostly come from HEADing a file to check for existence, rather than from checking the DB).
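For reference, the kind of existence check in question might look like this sketch; note that HEAD requests count against the same per-prefix limit as GETs:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def exists(bucket, key):
    # One HEAD request per check; these fall under the 5,500/s GET/HEAD
    # per-prefix limit discussed above.
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["ResponseMetadata"]["HTTPStatusCode"] == 404:
            return False
        raise
```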

unito-bot commented 2 years ago

➤ Adam Novak commented:

Since S3 is strongly consistent now, we’re planning to just use that and not DynamoDB.

stain commented 2 years ago

Will it be possible to use S3 backends other than AWS?

Guigzai commented 9 months ago

Hello,

It would be interesting to remove the Amazon dependency so that Toil can be used on on-premises Kubernetes platforms, and therefore to replace SDB with something other than an Amazon solution like DynamoDB.

Would it be possible to consider solutions like Redis, etc.?

Regards

unito-bot commented 9 months ago

➤ Adam Novak commented:

Lon is making a cool control flow diagram for this.

davidjsherman commented 9 months ago

We've been following this issue for a long time, hoping that a strongly consistent S3 backend, as mentioned by @unito-bot, would be adopted.

Specifically, we'd like to use Ceph's S3-compatible object storage, which guarantees strong consistency. Ceph is a common cluster storage solution for on-premises Kubernetes, since the Rook operator does the heavy lifting of deploying it.
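For what it's worth, boto3 can already be pointed at a non-AWS endpoint, so an S3-backed job store could in principle target Ceph's RADOS Gateway. A sketch with placeholder endpoint and credentials:

```python
import boto3

# Hypothetical: point the S3 client at a Ceph RADOS Gateway instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.internal:7480",  # placeholder endpoint
    aws_access_key_id="CEPH_ACCESS_KEY",               # placeholder credentials
    aws_secret_access_key="CEPH_SECRET_KEY",
)
s3.list_buckets()  # the API calls are the same against any S3-compatible backend
```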

adamnovak commented 6 months ago

We have Ceph now at UCSC, and using Ceph directly (instead of through the shared filesystem) might be interesting.

davidjsherman commented 6 months ago

What could we (at Inria) do to contribute?

stxue1 commented 4 months ago

Lon will probably be the one to work on this, though it will be a while before it's added to the sprint. We don't have many internal people using the AWS implementation, so we haven't had much spare development time for this.

We have a vague idea of implementing jobstore plugins, similar to the batch system plugins, so any ideas/recommendations there would be helpful.
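One possible shape for that, sketched by analogy with the batch system plugins (purely hypothetical; none of these names are real Toil APIs):

```python
# Hypothetical sketch of a job store plugin registry.
JOB_STORE_FACTORIES = {}

def register_job_store(scheme):
    """Register a job store class under a locator scheme like 'aws' or 'ceph'."""
    def decorator(cls):
        JOB_STORE_FACTORIES[scheme] = cls
        return cls
    return decorator

def job_store_from_locator(locator):
    # e.g. 'ceph:bucket-name' -> look up the 'ceph' factory, pass it the rest.
    scheme, _, rest = locator.partition(":")
    return JOB_STORE_FACTORIES[scheme](rest)
```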

Community contributions are of course always welcome. Unfortunately, I'm not sure where those contributions should go, as this is Lon's task and I don't know its current progress. If you want, you could ping him and ask.