hannes-ucsc opened 8 years ago
We should keep an eye on this; if they start deprecating SDB, we need to start on a DynamoDB job store replacement.
Any chance that this can be revisited?
@abatilo S3 is now strongly consistent, so this issue is now about replacing SDB with S3. This will probably be worked on relatively soon (sometime in the next few months).
Would it be a big lift? I would be curious to know if I could help.
@abatilo Medium sized, I would guess? It still needs to be explored.
Most of the work would involve removing the current SDB functionality, identifying everything it shuttles back and forth (primarily items with job attributes, representing jobs to be processed), and then remapping those operations to fetch/put files in S3. Jobs would map to job files in S3: the presence of a file signifies a job yet to be run, and a finished job should no longer have a file. Most of the work will be in the https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py file.
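As a rough sketch of that mapping (the `jobs/` key layout and helper names here are hypothetical, not Toil's actual scheme), job creation could look something like this with boto3:

```python
import pickle
import boto3

s3 = boto3.client("s3")

def job_key(job_store_id: str) -> str:
    # Hypothetical layout: one object per pending job.
    return f"jobs/{job_store_id}"

def create_job(bucket: str, job_store_id: str, job_description) -> None:
    # The object's presence signifies a job yet to be run.
    s3.put_object(Bucket=bucket,
                  Key=job_key(job_store_id),
                  Body=pickle.dumps(job_description))
```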
Some examples:
Loading a job currently uses a job store ID to look up the job's attributes in SDB: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py
This would need to change to using the job store ID to fetch an object from the bucket, by bucket name (the AWS job store name) and key.
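A minimal sketch of what that fetch could look like with boto3, again assuming the hypothetical pickled `jobs/` layout from above:

```python
import pickle
import boto3

s3 = boto3.client("s3")

def load_job(bucket: str, job_store_id: str):
    # Fetch the job file by bucket name (the AWS job store name) and key,
    # instead of reading attributes out of an SDB item.
    response = s3.get_object(Bucket=bucket, Key=f"jobs/{job_store_id}")
    return pickle.loads(response["Body"].read())
```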
Same with deleting a job: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py or listing jobs: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py#L322
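Deleting and listing would then become plain object operations, under the same assumed key layout:

```python
import boto3

s3 = boto3.client("s3")

def delete_job(bucket: str, job_store_id: str) -> None:
    # delete_object succeeds even if the key is already gone.
    s3.delete_object(Bucket=bucket, Key=f"jobs/{job_store_id}")

def list_jobs(bucket: str):
    # Paginate, since a listing returns at most 1000 keys per page.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="jobs/"):
        for obj in page.get("Contents", []):
            yield obj["Key"][len("jobs/"):]  # recover the job store ID
```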
There are also some odd spots where a job not being found has to be handled in an SDB-specific way, for example: https://github.com/DataBiosphere/toil/blob/98dbf33147ed029800dd43a73ee5c64e83feda7b/src/toil/leader.py#L973
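On S3 the equivalent "job not found" case would surface as a missing object rather than a missing SDB item; a hedged sketch of handling it:

```python
import pickle
import boto3

s3 = boto3.client("s3")

def load_job_or_none(bucket: str, job_store_id: str):
    try:
        obj = s3.get_object(Bucket=bucket, Key=f"jobs/{job_store_id}")
    except s3.exceptions.NoSuchKey:
        # S3's analogue of SDB's missing item: the object simply isn't there.
        return None
    return pickle.loads(obj["Body"].read())
```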
If you want to tackle this, or a portion of it, we'd be happy to have the help and I'd be glad to review code progress on this as well.
@abatilo We have sprint planning tomorrow and I'm going to propose putting this into the upcoming sprint.
That's awesome. Thank you
@DailyDreaming Could we still consider DynamoDB? S3 has throughput limits which might become problematic.
@abatilo Yes, that's certainly still a possibility. What kind of limits concern you? A first search indicates 3,500 requests per second to PUT data and 5,500 requests per second to GET data on S3. I'm not sure we're going to hit those limits, though DynamoDB does look like it has higher ones.
Members of my informatics team have expressed to me that with the current usage of S3, we've had pipelines fail due to hitting S3 limits. I haven't had time to dig in yet but that's why I wanted to bring it up here.
I see. The database is there mainly to enforce strong consistency, so I'd have to investigate how much the request rate would increase (which I suspect would mostly come from HEAD requests checking whether a file exists, rather than from checking the db).
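For reference, an existence check via HEAD with boto3 might look like this (the bucket and key layout are placeholders):

```python
import boto3
import botocore.exceptions

s3 = boto3.client("s3")

def job_exists(bucket: str, job_store_id: str) -> bool:
    # A HEAD request counts against the GET rate but transfers no object data.
    try:
        s3.head_object(Bucket=bucket, Key=f"jobs/{job_store_id}")
        return True
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise
```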
➤ Adam Novak commented:
Since S3 is strongly consistent now, we’re planning to just use that and not DynamoDB.
Will it be possible to use other S3 backends than AWS?
Hello,
It would be interesting to remove the Amazon dependency so Toil can be used on on-premise Kubernetes platforms, and therefore to replace SDB with something other than an Amazon solution like DynamoDB.
Would it be possible to consider solutions like Redis, etc.?
Regards
➤ Adam Novak commented:
Lon is making a cool control flow diagram for this.
We've been following this issue for a long time, hoping that the strongly consistent S3 backend mentioned by @unito-bot would be adopted.
Specifically we'd like to use Ceph's S3-compatible object storage, which guarantees strong consistency. Deploying Ceph is a common cluster storage solution for on-premises Kubernetes, since the Rook operator does the heavy lifting.
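For what it's worth, boto3 can already be pointed at a Ceph RadosGW (or any S3-compatible endpoint) by overriding `endpoint_url`; the URL and credentials below are placeholders for an on-premises deployment:

```python
import boto3

# Point boto3 at a Ceph RadosGW endpoint instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.internal:7480",
    aws_access_key_id="CEPH_ACCESS_KEY",
    aws_secret_access_key="CEPH_SECRET_KEY",
)
s3.list_buckets()  # same strongly consistent S3 API, served by Ceph
```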
We have Ceph now at UCSC, and using Ceph directly (instead of through the shared filesystem) might be interesting.
What could we (at Inria) do to contribute?
Lon will probably be the one to work on this, though it will be a while before it's added to the sprint. We don't have many internal people using the AWS implementation, so we haven't had much spare development time for this.
We have a vague idea of implementing job store plugins similar to batch system plugins, so any ideas/recommendations there would be helpful.
Community contributions are of course always welcome. Unfortunately, since this is Lon's task and I don't know its current progress, I can't say exactly where contributions could go. If you want, you could ping him and ask.
Quoting an AWS support professional in case 1767267511:
If Toil is to run on newly opened AWS accounts, we need to phase out SimpleDB.
I propose that we create a new, second implementation of the AWS job store that uses DynamoDB. The new implementation should be accessible under the `aws` job store locator, while the old one becomes `aws_old`. The reason I didn't use DynamoDB in the first place was the payment model, which is based on a flat rate as a function of a configurable ("provisioned" in Amazon lingo) request volume. Toil would have to set that request volume to a user-specified value (with a sensible default) before a workflow starts and make sure that it configures it back to the lowest possible value on exit.
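If the DynamoDB route were taken, adjusting provisioned throughput around a workflow run could be a pair of `update_table` calls; the table name and capacity numbers here are illustrative:

```python
import boto3

dynamodb = boto3.client("dynamodb")

def set_throughput(table: str, reads: int, writes: int) -> None:
    # Raise provisioned capacity before the workflow starts, and call this
    # again with minimal values (e.g. 1/1) on exit to stop paying for it.
    dynamodb.update_table(
        TableName=table,
        ProvisionedThroughput={
            "ReadCapacityUnits": reads,
            "WriteCapacityUnits": writes,
        },
    )
```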
Issue is synchronized with this Jira Story (friendlyId: TOIL-350)