The DCD mapping utility is wrapped in a small FastAPI application.
It is invoked by a new MaveDB worker task.
These will initially reside on the same worker node, but this setup allows the greatest degree of flexibility for other deployment scenarios should we need to scale up.
The worker will be responsible for regulating the flow of requests to the mapping API. Since the mapping API does not rely heavily on external resources that it must wait for, but spends most of its time using the CPU, we think it will make sense to run one API request at a time.
We have decided to host a persistent seqrepo instance on a shared EBS disk.
We would prefer not to have to treat this seqrepo instance's content as part of the MaveDB data set. It seems that there is no use case for which we must use GA4GH identifiers (which are sequence hashes) to look up sequences that we have added to this seqrepo instance. But we have not drawn a conclusion about this yet.
Even if, as we hope, the seqrepo instance does not need to persist as part of the MaveDB data set, it is more efficient operationally to have a single instance that normally persists. This avoids the long seqrepo setup time that would interrupt operations if the seqrepo instance were, for instance, created as a Docker container. It also allows us to continue operating when the external seqrepo data sources are unavailable, even if our worker node is recycled.
Develop a new worker task that invokes the variant mapping utility. Trigger execution for every new score set.