ESPRI-Mod / synda

ESGF Downloader (this is a deprecated repository, the tool has now moved to https://github.com/ESGF/esgf-download)
https://espri-mod.github.io/synda/

more data transfer bottlenecks, solve with redesign #133

Open painter1 opened 4 years ago

painter1 commented 4 years ago

In issue #132 I described a serious performance issue and a solution in which database references are cached. Some changes in the design of Synda may help with this, as well as with other issues (one I have in mind is that from time to time I see failures, even crashes, because "database is locked"). There are two design improvements which would probably solve the performance issue:

  1. Parallelize by data_node. There would be a separate "event loop" for each data node. Each loop would in turn fork off threads for individual files, as is presently done. Most of the "serial bottleneck" issues with data transfer would be avoided if each data_node did not have to wait for all the others to be processed.
  2. Change from SQLite to PostgreSQL. Postgres has a reputation for being good at handling multiple simultaneous database accesses. This change would be necessary for item 1 to work well.
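To make item 1 concrete, here is a minimal sketch of one loop per data node, each with its own small pool of per-file threads. It is not Synda's code: `fetch`, `data_node_loop`, `download_all`, and the `workers_per_node` parameter are hypothetical names, and `fetch` is only a placeholder for the real transfer.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fetch(file_url):
    # Placeholder for the real transfer (hypothetical); returns the URL it "downloaded".
    return file_url


def data_node_loop(data_node, file_urls, results, lock, workers_per_node=2):
    # One loop per data node: it waits only on its own files,
    # never on files served by another data node.
    with ThreadPoolExecutor(max_workers=workers_per_node) as pool:
        for done in pool.map(fetch, file_urls):
            with lock:
                results[data_node].append(done)


def download_all(pending):
    # pending: iterable of (data_node, file_url) pairs.
    by_node = defaultdict(list)
    for data_node, file_url in pending:
        by_node[data_node].append(file_url)

    results = defaultdict(list)
    lock = threading.Lock()
    loops = [
        threading.Thread(target=data_node_loop, args=(node, urls, results, lock))
        for node, urls in by_node.items()
    ]
    for loop in loops:
        loop.start()
    for loop in loops:
        loop.join()
    return dict(results)
```

With this layout a slow data node delays only its own queue. The shared `results` dictionary stands in for the database, which is where concurrent access becomes the real concern.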

It is obvious that this proposal, especially item 2, involves a huge amount of work. That is why I didn't do it. Some day we may have enough other reasons to change databases that we will go ahead and do it.

AtefBN commented 4 years ago

Hi Jeff, I am currently working on adding an ORM layer to Synda with SQLAlchemy. This should make the shift from SQLite to PostgreSQL relatively easy. However, implementing parallel access to the database needs to be handled carefully. I think the general feeling in the Synda user base is that SQLite is reaching its limit. This would add extra installation requirements for Synda, but I can't see any other option.
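As an illustration of why parallel database access needs care, one common way to avoid "database is locked" while still on SQLite is to funnel every write through a single writer thread. The sketch below assumes that approach; it is not what Synda does. The `file` table, its columns, and the names `writer_loop` and `record_transfers` are hypothetical.

```python
import queue
import sqlite3
import threading

_STOP = object()  # sentinel telling the writer to shut down


def writer_loop(db_path, jobs):
    # The only thread that ever writes to the database.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file (url TEXT PRIMARY KEY, status TEXT)"
    )
    while True:
        job = jobs.get()
        if job is _STOP:
            break
        url, status = job
        conn.execute(
            "INSERT OR REPLACE INTO file (url, status) VALUES (?, ?)",
            (url, status),
        )
        conn.commit()
    conn.close()


def record_transfers(db_path, urls, n_workers=4):
    jobs = queue.Queue()
    writer = threading.Thread(target=writer_loop, args=(db_path, jobs))
    writer.start()

    def worker(chunk):
        # Transfer threads never touch the database; they only enqueue updates.
        for url in chunk:
            jobs.put((url, "done"))

    workers = [
        threading.Thread(target=worker, args=(urls[i::n_workers],))
        for i in range(n_workers)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    jobs.put(_STOP)
    writer.join()
```

The trade-off is that writes stay serial, so this removes lock contention rather than adding write throughput. A server database such as PostgreSQL is what would allow concurrent writers.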