I2PC / scipion

Scipion is an image processing framework to obtain 3D models of macromolecular complexes using Electron Microscopy (3DEM)
http://scipion.i2pc.es
Other
76 stars 47 forks source link

Support of Postgresql database (or any other type of database) #2072

Open ammarik opened 4 years ago

ammarik commented 4 years ago

Hello, is it possible to use for example Postgresql database instead of Sqlite database? I've found that there is a Postgresql mapper (https://github.com/I2PC/scipion/blob/master/pyworkflow/mapper/postgresql.py), but it seems to be outdated and unmaintained. I also found that database layer does not seem to be well separated from other parts of the code (manipulation with Sqlite db files at various parts of code, ...), there is no abstraction, which is a terrible sw design and it hinders of simple addition of any new db implementation.

We would love to use a different type of database than Sqlite. Is there any way to use database like Postgresql, Mysql, ... instead of the Sqlite db? Are you planning to add possibility to use some other database?

delarosatrevin commented 4 years ago

Dear @ammarik

We started some proof of concept using Postgress, but it is far from complete. The Mapper layer is designed to be abstracted from the database, but as you already noticed, there are a few places where it is not so well isolated but should be only a few.

Regarding the use of other SQL engines, you should notice that we have two use of the mapper. One is for the project entities and their relations. The other is for each individual set. For the second, I think it can be complicated to use an SQL server that is not independent file, but it should be possible.

Anyway, Sqlite performs well in most of the cases, so I don't think this is a priority right now. Do you have any particular problem that you are interested in other mapper implementations?

ammarik commented 4 years ago

Dear @delarosatrevin, thank you for your answer. Where can I find the proof of concept that you mentioned? Is there separate branch with it, or is it the postgresql mapper that is in the master branch?

I would like to execute different protocols on different nodes (we have multiple computing nodes) and at the same time I don't want to copy sqlite databases between nodes. To accomplish this, one of things I need, Is to have a central (e.g. Postgresql) database.

Do you think this usecase is achievable?

I know that there is a support for Slurm queue system, but I think it isn't suitable for my usecase.

pconesa commented 4 years ago

Hi @ammarik can you share the filesystem across your nodes?

ammarik commented 4 years ago

Hello, I tried to use AWS S3 storage mounted to AWS EC2 instances, but in the case of large projects, with many protocols, behavior of the Scipion was slow. I guess it was caused by frequent operations (accessing, copying) with sqlite files. That's why I'm trying to use database server.

I would like to be able execute different protocols on different AWS EC2 instances, while all the instances are connected to one database. I would like to distribute protocol executions to multiple different instances (computing nodes). For example I would like to be able to execute protocol that requires GPU on GPU EC2 instance and at the same time execute protocol that doesn't require GPU on some non-GPU instance, both connected to the same database.

delarosatrevin commented 4 years ago

Thanks @ammarik for sharing you use-case. If you have a shared filesystem it might work, also right now it is not possible to send jobs to different machines without having a queueing system. Reducing the amount of I/O operations is something we should check in general, not only for the sqlite databases.

On Mon, Jan 13, 2020 at 9:55 AM ammarik notifications@github.com wrote:

Hello, I tried to use AWS S3 storage mounted to AWS EC2 instances, but in the case of large projects, with many protocols, behavior of the Scipion was slow. I guess it was caused by frequent operations (accessing, copying) with sqlite files. That's why I'm trying to use database server.

I would like to be able execute different protocols on different AWS EC2 instances, while all the instances are connected to one database. I would like to distribute protocol executions to multiple different instances (computing nodes). For example I would like to be able to execute protocol that requires GPU on GPU EC2 instance and at the same time execute protocol that doesn't require GPU on some non-GPU instance, both connected to the same database.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/I2PC/scipion/issues/2072?email_source=notifications&email_token=ACAM6WAGG657DIT63JSPYELQ5QT7ZA5CNFSM4KFGD5H2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIX6FSA#issuecomment-573563592, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACAM6WB44R5H5BFWGHQWGVDQ5QT7ZANCNFSM4KFGD5HQ .