cryoem / eman2

A scientific image processing software suite with a focus on CryoEM and CryoET

Please provide a list of EMAN2 dependencies #393

Closed samfux84 closed 5 years ago

samfux84 commented 5 years ago

Hi,

The list of required dependencies for EMAN2 is available neither on https://blake.bcm.edu/emanwiki/EMAN2 nor on GitHub.

Install dependencies

conda install cmake=3.9 -c defaults
conda install eman-deps=14 -c cryoem -c defaults -c conda-forge

Not everybody uses conda. On many HPC clusters, conda is not available; on our HPC cluster, we do not support it. Therefore, could you please provide a list of the dependencies required for EMAN2?

Best regards

Sam

shadowwalkersb commented 5 years ago

Please see the list used to build eman-deps: https://github.com/cryoem/eman-deps-feedstock/blob/0a4718059822ff88bd5778249f4bdb7555e5175f/recipe/meta.yaml#L11-L41. This link is now available in the wiki instructions for source builds.
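For readers who want to pull that dependency list out of the recipe programmatically, here is a rough sketch (my own, not part of EMAN2). Conda recipes are Jinja-templated YAML, so a strict YAML parser may reject them; this simply scans a locally saved copy of the recipe for its `run:` block.

```python
# Rough sketch (not an EMAN2 tool): extract the runtime dependency names
# from a conda recipe such as the eman-deps-feedstock meta.yaml linked
# above, by scanning the "run:" block line by line.
def run_requirements(recipe_text):
    deps, in_run = [], False
    for line in recipe_text.splitlines():
        stripped = line.strip()
        if stripped == "run:":
            in_run = True
        elif in_run:
            if stripped.startswith("- "):
                deps.append(stripped[2:].strip())
            elif stripped.endswith(":"):
                break  # next section; the run list has ended
    return deps

# Tiny made-up recipe fragment to illustrate the shape of the output:
example = """\
requirements:
  build:
    - cmake
  run:
    - numpy
    - boost
test:
  imports:
    - EMAN2
"""
print(run_requirements(example))  # -> ['numpy', 'boost']
```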

samfux84 commented 5 years ago

Thank you, this is exactly what I was looking for.

sludtke42 commented 5 years ago

Hi Sam, I'd like to clarify that, while you are welcome to try to tackle a non-Anaconda installation yourself if you like, we no longer officially support this approach, for source or binary installs. Since others may read this, let me explain the reasoning:

A) No "support" is required for this to work. An end user should be able to put one of our Anaconda-based binaries in their home directory on a cluster, and it should "just work". The only potential exception to this statement is MPI/BQS issues, but there are straightforward instructions for handling this in most cases.

B) Anaconda, like a Docker instance, contains ALL of its dependencies, so no system-level installations are required at all.

C) It doesn't interfere with anything else on the system. If the Anaconda bin folder is in your PATH, you can use it. If it isn't, you can't, and it won't interfere with anything else. PATH is the only shell variable involved, so it is very well suited to the module systems used on many clusters.

D) While a user can put it in their home directory, it can also be installed system-wide with the same "just set PATH" philosophy.

E) Several other CryoEM software packages have begun considering Anaconda as a distribution environment as well. This offers our community the possibility of eventually getting things integrated and compatible. Anaconda offers tools for nicely dealing with dependency issues within Anaconda in a cross-platform compatible way.

F) Anaconda is increasingly being used in bioscience teaching curricula for learning Python, R and sometimes other languages. Its excellent integration of Jupyter Lab makes it really easy for non-programmers to get started with simple programming.

G) The only real disadvantage of Anaconda is its installed size. Since it includes all dependencies, the full EMAN2 install on my Mac (source environment) is ~4 GB. However, given that a typical CryoEM project nowadays involves 1-10 TB of data, we think this remains tolerable, and it has made problems with binary installations so much easier overall that for us it is well worth the trade.

Anyway, regardless of how you handle it, good luck!


Steven Ludtke, Ph.D. <sludtke@bcm.edu>
Baylor College of Medicine
Charles C. Bell Jr., Professor of Structural Biology
Dept. of Biochemistry and Molecular Biology (www.bcm.edu/biochem)
Academic Director, CryoEM Core (cryoem.bcm.edu)
Co-Director, CIBR Center (www.bcm.edu/research/cibr)


samfux84 commented 5 years ago

@sludtke42: Thank you for your reply.

I can fully understand that Anaconda simplifies the installation of software for many users. But I still think that it is important to provide the list of dependencies for people who do not use Anaconda. There is no need to provide installation instructions for any of the dependencies, but without knowing what a piece of software depends on, it can be quite difficult to install.

I work as an application specialist managing more than 300 applications and libraries on our university's HPC cluster (we use the SPACK package manager, https://spack.readthedocs.io/, for installations).

I prefer not to use Anaconda, as a simple installation can easily contain 100'000 small files, which is not optimal for high-performance file systems that are optimized for big files.

I would like to add some comments regarding some of your points A)-G)

A) Our users have a quota of 100'000 files/directories for their home directory (because there is a nightly backup running for ca. 2500 users, and before we introduced the quota, the backup was not finishing overnight because some users had several million files in their home). An Anaconda (or even a Miniconda) installation easily exceeds the quota and then the users complain that they can no longer write new files to their home directory.

Therefore this is not really helping the cluster users as long as Conda installations contain that many files.
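For anyone wanting to check an install tree against such an inode quota before it bites, a minimal sketch (my own; the example path is a placeholder, not a required location):

```python
# Sketch: count files + directories under an install prefix, to compare
# against a home-directory file/inode quota before unpacking something
# large there.
import os

def count_entries(root):
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total

# e.g.: count_entries(os.path.expanduser("~/miniconda3"))
```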

B) We already provide a large number of centrally installed packages, therefore no need for the users to do redundant installations in Anaconda:

https://scicomp.ethz.ch/wiki/Leonhard_applications_and_libraries https://scicomp.ethz.ch/wiki/Euler_applications_and_libraries

E) For this purpose, we use SPACK, https://spack.readthedocs.io

G) On an HPC cluster, install size is not too important, as storage space has become very cheap. What matters on HPC file systems is the number of files, as random I/O across a large number of files kills the performance of every HPC file system.

sludtke42 commented 5 years ago

Hi Sam, I realize there may not be a point to continuing this, as you should have the dependencies now, but we might as well finish the discussion, again as a reference for the next person with a similar intent.

On May 28, 2019, at 1:47 AM, Samuel Fux notifications@github.com wrote:

> @sludtke42: Thank you for your reply.

> I can fully understand that Anaconda simplifies the installation of software for many users. But I still think that it is important to provide the list of dependencies for people that do not use anaconda. There is no need to provide installation instructions for any of the dependencies, but without knowing what a software depends on, it might be quite difficult to install it.

The dependency list you now have, generated from Anaconda, is one of two things:

A) All of the libraries in the entire Anaconda package which gets installed with EMAN2. This could include many things which are required by Anaconda, but not actually required by EMAN2.

B) Only the dependencies we install explicitly on top of the base Anaconda install. This list would exclude any dependencies which are part of the standard Anaconda install.

So neither one is exactly perfect. If you install list A, which is guaranteed to be complete, then you effectively have Anaconda.

The other point is whether things will still function properly at all outside of an Anaconda environment. They certainly would have at one point (we only adopted Anaconda a couple of years ago), but we have not paid attention to changes which may have made non-Anaconda installs break.

> I work as application specialist managing more than 300 applications and libraries on the HPC cluster of our university (we use the SPACK package manager, https://spack.readthedocs.io/ for installations).

I do understand. My group also manages the cluster co-op at BCM, and while we certainly have fewer users than the central clusters for a major university, I do understand the issues involved. SPACK is fine, of course, but doesn't really reduce the number of files installed, as it "installs every unique package/dependency configuration into its own prefix". That is just to say that it is very similar to Anaconda in this respect.

> I prefer to not use Anaconda as a simple installation can easily contain 100'000 small files, which is not optimal for high-performance file systems that are optimized for big files.

This is certainly true. The EMAN2/Anaconda install does include 100,000+ files. However, these files would still exist when all of the dependencies are installed. While you will save in cases where a specific version of a dependency can be shared among packages via SPACK, I suspect that in practice this will be relatively rare, as EMAN2 will generally specify a specific version of each dependency known to work.

Additionally, while there are a lot of files (for example, ~14,000 as part of Boost), only a tiny number of these are accessed at runtime. If we could readily identify only the files which were actually accessed during an arbitrary EMAN2 session, I'm guessing the number of needed files would fall to ~5000 or so. From the perspective of Lustre or other parallel filesystems, the large number of files is really only a problem if they are actually accessed.
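One rough way to estimate that working set is to compare file access times before and after a session. A sketch (my own, not an EMAN2 tool), with the major caveat that many filesystems are mounted noatime/relatime, so atime may not be updated reliably:

```python
# Rough sketch (not an EMAN2 tool): list files under a prefix whose
# access time is newer than some instant t0, as a lower-bound estimate
# of which files a session actually touched. Caveat: noatime/relatime
# mounts may not update atime at all.
import os

def touched_since(root, t0):
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime > t0:
                    hits.append(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return hits
```

Usage would be: record `t0` before launching the session, run it, then call `touched_since(install_prefix, t0)` afterwards.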

> I would like to add some comments regarding some of your points A)-G)

> A) Our users have a quota of 100'000 files/directories for their home directory (because there is a nightly backup running for ca. 2500 users, and before we introduced the quota, the backup was not finishing overnight because some users had several million files in their home). An Anaconda (or even a Miniconda) installation easily exceeds the quota and then the users complain that they can no longer write new files to their home directory.

> Therefore this is not really helping the cluster users as long as Conda installations contain that many files.

Not my business to get into your institution's cluster management policies, but I'll say that in many fields, like genomics/bioinformatics, a large number of small files is simply how everything is configured to operate. There are cluster configuration strategies and policies that can deal with this sort of thing, but everyone's user base is different.

From an EMAN2 perspective (image processing, not bioinformatics), a typical high resolution project starts with 1-10 TB of raw data (often processed on a workstation), consisting of typically 1000 - 10,000 micrographs. After initial preprocessing and "particle picking" this data is typically reduced to ~100 GB of particle data which is actually processed on a cluster. After completing processing on a project like this, the project will easily contain 50,000 - 100,000 files, and perhaps 200 GB of total storage.

While some users will certainly have smaller projects, with only ~5000 files and 20-30GB of storage, the sorts of problems people with NIH R01 funding for structural biology research will tackle are definitely going to be more like the previous case. People who are doing serious CryoEM research have fairly significant computational/storage requirements. The workstation under my desk has 75 TB of high performance (~1GB/s) storage on it, and one of my cluster accounts has ~500,000 files (~7 TB) in active storage, and CPU usage is typically ~100,000 - 200,000 CPU-hr/yr.

Why am I bugging you with this? If someone is asking to use EMAN2, Relion, CryoSPARC, CISTEM or one of the other CryoEM software packages, it is likely that someone is planning to start doing CryoEM research, and you are going to start running into users with similar needs, and eventually you'll either have to deal with them or force them off your clusters onto some other resources.

> B) We already provide a large number of centrally installed packages, therefore no need for the users to do redundant installations in Anaconda:
>
> https://scicomp.ethz.ch/wiki/Leonhard_applications_and_libraries
> https://scicomp.ethz.ch/wiki/Euler_applications_and_libraries

Again, nothing against SPACK, or against centrally installing EMAN2 on a cluster. EMAN2 is distributed as part of SBGRID and other cluster distributions. My point was not that users SHOULD install it in their own account, simply that this was possible, and on many less professionally managed clusters, it may be their only effective option.

> E) For this purpose, we use SPACK, https://spack.readthedocs.io

Yep, I understand. Again, this doesn't save you anything if there are specific version requirements on the dependencies which trigger SPACK to create an independent branch. You may get some reduction of redundancy this way, but it may not be nearly as much as you hope.

> G) On a HPC cluster, install size is not too important as storage space has become very cheap. What matters on HPC file systems is the number of files, as random I/O with a large number of files kills the performance of every HPC file system.

Trust me, I do understand. While our operation is much smaller (~150 users), they are split evenly between bioinformaticians and structural biologists. Both groups tend to have hundreds of thousands of files, and it's pretty much unavoidable due to the data processing conventions in both fields.

While you are correct that random I/O on 100,000 files will bring most filesystems to their knees, this isn't actually what happens. While a project in CryoEM may include 100,000 files, in a typical job, perhaps 10,000 of those files will actually be accessed, and that access is usually largely sequential, spanning jobs which may run for 24 - 48 hours on a couple of hundred cores. So, the actual impact on the filesystem isn't so terrible when it comes to running jobs.

Coming up with a good backup strategy, however, can be a much more complex issue, particularly if you have to check the timestamps on 10,000,000 files on a Lustre filesystem :^(



samfux84 commented 5 years ago

@sludtke42: Thank you for your reply and for taking the time for this discussion.

Please do not get me wrong: I don't want you to change anything with regard to EMAN2 (except that the list of dependencies is published on the wiki, which is now the case).

> Not my business to get into your institutions cluster management policies, but I'll say that in many fields, like genomics/bioinformatics, a large number of small files is simply how everything is configured to operate. There are cluster configuration strategies and policies that can deal with this sort of thing, but everyone's user base is different.

I think there is a misunderstanding, and I am sorry for not being clearer about this. The 100'000 files/directories quota only applies to home directories. For guest users on our cluster, this is the only permanent storage they have. For research groups that invested in our HPC cluster, there are other file systems (NetApp, Lustre) where they can have hundreds of terabytes of data and millions of inodes.

My point with regard to the quota is just that, on many clusters, guest users have limits, and if the software they would like to use is not installed centrally, they can hardly install anything in their home directory that requires a Conda installation for the dependencies. For everybody else I don't see any problem.

Best regards

Sam

sludtke42 commented 5 years ago

OK, like I said, not really my business; I just thought archiving the discussion would resolve points others may ask about in the future. Cheers



Icecream-blue-sky commented 3 years ago

> Please, see the list used to build eman-deps, https://github.com/cryoem/eman-deps-feedstock/blob/0a4718059822ff88bd5778249f4bdb7555e5175f/recipe/meta.yaml#L11-L41. This link, now, is available in the wiki instructions for source builds.

Hi, shadowwalkersb! Are these dependencies complete? When I follow https://blake.bcm.edu/emanwiki/EMAN2/COMPILE_EMAN2_ANACONDA and run conda create -n eman2 eman-deps-dev -c cryoem -c defaults -c conda-forge to install the EMAN2 dependencies on Windows (just to see what the dependencies are, so I can then install them manually on Linux), the installed dependencies differ from the list you provided (https://github.com/cryoem/eman-deps-feedstock/blob/0a4718059822ff88bd5778249f4bdb7555e5175f/recipe/meta.yaml#L11-L41). Why is that?

shadowwalkersb commented 3 years ago

The link on the Wiki is the correct one.