ToniWestbrook / paladin

Protein Alignment and Detection Interface
MIT License

paladin prepare work around - pre-indexed UniRef90 available? #44

Closed rmondav closed 3 years ago

rmondav commented 3 years ago

Like others, I ran into memory issues with paladin prepare when running it on a 256G node. UniRef90 v.2021_03 is currently 31G and the packed file (.pac) is 85G. I have made a special request to use a 1 TB node for 12 hours. Questions:

  1. Does 12 hours seem a reasonable time?
  2. My job has already been queued for 2 days and has not yet been allocated a run time, so it may be a week or more before I can get an indexed database. It was suggested at one point that a pre-indexed UniRef90 might be made available. Does one exist?

rmondav commented 3 years ago

By borrowing a colleague's compute time I managed to get access to a fat node queue sooner. The indexing took around 3.5 hours on a 1 TB node.

rmondav commented 3 years ago

Re-opening, as it turns out paladin prepare didn't index the database.

[M::command_align] Loading the index for reference '/path/to/paladin_indexed_uniref90_2103/uniref90.fasta.gz'...
[E::index_load_from_disk] Failed to locate the index files

rmondav commented 3 years ago

FYI for others trying to prepare/index a 2021 UniRef90 database. The first step is to download the gzipped file, which is a single-threaded operation that runs on a single core and takes around 3 hours (you don't want to tie up a whole node for this step). The second step can (optionally) be indexing, which is memory and time intensive, taking around two days on a fat node: on 'my' cluster it ran for 46.5 hours on a 1 TB node, with a maximum memory use of 760.4G. The preparing step, which is a clean-up process, can be run second or third, but I'm running it third as I could not get the indexing to work on a pre-prepared db. It took 1 hour using minimal memory.

So in total, the 2021_03 UniRef90 PALADIN-indexed database takes at least 51 hours to prepare (including download and indexing), needs around 800G of RAM, and is around 400G in file size.

The UniRef90 folder after indexing by BWT should contain the following (with approximate sizes):

32G uniref90.fasta.gz
24b uniref90.fasta.gz.amb
32G uniref90.fasta.gz.ann
254G uniref90.fasta.gz.bwt
43G uniref90.fasta.gz.pac
26b uniref90.fasta.gz.pro
22G uniref90.fasta.gz.sa
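
For anyone who hits the "Failed to locate the index files" error above, a quick pre-flight check before submitting an alignment job can be sketched as follows. This is only a minimal sketch: the suffix list is copied from the directory listing in this thread, not from any official PALADIN specification, and the function name `missing_index_files` is made up for illustration.

```python
from pathlib import Path

# Suffixes copied from the listing in this thread (an assumption,
# not an official list of PALADIN index files).
INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".pro", ".sa")


def missing_index_files(reference):
    """Return the expected index files that are absent next to `reference`."""
    ref = Path(reference)
    # Index files sit beside the reference and append a suffix to its full name,
    # e.g. uniref90.fasta.gz -> uniref90.fasta.gz.bwt
    expected = [ref.with_name(ref.name + suffix) for suffix in INDEX_SUFFIXES]
    return [path for path in expected if not path.is_file()]
```

Calling `missing_index_files("/path/to/paladin_indexed_uniref90_2103/uniref90.fasta.gz")` returns an empty list when every file from the listing is present, and otherwise the paths that still need to be generated.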

ToniWestbrook commented 3 years ago

Hi @rmondav - apologies for the delay in responding to this. I'm glad you got the preparation/indexing done. The time to index has grown over the years as the size of UniRef90 has increased, but about 2 days sounds right. Just a heads up: you shouldn't need to download the file manually. Using the "paladin prepare -r2" command will automatically download the UniRef90 database, index it, and then prepare it.

We don't have a publicly available pre-indexed UniRef90 right now, but it's something we've talked about many times in the past (and since it currently takes at least ~700GB or so of RAM to index, as you saw, it's becoming more of an issue), so I'm going to try to make that happen soon. In the meantime, I also suggest that people filter out entries from UniRef90 that they know they won't be mapping to (e.g. HUMAN) to pare down the size. Let me know if you run into any other issues. Thanks
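
The filtering suggestion above could be sketched roughly like this in Python. It is a sketch under assumptions rather than a PALADIN feature: it assumes the unwanted entries can be recognised by a token in the FASTA header line (for example `_HUMAN`, as it would appear in a `RepID=` field), which should be checked against the headers in your own download. The function name `filter_fasta` and the file names in the usage note are hypothetical.

```python
import gzip


def filter_fasta(src_path, dst_path, exclude_token):
    """Copy a gzipped FASTA, skipping records whose header contains exclude_token.

    Streams line by line, so memory use stays small even for a ~30G input.
    Returns (kept, dropped) record counts.
    """
    kept = dropped = 0
    keep_record = False
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            if line.startswith(">"):
                # A header line starts a new record; decide once per record.
                keep_record = exclude_token not in line
                if keep_record:
                    kept += 1
                else:
                    dropped += 1
            if keep_record:
                dst.write(line)
    return kept, dropped
```

For example, `filter_fasta("uniref90.fasta.gz", "uniref90.filtered.fasta.gz", "_HUMAN")` would write a smaller reference that can then be indexed in place of the full file. A plain substring match is used here for simplicity; a stricter match on a specific header field may be safer if the token could also occur in protein names.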