Rinoahu / SwiftOrtho

A high performance tool to identify orthologs and paralogs across genomes.
GNU General Public License v3.0
27 stars 11 forks source link

A couple of questions #15

Open genomeboy opened 1 year ago

genomeboy commented 1 year ago

Hi there,

I have a couple of questions...

1) I am running SwiftOrtho on around 700 bacterial genomes. I substituted step 1 with a diamond run (with a suitably increased max-hits pram) and by running the search in a highly parallelized way, got the initial searches done in an hour or so. So a suggestion might be to explicitly incorporate diamond into the pipeline.

2) Step 2 is taking a long time however, looking at the number of lines in (and rate of growth of) the "COs" file, if the analysis continues in a linear manner (and if my estimate of the expected number of best reciprocal matches is roughly correct) this step could take months. (while it only took a few minutes for a subset of 30 genomes that I ran as a test). So the question here is whether there is any way to accelerate this step. (EDIT: "come non detto" (pretend I didn't write this as we say in Italian), my projections were way off, and the step 2 run finished while I was writing the above... 6 hours total on a single core, so no complaints at all.)

3) I am currently on an mac M1 architecture. When I ran step 3 on a test set yesterday (40 genomes)I got the following.... although the run apparently finished and provided a very plausible set of orthology clusters:

% python3 /Users/genomeboy/SwiftOrtho/bin/find_cluster.py -i all_v_all_modified.out.orth -a apc > all_v_all_modified.out.orth.apc

/Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:408:30: error: use of undeclared identifier 'MAP_GROWSDOWN' o = _cffi_from_c_int_const(MAP_GROWSDOWN); ^ /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:408:30: error: use of undeclared identifier 'MAP_GROWSDOWN' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:408:30: error: use of undeclared identifier 'MAP_GROWSDOWN' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:408:30: error: use of undeclared identifier 'MAP_GROWSDOWN' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:408:30: error: use of undeclared identifier 'MAP_GROWSDOWN' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:408:30: error: use of undeclared identifier 'MAP_GROWSDOWN' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:408:30: error: use of undeclared identifier 'MAP_GROWSDOWN' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:422:30: error: use of undeclared identifier 'MAP_HUGETLB' o = _cffi_from_c_int_const(MAP_HUGETLB); ^ /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:422:30: error: use of undeclared identifier 'MAP_HUGETLB' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:422:30: error: use of undeclared identifier 'MAP_HUGETLB' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:422:30: error: use of undeclared identifier 'MAP_HUGETLB' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:422:30: error: use of undeclared identifier 'MAP_HUGETLB' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:422:30: error: use of undeclared identifier 'MAP_HUGETLB' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:422:30: error: use of undeclared identifier 'MAP_HUGETLB' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:436:30: error: use of undeclared identifier 'MAP_LOCKED' o = _cffi_from_c_int_const(MAP_LOCKED); ^ /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:436:30: error: use of undeclared identifier 'MAP_LOCKED' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:436:30: error: use of undeclared identifier 'MAP_LOCKED' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffix79ef103x35d6d3be.c:436:30: error: use of undeclared identifier 'MAP_LOCKED' /Users/genomeboy/SwiftOrtho/bin/pycache/_cffi__x79ef103x35d6d3be.c:436:30: error: use of undeclared identifier 'MAP_LOCKED' fatal error: too many errors emitted, stopping now [-ferror-limit=] 20 errors generated.

SO I tried to install again, and I see that there is some stuff that looks like conda is trying to install as if its x86. Is there a way around this.

The strange thing is that an installation of swiftortho in a linux box gives the same final file as generated above (without the errors) but took MUCH more time than on the M1 (which is why I am running the large analysis on the mac). So once again, any ideas/comments and also whether the clustering step is going to take forever with 700 (closely related) bacterial genomes?

thanks for any advice.

David

Rinoahu commented 1 year ago
  1. Integrating third-party software will involve licensing issues. I don't have such a plan. But you can use any software if they can generate hits in blast -m8 format.
  2. This happens when there are many duplicated genes in the genome.
  3. It seems like a cffi issue. I am not sure if cffi fully supports M1 processors. You can try MCL cluster. find_cluster.py -i all_v_all_modified.out.orth -a mcl
genomeboy commented 1 year ago

Hi,

Thanks for the rapid reply.

For point 1: Your reply makes perfect sense. I'm using diamond anyway but I now see why you don't directly advertise it! However, this makes me wonder if I should be fixing other parameters in diamond (max hsps per target or anything like that). I didn't manage to understand exactly what parameters your step 1 is sending to blastp. And as a "flippant" comment, I think that most people refer to -m8 as outfmt 6 now - so you are probably nearly as old as I am! ;)

For point 2, as I edited, it actually finished in reasonable time anyway (and for sure it closed out rapidly as the complexity of the problem decreased... NICE!). Looking at the final results I kind of reached the conclusion you provided. In fact, I think my outgroups are too distant and while some of the problem is duplicates, some comes from strange "co-orthology" loops from different core genomes. I also have a load of species-complex specific duplications and probably LGTs. For my next attempt I have removed outgroups and re-filtered ingroups on the basis of genomes that behaved badly in the first run.

For the cffi issue, I am still a bit confused. I tried to install also on a rosetta shell, but got similar issues... some of the (I guess conda related errors) were citing python2.7 (I have NO python 2 installations on the machine)... meh. Like I said, the strange thing is that despite the errors I got the same outfile (apc) from the silicon and from a linux install that seemed to install correctly. Also, I tried the mcl run (on the mac) and got the same cffi errors (but then it crashed out completely). Next time I will try the mcl also on the linux box and see what happens.

Is there any reason to think that the mcl algorithm should be "better" than apc?

Last question (for now). The "-y" parameter on step 2... what exactly is it measuring, % id per HSP or per gene, over the whole length? I ask because I figure that tuning this appropriately might improve both speed and accuracy? I would be grateful if you can provide insight on this and confirm also that it scales from 0-1.0.

Many thanks

David

Rinoahu commented 1 year ago

So true. I am not used to the latest version of blast. So, I still use old style: legacy_blast.pl blastall -p blastp ...

If mcl has been installed, you can directly use it "mcl foo_input --abc -te 12 -I 1.5". In our experience, mcl usually generates fewer clusters than apc. But I don't think the difference between mcl and apc can impact the conclusion. mcl also runs faster than apc because it supports multiple thread. apc is designed for large-scale genomes with limited RAM (for example, 4GB) which uses disk as cache and runs slower than mcl. If you have enough computational resources, you can try mcl first.

The -y is the identify of HSP region not the full length.

genomeboy commented 1 year ago

Cheers, much appreciated.

So “-y” is just column 3 (index2) from the m8 format?

I’m slaving away on getting mcl to build on the silicon…., I mean in general the processor is great, but damn, there is so much stuff that doesn’t run native on it.

I admit that I am not as organised as I should be with environments/conda etc, but really, it’s driving me mad! (Not your fault of course)

best

D

On 10 Mar 2023, at 18:17, Xiao @.***> wrote:

So true. I am not used to the latest version of blast. So, I still use old style: legacy_blast.pl blastall -p blastp ...

If mcl has been installed, you can directly use it "mcl foo_input --abc -te 12 -I 1.5". In our experience, mcl usually generates fewer clusters than apc. But I don't think the difference between mcl and apc can impact the conclusion. mcl also runs faster than apc because it supports multiple thread. apc is designed for large-scale genomes with limited RAM (for example, 4GB) which uses disk as cache and runs slower than mcl. If you have enough computational resources, you can try mcl first.

The -y is the identify of HSP region not the full length.

— Reply to this email directly, view it on GitHub https://github.com/Rinoahu/SwiftOrtho/issues/15#issuecomment-1464120794, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6LHZBOAZJVOBYMRLYCRUY3W3NO3XANCNFSM6AAAAAAVVAE6KY. You are receiving this because you authored the thread.

David Stephen Horner

Associate Professor (Bio/11 - Molecular Biology) Dept of Biosciences, Università degli Studi di Milano Via Celoria 26 20133 Milano Italy

@.***

genomeboy commented 1 year ago

Last (haha) question.

So the optimal solution is to run step 2 with pypy, and maybe step 3 with mcl parallel?

Thanks again

D

On 10 Mar 2023, at 18:17, Xiao @.***> wrote:

So true. I am not used to the latest version of blast. So, I still use old style: legacy_blast.pl blastall -p blastp ...

If mcl has been installed, you can directly use it "mcl foo_input --abc -te 12 -I 1.5". In our experience, mcl usually generates fewer clusters than apc. But I don't think the difference between mcl and apc can impact the conclusion. mcl also runs faster than apc because it supports multiple thread. apc is designed for large-scale genomes with limited RAM (for example, 4GB) which uses disk as cache and runs slower than mcl. If you have enough computational resources, you can try mcl first.

The -y is the identify of HSP region not the full length.

— Reply to this email directly, view it on GitHub https://github.com/Rinoahu/SwiftOrtho/issues/15#issuecomment-1464120794, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6LHZBOAZJVOBYMRLYCRUY3W3NO3XANCNFSM6AAAAAAVVAE6KY. You are receiving this because you authored the thread.

David Stephen Horner

Associate Professor (Bio/11 - Molecular Biology) Dept of Biosciences, Università degli Studi di Milano Via Celoria 26 20133 Milano Italy

@.***

Rinoahu commented 1 year ago

So “-y” is just column 3 (index2) from the m8 format? Yes. So the optimal solution is to run step 2 with pypy, and maybe step 3 with mcl parallel? Yes. step 2 is a pure python script and pypy runs faster than cpython. If there is enough RAM, I recommend using mcl (http://micans.org/mcl/index.html) as it runs faster than find_cluster.py.

genomeboy commented 1 year ago

Nice, thank you

On 10 Mar 2023, at 22:35, Xiao @.***> wrote:

So “-y” is just column 3 (index2) from the m8 format? Yes. So the optimal solution is to run step 2 with pypy, and maybe step 3 with mcl parallel? Yes. step 2 is a pure python script and pypy runs faster than cpython. If there is enough RAM, I recommend using mcl (http://micans.org/mcl/index.html) as it runs faster than find_cluster.py.

— Reply to this email directly, view it on GitHub https://github.com/Rinoahu/SwiftOrtho/issues/15#issuecomment-1464502303, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6LHZBJJC4MULA33KHM2PULW3ONCNANCNFSM6AAAAAAVVAE6KY. You are receiving this because you authored the thread.

David Stephen Horner

Associate Professor (Bio/11 - Molecular Biology) Dept of Biosciences, Università degli Studi di Milano Via Celoria 26 20133 Milano Italy

@.***

genomeboy commented 1 year ago

Hi There,

Pre-declared - I am not in any sense a programer … however, Is there not a way (at least when the number of genomes is large), to distribute and then synthesise - in a useful way- the work for step 2?

I appreciate the ideal of minimising ram requirements, but even telephones and crap laptops have 8 or 16 G ram these days, and (at least for laptops up) 4 or more cores….

Not being rude, but, to tell the truth, I have been trying this same work with proteinortho using a serious cluster, with serious parallelisation, time/cpu /ram is not a limit, just that, already, the results I am getting with your code are interesting in that I can’t find prams for proteinortho that avoid over-splitting of clusters (and I have titrated prams using tens of thousands of core hours with huge parallelisation)). If there is a way to increase sensitivity/specificity at a relatively low increase in RAM/cpu hour requirement (potentially involving a minimum of multithreading), then IMHO, that is a direction to take.

(Sorry if this comes over as impolite, it is intended as constructive).

best

D

On 10 Mar 2023, at 22:54, David Stephen Horner @.***> wrote:

Nice, thank you

On 10 Mar 2023, at 22:35, Xiao @.***> wrote:

So “-y” is just column 3 (index2) from the m8 format? Yes. So the optimal solution is to run step 2 with pypy, and maybe step 3 with mcl parallel? Yes. step 2 is a pure python script and pypy runs faster than cpython. If there is enough RAM, I recommend using mcl (http://micans.org/mcl/index.html) as it runs faster than find_cluster.py.

— Reply to this email directly, view it on GitHub https://github.com/Rinoahu/SwiftOrtho/issues/15#issuecomment-1464502303, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6LHZBJJC4MULA33KHM2PULW3ONCNANCNFSM6AAAAAAVVAE6KY. You are receiving this because you authored the thread.

David Stephen Horner

Associate Professor (Bio/11 - Molecular Biology) Dept of Biosciences, Università degli Studi di Milano Via Celoria 26 20133 Milano Italy

@.***

David Stephen Horner

Associate Professor (Bio/11 - Molecular Biology) Dept of Biosciences, Università degli Studi di Milano Via Celoria 26 20133 Milano Italy

@.***

Rinoahu commented 1 year ago

Hi There, Pre-declared - I am not in any sense a programer … however, Is there not a way (at least when the number of genomes is large), to distribute and then synthesise - in a useful way- the work for step 2? I appreciate the ideal of minimising ram requirements, but even telephones and crap laptops have 8 or 16 G ram these days, and (at least for laptops up) 4 or more cores…. Not being rude, but, to tell the truth, I have been trying this same work with proteinortho using a serious cluster, with serious parallelisation, time/cpu /ram is not a limit, just that, already, the results I am getting with your code are interesting in that I can’t find prams for proteinortho that avoid over-splitting of clusters (and I have titrated prams using tens of thousands of core hours with huge parallelisation)). If there is a way to increase sensitivity/specificity at a relatively low increase in RAM/cpu hour requirement (potentially involving a minimum of multithreading), then IMHO, that is a direction to take. (Sorry if this comes over as impolite, it is intended as constructive). best D On 10 Mar 2023, at 22:54, David Stephen Horner @.> wrote: Nice, thank you > On 10 Mar 2023, at 22:35, Xiao @.> wrote: > > > So “-y” is just column 3 (index2) from the m8 format? > Yes. > So the optimal solution is to run step 2 with pypy, and maybe step 3 with mcl parallel? > Yes. step 2 is a pure python script and pypy runs faster than cpython. If there is enough RAM, I recommend using mcl (http://micans.org/mcl/index.html) as it runs faster than find_cluster.py. > > — > Reply to this email directly, view it on GitHub <#15 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6LHZBJJC4MULA33KHM2PULW3ONCNANCNFSM6AAAAAAVVAE6KY. > You are receiving this because you authored the thread. > David Stephen Horner Associate Professor (Bio/11 - Molecular Biology) Dept of Biosciences, Università degli Studi di Milano Via Celoria 26 20133 Milano Italy @. David Stephen Horner Associate Professor (Bio/11 - Molecular Biology) Dept of Biosciences, Università degli Studi di Milano Via Celoria 26 20133 Milano Italy @.

Distribute and then synthesise is complicated, currently, I don't have much time on this.