PedroMTQ / mantis

A package to annotate protein sequences
MIT License

Too many workers activated when running setup_databases #12

Closed: xiekunwhy closed this issue 3 years ago

xiekunwhy commented 3 years ago

Hi,

Too many worker processes are started when running setup_databases, even with -c 1.

python mantis/ setup_databases -c 1
ps xf|grep python|wc -l
131

Is it possible to limit the number of jobs when running setup_databases?

Best, Kun

PedroMTQ commented 3 years ago

Hey @xiekunwhy, you're right: -c currently only works when running the homology search. Please note that the workload during setup is not very heavy (mostly downloading). In any case, in the next update I will add the possibility to use -c when running setup_databases. Please wait until next week, when I will also be releasing the CDS translation feature.
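To illustrate what honoring -c during a download-heavy setup could look like, here is a minimal sketch. It is not Mantis's actual code: `run_setup_tasks` and its `cores` parameter are hypothetical names, and threads are used on the assumption that the setup work is mostly I/O (downloading).

```python
from concurrent.futures import ThreadPoolExecutor


def run_setup_tasks(tasks, cores=1):
    """Run zero-argument callables using at most `cores` worker threads.

    Hypothetical helper for illustration only, not part of Mantis.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    # The pool never starts more than `cores` workers, so cores=1 runs
    # the tasks strictly one at a time.
    with ThreadPoolExecutor(max_workers=cores) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect results in submission order; result() re-raises task errors.
        return [future.result() for future in futures]
```

With a cap like this, the number of concurrent workers is bounded by the value passed on the command line rather than by the number of databases to set up.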

Regards, Pedro

xiekunwhy commented 3 years ago

Hi @PedroMTQ

Glad to hear that, and I hope it is coming soon.

Best, Kun

xiekunwhy commented 3 years ago

It seems that Mantis downloads hmms.tar files from the eggNOG database, but these are much larger than the hmms.tar.gz files. Why not download the hmms.tar.gz files instead to save time?

PedroMTQ commented 3 years ago

Thank you for pointing that out! I hadn't noticed that tar.gz files were available; I will change it in the next update.
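For reference, Python's standard tarfile module reads gzip-compressed archives directly, so switching to the smaller download needs no separate decompression step. The sketch below is only an illustration: `extract_archive` is a hypothetical helper and does not reflect how Mantis handles its downloads.

```python
import tarfile


def extract_archive(archive_path, target_dir):
    """Extract a .tar or .tar.gz archive and return its member names.

    Hypothetical helper for illustration; only use it on archives
    from a trusted source.
    """
    # Mode "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names
```

The same call works for both the plain and the compressed archive, which keeps the change limited to the download URL.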

Regards, Pedro

xiekunwhy commented 3 years ago

Hi @PedroMTQ

Sorry to trouble you again. setup_databases is still running (and has produced ~30 zombie jobs, all shown as [python] defunct), so I cannot test the pipeline properly. I am also looking for a tool to do GO annotation to replace blast2go, so I would like to know whether the Mantis results contain GO annotations.

Best, Kun

PedroMTQ commented 3 years ago

Hi @xiekunwhy,

Yes, GO IDs are output; these originally come from the Pfam and eggNOG databases. If you are only interested in the GO IDs, it might be better to use only the Pfam database with Mantis. To do that, edit the MANTIS.config file as described here: https://github.com/PedroMTQ/mantis/wiki/Configuration#setting-your-own-paths

In your case, you'd set all other reference paths to NA except Pfam.

This will be much faster to set up and run.

I'm currently testing my updated version locally, so I will also take a look at the zombie jobs produced.

Thanks and regards, Pedro

xiekunwhy commented 3 years ago

Thank you for your reply. I will try it.

Another question: will diamond blastx be supported in the next update? Getting accurate CDS and protein sequences is sometimes very difficult when dealing with non-model species transcriptomes.

Best, Kun

PedroMTQ commented 3 years ago

I have added Diamond, but only the blastp option, which does the following: "Align protein query sequences against a protein reference database." I have also added translation of CDS, which I believe is the same concept used by Diamond's blastx. However, whether one uses Mantis's translation or Diamond's, the required input is the same: a FASTA file with CDS. These are then translated (by Mantis or Diamond) and matched against a reference database.

I am not sure what you mean when you say that accurate CDS and protein sequences are sometimes very difficult to obtain for non-model species transcriptomes. Do you mean that gene prediction (and thus obtaining the CDS) is difficult? Function prediction in general will be less reliable for non-model species; however, unless we are talking about very novel species, we should still be able to find good homologs.

xiekunwhy commented 3 years ago

Hi,

Regarding your question whether I mean that gene prediction (and thus obtaining the CDS) is difficult: yes, something like that, though not exactly. I am currently using de novo RNA-seq to study a dozen medicinal plants. Assembled transcripts contain many non-coding sequences, so I need to find the coding regions before using Mantis. I am trying TransDecoder for that, but I think I could skip this step if the annotation tool supported blastx.

Best, Kun

PedroMTQ commented 3 years ago

Hi @xiekunwhy,

I checked Diamond's paper and they do frameshift alignments (predictions on the forward and reverse strands, in three frames each). While there may be use cases for this methodology, we would prefer to limit the introduction of noise (i.e., homology search for sequences which may not correspond to CDS), since this can be avoided by using gene prediction tools instead. Therefore, there's no plan to implement such a feature, as it deviates from Mantis' intended purpose.

Regards, Pedro

xiekunwhy commented 3 years ago

Hi @PedroMTQ ,

Thank you for letting me know. When the new version is released, I will discuss with my colleagues and collaborators whether we can modify it to accept the Diamond blastx method or its results while keeping the main features of Mantis (i.e., text mining), and use it at our own risk after thorough testing.

For de novo RNA-seq assemblies, CDS prediction tools sometimes do not work well, and we found that we may lose some useful annotation information if we follow the pipeline of CDS prediction followed by functional annotation (we need further analysis to know why). De novo RNA-seq will be part of our daily work for the foreseeable future.

Best, Kun

PedroMTQ commented 3 years ago

Hi @xiekunwhy, I think an easy solution would be for you to generate all frames as samples; you could then input them into Mantis and run it normally. Keep in mind that, since your transcripts contain UTRs, the homology search will be noisier. You could then merge the annotations yourself as an additional downstream step. Also note that Diamond, in my local version, is currently only used for TCDB (a new reference for transporter prediction); all other references use HMMs and therefore run with HMMER, not Diamond.
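The "all frames as samples" idea could be sketched roughly as follows, in plain Python with the standard genetic code (NCBI translation table 1). The function names and the FASTA header format are illustrative assumptions, not part of Mantis. Stop codons are written as "*", so the resulting sequences may need to be split or cleaned before a homology search.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order:
# TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate codon by codon; incomplete or ambiguous codons become X."""
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def six_frames(seq):
    """Return {frame_label: protein} for three forward and three reverse frames."""
    seq = seq.upper().replace("U", "T")
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


def frames_to_fasta(name, seq):
    """Format all six frames as FASTA records named <name>_frame<label>."""
    return "".join(
        f">{name}_frame{label}\n{protein}\n"
        for label, protein in six_frames(seq).items()
    )
```

Keeping the frame label in each record name makes it possible to trace an annotation back to its transcript and frame when merging the results downstream.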

Regards, Pedro

xiekunwhy commented 3 years ago

Hi @PedroMTQ ,

Good suggestions, I will try them. Many thanks.

Best, Kun

PedroMTQ commented 3 years ago

Version 1.1 has been released. Translation has been added and the use of cores during setup is now allowed.

PedroMTQ commented 3 years ago

@xiekunwhy, could you please confirm whether you still had zombie processes when you successfully ran setup_databases? I have opened an issue for this but can't reproduce the problem. I think the zombie processes were just the NCBI infinite loop, but I'm not entirely sure.