debbiemarkslab / EVcouplings

Evolutionary couplings from protein and RNA sequence alignments
http://evcouplings.org

protein ids as input #170

Closed omranian closed 6 years ago

omranian commented 6 years ago

Hi,

I would like to use your package to calculate EVcomplex scores for a list of protein pairs. Something similar to this: https://evcomplex.hms.harvard.edu/results/test_2018-05-17_130958_3381 but automated, since my list is big and I don't want to do it online.

Is it possible to give protein names as input instead of multiple alignment results, and get the complex scores for each pair of proteins? If so, I would greatly appreciate your help in pointing me to the functions I can use to compute the complex scores.

Looking forward to your response.

Best, Nooshin

aggreen commented 6 years ago

Hi Nooshin,

In order to calculate EVcomplex scores for lists of proteins without using the online server, you'll need to download and install the EVcouplings package and set up your own small script to submit the jobs. This will require a bit of overhead to get set up, but luckily it's exactly what we've designed our software to do.

Step 1: Follow the instructions here to download and install the package and required dependencies: https://github.com/debbiemarkslab/EVcouplings/

Step 2: Read the following notebook to learn how to submit jobs for the complexes pipeline using our configuration files: https://github.com/debbiemarkslab/EVcouplings/blob/develop/notebooks/running_jobs_complexes.ipynb

Step 3: Write your own small script (we recommend using Python) to generate a configuration file for each pair of proteins you wish to run. The notebook in step 2 shows you how to programmatically edit a configuration file using Python. I imagine this script will take the list of ids as input.

Step 4: Submit all of the configuration files you have generated using our command-line app. This can also be done with a script so that you don't have to submit each one individually.

I hope you find that helpful. I'm going to close your issue now but feel free to respond to this if you have further questions. Anna
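
Steps 3 and 4 could be sketched roughly as follows. This is a minimal illustration, not the maintainers' script: the section and key names (`global`/`prefix`, `align_1`/`sequence_id`, `align_2`/`sequence_id`), the `evcouplings_runcfg` command name, and both helper functions are assumptions to check against the sample complex configuration file and the notebooks. The template is a plain dict here; in practice it would be loaded from, and written back to, a YAML configuration file.

```python
import copy


def make_pair_configs(template, pairs, output_dir="output"):
    """Return {job_name: config} with one config per protein pair.

    The key names used below are assumptions for illustration; verify
    them against the sample complex configuration file.
    """
    configs = {}
    for first_id, second_id in pairs:
        job_name = "{}_{}".format(first_id, second_id)
        config = copy.deepcopy(template)  # never mutate the shared template
        config["global"]["prefix"] = "{}/{}".format(output_dir, job_name)
        config["align_1"]["sequence_id"] = first_id
        config["align_2"]["sequence_id"] = second_id
        configs[job_name] = config
    return configs


def submission_commands(config_paths, app="evcouplings_runcfg"):
    """Build one command line per config file (the app name is an assumption)."""
    return [[app, path] for path in config_paths]


# Hypothetical template and identifiers, for illustration only.
template = {
    "global": {"prefix": None},
    "align_1": {"sequence_id": None},
    "align_2": {"sequence_id": None},
}
pairs = [("P00001", "P00002"), ("P00001", "P00003")]
configs = make_pair_configs(template, pairs)
commands = submission_commands(["{}.txt".format(name) for name in sorted(configs)])
```

Each entry in `commands` could then be handed to `subprocess.run` or to a cluster scheduler, once the corresponding config has been written to disk.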

omranian commented 6 years ago

Hi Anna,

Thanks a lot for your response. It was really helpful.

I cannot run evcouplings_dbupdate to get the required databases due to our machine which does not support any graphics.

How can I obtain the databases with the SIFTS-based structure mapping tables?

Thanks and best, Nooshin

omranian commented 6 years ago

I also could not find the notebook from step 2 that shows how to programmatically edit a configuration file using Python.

aggreen commented 6 years ago

Hi Nooshin,

No problem, we're happy to help.

Can you please elaborate on what you mean by "cannot run evcouplings_dbupdate to get the required databases due to our machine which does not support any graphics"? evcouplings_dbupdate is a command-line executable which generates text files, so there should be no graphics required. If you tried to execute the command, can you share the error message? If you are unable to execute this command I can see about creating a static repository for these files to be downloaded. They are large so we'd prefer not to host them if possible.

Regarding the notebook for step 2, that was my mistake. It is found at https://github.com/debbiemarkslab/EVcouplings/blob/develop/notebooks/running_jobs.ipynb, which explains how to run monomer jobs. Any syntax that is the same for complexes is not repeated in both notebooks.

Anna

omranian commented 6 years ago

Hi Anna,

Thanks a lot for your response.

I finally got it running with your help. I downloaded uniref100, and I think this should be fine for now.

I'm testing the program on a single pair, and it is indeed very slow (I'm still waiting for my test run). I need EVcomplex scores for about 2,000,000 pairs (I'm doing a whole-genome analysis of more than 20,000 genes, taking only the top 100 pairs for every gene).

Could you help me run the program faster? I used the parallel package to run the program for 20 pairs at a time, and I also set 10 CPUs inside the program, but this is not enough.

Is your algorithm at all feasible for high-throughput analysis?

Thank you very much All the best, Nooshin

omranian commented 6 years ago

There is one more issue. As I said, the test was running, and I just got the following error:

evcouplings.utils.config.MissingParameterError: Missing required parameters: hhfilter

But I don't want any filtering, and I followed the comments in the config file as follows:

# Filter sequence alignment at this % sequence identity cutoff. Can be used to cut computation time in
# the couplings stage (e.g. set to 95 to remove any sequence that is more than 95% identical to a sequence
# already present in the alignment). If blank, no filtering. If filtering, HHfilter must be installed.
seqid_filter: 

and I didn't give any path to HHfilter. Could you please help me solve this error?

Thanks and looking forward, Nooshin

aggreen commented 6 years ago

Hi Nooshin,

Regarding the missing parameter error, I think it occurs because you need to provide the parameter hhfilter, but you can leave its value blank. I know it can be confusing, but we require the user to explicitly define (or leave blank) all parameters so that there are no hidden defaults. I've put a text snippet below with the example.

Regarding speeding up your analysis, I have a couple of recommendations. The two computationally intensive steps are running the sequence alignment using jackhmmer and inferring the couplings using PLMC. Both of these applications are well optimized, so unfortunately I don't think there's much speedup to be gained there. However, you could save time by computing the sequence alignment for each monomer individually before running your all vs. all analysis. This would mean you only have to run each alignment once instead of multiple times. More importantly, it would probably be a good idea to filter your list of all possible interactions in some way. I'm not sure what your final application is, but maybe you could choose only a subset of genes related to your application of interest, or choose only gene pairs that you think are likely to interact based on some other experimental or computational measure.

Sorry that I can't be more helpful there, but unfortunately constructing a sequence alignment and inferring evolutionary couplings are both fundamentally computationally intensive tasks. I hope this can help you go in the right direction.
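
To make the suggestion about aligning each monomer once more concrete, the sketch below counts how many alignment runs a pair list needs with and without reuse. The function name and the protein identifiers are hypothetical.

```python
def alignment_workload(pairs):
    """Compare alignment runs needed per pair versus per unique monomer."""
    unique_monomers = {protein for pair in pairs for protein in pair}
    return {
        # Both partners are aligned again for every pair.
        "per_pair_runs": 2 * len(pairs),
        # Each protein is aligned once and the result is reused.
        "per_monomer_runs": len(unique_monomers),
    }


pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
workload = alignment_workload(pairs)
```

For the roughly 2,000,000 pairs over about 20,000 genes mentioned earlier in the thread, this would be on the order of 4,000,000 alignment runs versus about one per gene. The couplings inference still has to run once per pair, so only the alignment stage benefits.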

Anna

tools:
  jackhmmer: /n/groups/marks/pipelines/evcouplings/software/hmmer-3.1b2-linux-intel-x86_64/binaries/jackhmmer
  hmmbuild: /n/groups/marks/pipelines/evcouplings/software/hmmer-3.1b2-linux-intel-x86_64/binaries/hmmbuild
  hmmsearch: /n/groups/marks/pipelines/evcouplings/software/hmmer-3.1b2-linux-intel-x86_64/binaries/hmmsearch
  plmc: /n/groups/marks/pipelines/evcouplings/software/plmc/bin/plmc
  hhfilter:
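
The distinction between a missing parameter and a blank one can be illustrated with a small check. This is only a sketch of the idea, not the EVcouplings implementation: `missing_parameters` and the dictionaries are hypothetical, and a blank YAML value is assumed to load as `None`.

```python
def missing_parameters(section, required):
    """Return required keys that are absent; keys present with a blank (None) value pass."""
    return [key for key in required if key not in section]


required = ["jackhmmer", "plmc", "hhfilter"]

# hhfilter omitted entirely: reported as missing.
tools_without_key = {"jackhmmer": "/path/to/jackhmmer", "plmc": "/path/to/plmc"}
# hhfilter listed but left blank: accepted.
tools_with_blank = dict(tools_without_key, hhfilter=None)

missing_before = missing_parameters(tools_without_key, required)
missing_after = missing_parameters(tools_with_blank, required)
```

The check looks only at whether the key exists, which is why adding `hhfilter:` with no value is enough to satisfy it.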

omranian commented 6 years ago

Hi Anna, Thank you very much for your prompt response.

Your first suggestion is what I'm already doing: I compute the alignment for every single gene first and then use those alignments for the EV scores. But this is not as easy as I thought :) I have to set all the parameters again for the EV scores, and I have to deal with reading and writing the config file. I still need to experiment with the config file, but that's fine, and I will get back to you if I run into any problems. Regarding the pairs, I have already restricted the list to the top 100 pairs for every gene, and I cannot go below that. But you are totally right, these tasks are computationally very expensive.

Regarding the filter, many many thanks. It helped a lot and I'm waiting for the result :)

Once again, thank you very much.

All the best, Nooshin

omranian commented 6 years ago

Hi Anna,

Could you please help me with this error?

AttributeError: 'DataFrame' object has no attribute 'uniprot_ac'

Thank you very much All the best, Nooshin

Traceback (most recent call last):
  File "", line 1, in
  File "/apps/devel/python/3.5.1/lib/python3.5/site-packages/evcouplings/utils/pipeline.py", line 174, in execute
    outcfg = runner(incfg)
  File "/apps/devel/python/3.5.1/lib/python3.5/site-packages/evcouplings/complex/protocol.py", line 575, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/apps/devel/python/3.5.1/lib/python3.5/site-packages/evcouplings/complex/protocol.py", line 520, in best_hit
    outcfg = _run_describe_concatenation(outcfg, **kwargs)
  File "/apps/devel/python/3.5.1/lib/python3.5/site-packages/evcouplings/complex/protocol.py", line 88, in _run_describe_concatenation
    outcfg["concatentation_statistics_file"]
  File "/apps/devel/python/3.5.1/lib/python3.5/site-packages/evcouplings/complex/protocol.py", line 179, in describe_concatenation
    embl_cds2 = len(list(set(genome_location_table_2.uniprot_ac)))
  File "/apps/devel/python/3.5.1/lib/python3.5/site-packages/pandas/core/generic.py", line 2672, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'uniprot_ac'

aggreen commented 6 years ago

Hi Nooshin,

Unfortunately, this is a known bug that I'm working on fixing. Hopefully I will push a fix by the end of the day (EST), at which point you will have to update your installation. Thanks for your patience.

Anna

aggreen commented 6 years ago

Hi Nooshin,

The fix for this has been tested and merged. If you update your current installation from the evcouplings development branch, the issue should be fixed. The development branch will be released as the new master branch soon.

Anna

omranian commented 6 years ago

Hi Anna,

Many thanks for your consideration.

What is the easiest way to update my current installation? I probably need to ask our IT team to do it, but they usually ask for the exact command to run.

Sorry for the trouble, and I look forward to your reply. All the best, Nooshin
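
One possible command for this, offered as an assumption rather than the maintainers' documented procedure, is a pip upgrade straight from the `develop` branch of the repository linked earlier in the thread. The sketch only builds the command so it can be reviewed or handed to an IT team; the function name is hypothetical, the machine is assumed to have git available, and the repository's own installation instructions remain the authority.

```python
import sys

REPO_URL = "https://github.com/debbiemarkslab/EVcouplings"


def upgrade_command(branch="develop", python=sys.executable):
    """Build a pip command that reinstalls the package from a git branch."""
    source = "git+{}.git@{}".format(REPO_URL, branch)
    return [python, "-m", "pip", "install", "--upgrade", source]


command = upgrade_command()
printable = " ".join(command)  # single string to paste into a request
```

Invoking pip through the chosen interpreter with `-m pip` ensures the package is updated in the same Python installation that runs the pipeline.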