A few questions - Githubissues

Hi,

Currently, I am looking for a way to split phylogenetic trees into their separate branches and retrieve the corresponding fastas, in order to make a rather large database. Which is surprisingly difficult apparently. I then came across your program which seems to be able exactly what I want, including creating HMMs (and cutoff points) afterwards.

A few questions tho. 1) I have no experience in matlab, so I was wondering if it possible to run the program in a loop over a multitude of input trees. Either in linux command line or within matlab. With that, it is possible to circumvent the manual input? I am willing to trade in some precision for speed. 2) What is the difference between the training fasta and target fasta? 3) In the training graph, what is on the X-axis? 4) In the manual V1 it says there is a folder called 'Additional'. But I can't locate it. 5) When running Classification I'm getting errors on start, but my lack of Matlab means I don't even know whether it is the files, program or dependencies that are causing problems.(I do have yet to clean up the fasta headers):

Error using matlab.system.internal.executeCommand Arguments must contain a character vector.

Error in dos (line 66) [varargout{1:nargout}] = matlab.system.internal.executeCommand(varargin{:});

Error in HMMerCTTer_Application>HMMerCTTer_Application_OpeningFcn (line 86) Folder_Group='Groups'; system(['mkdir ' Folder_Application '/' Folder_Group]');

Error in gui_mainfcn (line 220) feval(gui_State.gui_OpeningFcn, gui_hFigure, [], guidata(gui_hFigure), varargin{:});

Error in HMMerCTTer_Application (line 42) gui_mainfcn(gui_State, varargin{:});

Error in Classification>Accept_Callback (line 125) HMMerCTTer_Application(FolderClustering, FolderClassification,ApSeqFile,Aligment_Program,Prefix);

Error in gui_mainfcn (line 95) feval(varargin{:});

Error in Classification (line 42) gui_mainfcn(gui_State, varargin{:});

Error in matlab.graphics.internal.figfile.FigFile/read>@(hObject,eventdata)Classification('Accept_Callback',hObject,eventdata,guidata(hObject))

Error while evaluating DestroyedObject Callback.

Thanks !

Geert.

Dear Geert,

Let me start with a few questions, since your initial description is not completely clear to me. 1 What is exactly your objective? Split a tree and obtain corresponding sub-multi-fastas is not a big deal. You could do it using Dendroscope and a fetching script. But since you state you want HMM profiles with cut-off, I suspect HMMERCTTER is indeed what you need You need a meaningful division!!. 2 Just to be sure: you read the paper and started with a clustering and used the clustering output as input for your classification? That is how it is supposed to work!

Then answers Q1 In order to put it in a loop running several cases (trees with fastas) it would need to be automated. We are working hard on version 2, that is written in Python and will be fully automated. The problem is that this is very hard. The ACD training tree has at least 20 Million different partitions at a coverage of 95%, which partition is the one you want? It is not a question of loosing a bit of accuracy, we need to map what is happening and then apply heuristics. We are making progress but my guess is that we will at least need another half a year. Q2 So this is why I doubt you read the paper. HMMERCTTER clusters a trainingset (Hifi phylogeny with corresponding fasta) and uses that clustering to classify target sequences (could be any sequence set). If you are not familiar with the difference between clustering and classification: the first groups items without prior knowledge, the latter groups items to an existing partition (=combination of clusters). Q3 It is actually a combined graph, I presume you refer to the red line? That is the differential score (hence it shows you sharp drops and helps deciding where to put the cut-off: a good group is well separated by score, hence the cutoff occurs prior a large score-drop. Q4 Sorry for that, we forgot. We will put it ASAP in a repository. It is named FastaHeaderConvert.py (hence it is python not PERL) and converts the header to a ten-digit number. It also creates a csv with the corresponding relations. Q5 So here I need have an answer on my question 2, since if you just start right away with a classification, it seems we have an answer. I am also not familiar with MatLab so I would need to ask the first author, who is in Ireland and no longer part of our research group.

Groetjes Arjen

The script can be found here: https://github.com/BBCMdP/Accessory_Scripts

Hallo Arjen,

Thank you for your reply.

To start with your questions. 1) The initial, global idea was to make a tree for a specific protein and separate the branches. The fasta files of each branch would then be converted into hmms. I was not yet at the part of looking into a cut-off point, so that was an added bonus here. Second bonus was that this program had a clear clustering method, which also saved me the future headache of deciding how to break up the tree.

After I manage to do 1 protein, I planned on doing 8000 more. So I would end up with a database with greater HMM resolution for each protein.

2) I did read the paper (especially the method part and S1 Appendix), but it was not all clear to me, hence the questions. Bear in mind that making trees and clustering isn't my normal cup of tea. But I started on a path that let me here.

So let me explain what I did so far. I took a protein with a certain metabolic function (for instance alcoholdehydrogenase), downloaded corresponding entries from Uniprot. Those entries I have aligned and made into a tree. At this point, I had entered the results into the first part of HMMRCTTR(clustering), which resulted in 24 txt files containing the fasta headers of different groups(clusters) of ADH, along with some other files. (no fasta or hmm, save for one, which appears to be the initial training HMM).

I then start the second part (classification), which gives me the error I provided at the start of this thread.

From the paper and trials that I did, I understood that this is the part where the HMM profiles are being created and a cut-off score is calculated, along with some refinement of the clusters.

It is also here that my confusion arises between the training and target sequence, since I am working with only one dataset that I am trying to cluster and classify. I'm feeling that I am missing something in my understanding of either the pipeline or purpose.

Lastly about Q3, I was referring to the numbers. Do they depict the number of bps, number of entries or a completely different value altogether?

Thanks again!

Groet, Geert

Hi Geert,

Great to find somebody that shares our interest. I am not sure however if you realize the task you are up to. Besides that we are busy on version two, in the hands of PhD student Agustín Amalfitano, we also work on making a phylogenomics database similar to Panther, PhD student Nico Stocchi. So I have two students doing this for their PhD thesis. So, what you want is no easy feat.

To add a bit of our initial objective: sequence mining of superfamsilies is hard, very hard. This in particular when your superfamily is complex (e.g. P450). The major problem is however pseudogenes and incorrect genemodels. HMMERCTTER allows you to do a sequence mining on valid high quality datasets, make a high quality MSA and high quality phylogeny. A hifi phylogeny will take about 3 to 6 months of work. Poor phylogenies will not work, as demonstrated by the ACD case.

What "Clustering" should output is two directory and a number of files. All inside a folder you name in the initial interface.

[image: image.png] In order to run the classification you need to point to the folder un level up. The actual data are in Training_Output that contains the images made by dendroscope and simple textfiles that contain the codes of the sequences. The Clustering script did make the HMMER profiles, but it threw them away. So indeed Classificatio starts again by fetching the sequences, aligning, hmmbuild et cetera. Classification merges your target and training dataset, does the above and then starts to do hmmsearch and it will add additional sequences that score above the threshold. The tricky part is the clustering, if there is a discongruence between the codes, you get all sorts of errors. But it seems your clustering worked well, hence I do not get the errors.

Can you compress the clustering output folder send it to my mail tenhave.arjen@gmail.com ? We ll work it out further by mail, once solved I will put a resume on the Github page.

Cheers Arjen

On Thu, Oct 11, 2018 at 12:12 PM gcremers notifications@github.com wrote:

Hallo Arjen,

Thank you for your reply.

To start with your questions.

The initial, global idea was to make a tree for a specific protein and separate the branches. The fasta files of each branch would then be converted into hmms. I was not yet at the part of looking into a cut-off point, so that was an added bonus here. Second bonus was that this program had a clear clustering method, which also saved me the future headache of deciding how to break up the tree.

After I manage to do 1 protein, I planned on doing 8000 more. So I would end up with a database with greater HMM resolution for each protein.

I did read the paper (especially the method part and S1 Appendix), but it was not all clear to me, hence the questions. Bear in mind that making trees and clustering isn't my normal cup of tea. But I started on a path that let me here.

So let me explain what I did so far. I took a protein with a certain metabolic function (for instance alcoholdehydrogenase), downloaded corresponding entries from Uniprot. Those entries I have aligned and made into a tree. At this point, I had entered the results into the first part of HMMRCTTR(clustering), which resulted in 24 txt files containing the fasta headers of different groups(clusters) of ADH, along with some other files. (no fasta or hmm, save for one, which appears to be the initial training HMM).

I then start the second part (classification), which gives me the error I provided at the start of this thread.

From the paper and trials that I did, I understood that this is the part where the HMM profiles are being created and a cut-off score is calculated, along with some refinement of the clusters.

It is also here that my confusion arises between the training and target sequence, since I am working with only one dataset that I am trying to cluster and classify. I'm feeling that I am missing something in my understanding of either the pipeline or purpose.

Lastly about Q3, I was referring to the numbers. Do they depict the number of bps, number of entries or a completely different value altogether?

Thanks again!

Groet, Geert

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BBCMdP/HMMERCTTER/issues/1#issuecomment-428990632, or mute the thread https://github.com/notifications/unsubscribe-auth/AjRt7FRUVL11GKHtKya1AanE62e-RLBPks5uj1_JgaJpZM4W802w .

-- Dr. Arjen ten Have Bioinformatics and Biocomputation IIB-CONICET-UNMdP Mar del Plata Argentina

https://twitter.com/DarwinianAscent http://www.scoop.it/t/bioinformatics-comparative-genomics-and-molecular-evolution http://www.scoop.it/t/darwinian-ascension

BBCMdP / HMMERCTTER

A few questions #1