ebi-pf-team / interproscan

Genome-scale protein function classification
Apache License 2.0

Domain sequence generation from location indexes #60

Closed a-ag closed 6 years ago

a-ag commented 6 years ago

When multiple JSON objects are present within locations, should the order in which these objects appear be used to generate the domain sequence? Or should the objects be sorted in ascending order first and then used for domain sequence generation?

"locations" : [ { "start" : 114, "end" : 143, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 17, "end" : 46, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 313, "end" : 342, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 226, "end" : 255, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" } ]

anadon commented 6 years ago

Assuming stable processing of input, why would order be significant to you? Typically one does not make any assumptions about output ordering unless otherwise specified.

a-ag commented 6 years ago

I want to extract the domain sequence which, as shown in the example above, is split into chunks across the protein. The sequence could be extracted either as (114-143) + (17-46) + (313-342) + (226-255), or as (17-46) + (114-143) + (313-342) + (226-255), and I am unsure which method to follow. I am not making any assumptions, just clarifying for the sake of accuracy.

gsn7 commented 6 years ago

the location objects are not ordered. ordering them is easy: for example, given locations A and B, if A.start < B.start then the order is A, B; otherwise it is B, A
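
A minimal sketch of that ordering rule in Python, assuming the locations have already been parsed into dictionaries with `start` and `end` keys as in the JSON above. The helper name `sort_locations` is hypothetical and not part of InterProScan.

```python
# Location objects from the example above, reduced to their coordinates.
locations = [
    {"start": 114, "end": 143},
    {"start": 17, "end": 46},
    {"start": 313, "end": 342},
    {"start": 226, "end": 255},
]


def sort_locations(locs):
    """Return a new list ordered by ascending start (ties broken by end).

    Hypothetical helper: the input list is left unchanged.
    """
    return sorted(locs, key=lambda loc: (loc["start"], loc["end"]))


ordered = sort_locations(locations)
starts = [loc["start"] for loc in ordered]
```

Sorting only gives a stable, reproducible order for downstream processing. It does not by itself say whether the regions belong together.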

a-ag commented 6 years ago

To reiterate: when extracting the domain sequence from the protein sequence using the JSON output (in cases where there are multiple location objects within the same match object), should I first order the locations by their indexes and then perform substring extraction? And does the order in which the location objects occur have no significance? I apologize for repeating the question, I just want to make sure that I am doing this part correctly. Thanks a lot for the answer @gsn7.

gsn7 commented 6 years ago

i might be misunderstanding your question.

if you use interproscan to characterise a sequence say MGNSGVIVL and interproscan tells you there is some_other_named domain match at locations 3-6 of the sequence. then that is all interproscan is telling you. interproscan is not telling you anything else, not that other parts of the sequence are insignificant or significant.
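
As a small illustration of reading one reported location, here is a sketch that pulls positions 3-6 out of the example sequence. It assumes the coordinates are 1-based and inclusive, and `extract_region` is a hypothetical helper, not an InterProScan function.

```python
def extract_region(sequence, start, end):
    """Return residues start..end of sequence.

    Assumption: start and end are 1-based and inclusive.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("location falls outside the sequence")
    # Convert the 1-based inclusive range to a Python slice.
    return sequence[start - 1:end]


region = extract_region("MGNSGVIVL", 3, 6)
```

Under that assumption the extracted region is four residues long (`NSGV`).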

the following tutorials might help to understand the output you get from interproscan: https://www.ebi.ac.uk/training/online/course/interpro-quick-tour https://www.ebi.ac.uk/training/online/course/interpro-functional-and-structural-analysis-protei

a-ag commented 6 years ago

I understand that part. However, my question is, what if there are multiple location objects within the json for a particular match.

e.g.{ "sequence" : "MTKLFAPAAPITTTLLVEGMHCGGCTSRVEQALAQVPGVTGAVADLAAGTATVAAASAIDTARLVAALDAAGYRATVATAPAATGNADARHGRARDEDDDAAAAPHTAAVTLTIGGMTCGGCARRVEQALAAVRGVADAKVDLATTSAKASVARDVDSQTLVAAVEQAGYRANVVRDARAEAAPKPAACPFEDAARSAAPAAAFAVDESSAASPERVATQSFEFDIAGMTCASCVGRVEKALAQVPGVVRATVNLATEKAAVDADADAHVDTARLIDAVKRAGYRASPVSDPASALAPSPEIAAARTAIELDIAGMTCASCVGRVEKALAQVPGVARATVNLATEKATVDADADAHVDTARLIDAVKRAGYRASPAIAACAPASRATATADAAAARPASPSADDRKLAEARRERALVIASAVLTTPLALPMFAAPFGVDAALPAWLQLALASIVQFGFGARFYRAAWHALKARAGNMDLLVALGTSAAYGLSIWLMLRDPGHAAHLYFEASAVIVTLVRFGKWLEARAKRQTTDAIRALNALRPDRARIVEHGVERDVPLAQVRVGTVVRVLPGERVPVDGRIEAGVTHVDESLITGESLPVPKGPGERVTAGSINGEGALTVATTAIGAETTLARIIRLVESAQAEKAPIQRLVDRVSAVFVPAIVAIAFATFAGWLVAGAGVETAILNAVAVLVIACPCALGLATPAAIMAGTGVAARHGVLIKDAQALELAQRARIVAFDKTGTLTQGRPTVTAFDAIGIPRGDALALAAAVQRASAHPLARAVVAAFDADADARRSSLAAAHADTPRAVAGRGVEARVDARLLALGSTRWRDELGIAVPDGVARRAAALEAAGNTVSWLMRADAPREALALVAFGDTVKPNARRAIERLAARGIRSALVTGDNRGSATAVAASLGIDEVHAQVLPDDKARVVAQLKATAGDGAVAMVGDGINDAPALAAADLGIAMATGTDVAMHTAGITLMRGDPALVADAVDISRRTYRKIQQNLFWAFVYNLVGIPLAALGWLNPMIAGAAMAFSSVSVVTNALLLRRWKGDAR", "md5" : "ff8676b457deaac060907e6e96b1fd07", "matches" : [ { "signature" : { "accession" : "PS01047", "name" : "HMA_1", "description" : "Heavy-metal-associated domain.", "type" : null, "signatureLibraryRelease" : { "library" : "PROSITE_PATTERNS", "version" : "20.132" }, "models" : { "PS01047" : { "accession" : "PS01047", "name" : "HMA_1", "description" : "Heavy-metal-associated domain.", "key" : "PS01047" } }, "entry" : { "accession" : "IPR017969", "name" : "Heavy-metal-associated_CS", "description" : "Heavy-metal-associated, conserved site", "type" : "CONSERVED_SITE", "goXRefs" : [ { "identifier" : "GO:0030001", "name" : "metal ion transport", "databaseName" : "GO", "category" : "BIOLOGICAL_PROCESS" }, { "identifier" : "GO:0046872", "name" : "metal ion binding", "databaseName" : "GO", "category" : "MOLECULAR_FUNCTION" } ], "pathwayXRefs" : [ ] } }, "locations" : [ { "start" : 114, "end" : 143, "level" 
: "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 17, "end" : 46, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 313, "end" : 342, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 226, "end" : 255, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" } ] } ], "crossReferences" : [ { "identifier" : "ERR341275_01529", "name" : "copper-translocating P-type ATPase", "databaseName" : null, "description" : null } ], "sequenceLength" : 1061 }

If you look at the above JSON example, locations is a JSON array within the match object (I am assuming that in such a case the domain sequence is not present as one contiguous sequence but in chunks distributed over the protein). My question is: in this case, how do I extract the domain sequence? Specifically, would the domain sequence be:

  1. "substring(114-143) + substring(17-46) + substring(313-342) + substring(226-255)", OR
  2. "substring(17-46) + substring(114-143) + substring(226-255) + substring(313-342)"?

I tried to look at the source database as well, but since domain sequence is expressed as a regex, it's difficult to be sure. I also looked through the tutorials, but such an example has not been discussed.

Again, @gsn7 @anadon thanks for helping me out with the same.

anadon commented 6 years ago

But that is what we are getting at. That ordering you are asking about isn't guaranteed.

What is the thing you are trying to use this with? It must be for some kind of script, right?

a-ag commented 6 years ago

Yes, the script simply aims to extract the domain sequences for a particular protein sequence. In cases where locations has multiple JSON objects, I am confused whether to extract all substrings and combine them to form the domain sequence, or whether each substring is a domain by itself. And if I need to combine them, should I sort the JSON objects within locations in ascending order and then combine?

anadon commented 6 years ago

Let me preface, @gsn7 knows this stuff way better than I do.

Based on the output format, they only make sense if each grouping is independent. So each would need to be extracted, and operated on. If you can, it might help us if you posted your script somewhere. Just make sure it is permissible to do so.
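
A rough sketch of that "each grouping is independent" reading, assuming a record shaped like the JSON posted above and 1-based inclusive coordinates. The toy sequence, the accession `SIG_X`, the coordinates and the function name `regions_per_match` are all made up for illustration and are not real InterProScan output or API.

```python
import json

# Toy record with the same shape as the JSON in this thread.
# Sequence, accession and coordinates are invented for illustration.
record = json.loads("""
{
  "sequence": "MGNSGVIVLMGNSGVIVL",
  "matches": [
    {"signature": {"accession": "SIG_X"},
     "locations": [{"start": 12, "end": 15}, {"start": 3, "end": 6}]}
  ]
}
""")


def regions_per_match(rec):
    """Map each signature accession to its independently extracted regions."""
    seq = rec["sequence"]
    result = {}
    for match in rec["matches"]:
        accession = match["signature"]["accession"]
        # Sort only for reproducible output; nothing is concatenated.
        for loc in sorted(match["locations"], key=lambda l: l["start"]):
            # Assumption: 1-based inclusive coordinates.
            fragment = seq[loc["start"] - 1:loc["end"]]
            result.setdefault(accession, []).append(
                (loc["start"], loc["end"], fragment)
            )
    return result


regions = regions_per_match(record)
```

Each tuple keeps its own coordinates next to the extracted substring, so every location can be operated on separately.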

a-ag commented 6 years ago

@gsn7 Can you comment on whether groupings are independent or combine together to form the domain sequence? Please see the above four comments. Thanks in advance.

gsn7 commented 6 years ago

ok, i think the biological question you are trying to ask is whether interproscan can distinguish between continuous and discontinuous domains. the answer, currently, is no. in a few months' time we should be able to partially model discontinuous domains, but this will be explicit in the output.

ghost commented 6 years ago

Thank you @anadon and @gsn7 for your answers above. I'm working with @akshay0193 on this. Let me see if I can summarize what we think the truth to be.

  1. InterproScan does not distinguish between contiguous (e.g. one location) and discontinuous (e.g. multi-location) domains.
  2. A domain entry with multiple locations is only showing the multiple regions that match the regex in the domain reference.
  3. Therefore, in the output, the order of the locations for a discontinuous domain does not currently allow for concatenation of subregions of the input protein sequence into a single domain sequence.
  4. The above means that, for discontinuous domains, the multiple locations are a set in which order does not matter, not a series of locations.

Does that sound correct? And for our understanding, for discontinuous domains, how is the order of the locations determined: is it just based on whatever matches first?

gsn7 commented 6 years ago
  1. correct
  2. correct, and the majority of our analyses use profile HMMs to find domains
  3. correct, we don't have any information to determine order
  4. correct

determining domain boundaries for discontinuous domains varies. in general, this is done as part of the building of the profile HMM by the member databases of InterPro and uses MSA, 3D structures, etc. in a straightforward case, say domain B (10-14) is an insertion into domain A (4-20), Interproscan will in future report two domains: domain A (4-9)+(15-20) and domain B (10-14)
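
The arithmetic in that insertion example can be sketched as follows. This only illustrates how the two fragments of domain A fall out of the coordinates, assuming inclusive ranges. It is not how InterProScan or the member databases determine boundaries, and `split_around_insertion` is a hypothetical name.

```python
def split_around_insertion(outer, inner):
    """Return the fragments of outer that remain once inner is removed.

    Both arguments are (start, end) tuples with inclusive coordinates,
    and inner must lie entirely within outer.
    """
    o_start, o_end = outer
    i_start, i_end = inner
    if not (o_start <= i_start <= i_end <= o_end):
        raise ValueError("inner range must lie within outer range")
    fragments = []
    if i_start > o_start:
        # Part of the outer domain before the insertion.
        fragments.append((o_start, i_start - 1))
    if i_end < o_end:
        # Part of the outer domain after the insertion.
        fragments.append((i_end + 1, o_end))
    return fragments


# Domain B (10-14) inserted into domain A (4-20).
domain_a = split_around_insertion((4, 20), (10, 14))
```

For the coordinates above this gives domain A as (4-9) and (15-20), with domain B left as (10-14).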

ghost commented 6 years ago

This is very helpful. Thank you for the additional clarifications. We look forward to the future updates.