Closed a-ag closed 6 years ago
Assuming stable processing of input, why would order be significant to you? Typically one does not make any assumptions about output ordering unless otherwise specified.
I want to extract the domain sequence, which, as the example shows, is split into chunks across the protein. The sequence could be extracted as (114-143) + (17-46) + (313-342) + (226-255), or as (17-46)+(114-143)+(313-342)+(226-255), and I am unsure which order to follow. I am not making any assumptions; I am asking for the sake of accuracy.
The location objects are not ordered. Ordering them is easy: given locations A and B, if A.start < B.start, the order is A, B; otherwise it is B, A.
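As a minimal sketch of that comparison, the location objects can be sorted by their `start` field. The helper name `sort_locations` is hypothetical, and the dictionaries below carry only the `start` and `end` keys from the JSON output discussed in this thread.

```python
# Hypothetical helper: order location dicts by "start" (ties broken by "end").
def sort_locations(locations):
    return sorted(locations, key=lambda loc: (loc["start"], loc["end"]))

# Coordinates taken from the match discussed in this thread.
locations = [
    {"start": 114, "end": 143},
    {"start": 17, "end": 46},
    {"start": 313, "end": 342},
    {"start": 226, "end": 255},
]

ordered = sort_locations(locations)
print([(loc["start"], loc["end"]) for loc in ordered])
```

`sorted` returns a new list, so the order found in the parsed JSON is left untouched.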
To reiterate: when extracting the domain sequence from the JSON output (in cases where a single match object contains multiple location objects), should I first order the locations by index and then extract the substrings from the protein sequence? And does the order in which the location objects occur have no significance? I apologize for repeating the question; I just want to make sure that I am doing this part correctly. Thanks a lot for the answer, @gsn7.
I might be misunderstanding your question.
If you use InterProScan to characterise a sequence, say MGNSGVIVL, and InterProScan tells you there is a some_other_named domain match at locations 3-6 of the sequence, then that is all InterProScan is telling you. It is not telling you anything else, such as whether other parts of the sequence are significant or insignificant.
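For the MGNSGVIVL example, a short sketch of pulling out the matched region could look like this. It assumes that `start` and `end` are 1-based and inclusive, which is worth checking against the InterProScan documentation; `extract_region` is a hypothetical helper name.

```python
def extract_region(sequence, start, end):
    """Return residues start..end, assuming 1-based inclusive coordinates."""
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]

region = extract_region("MGNSGVIVL", 3, 6)
print(region)  # NSGV
```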
The following tutorials might help to understand the output you get from InterProScan: https://www.ebi.ac.uk/training/online/course/interpro-quick-tour https://www.ebi.ac.uk/training/online/course/interpro-functional-and-structural-analysis-protei
I understand that part. However, my question is: what if the JSON contains multiple location objects for a particular match?
e.g.{ "sequence" : "MTKLFAPAAPITTTLLVEGMHCGGCTSRVEQALAQVPGVTGAVADLAAGTATVAAASAIDTARLVAALDAAGYRATVATAPAATGNADARHGRARDEDDDAAAAPHTAAVTLTIGGMTCGGCARRVEQALAAVRGVADAKVDLATTSAKASVARDVDSQTLVAAVEQAGYRANVVRDARAEAAPKPAACPFEDAARSAAPAAAFAVDESSAASPERVATQSFEFDIAGMTCASCVGRVEKALAQVPGVVRATVNLATEKAAVDADADAHVDTARLIDAVKRAGYRASPVSDPASALAPSPEIAAARTAIELDIAGMTCASCVGRVEKALAQVPGVARATVNLATEKATVDADADAHVDTARLIDAVKRAGYRASPAIAACAPASRATATADAAAARPASPSADDRKLAEARRERALVIASAVLTTPLALPMFAAPFGVDAALPAWLQLALASIVQFGFGARFYRAAWHALKARAGNMDLLVALGTSAAYGLSIWLMLRDPGHAAHLYFEASAVIVTLVRFGKWLEARAKRQTTDAIRALNALRPDRARIVEHGVERDVPLAQVRVGTVVRVLPGERVPVDGRIEAGVTHVDESLITGESLPVPKGPGERVTAGSINGEGALTVATTAIGAETTLARIIRLVESAQAEKAPIQRLVDRVSAVFVPAIVAIAFATFAGWLVAGAGVETAILNAVAVLVIACPCALGLATPAAIMAGTGVAARHGVLIKDAQALELAQRARIVAFDKTGTLTQGRPTVTAFDAIGIPRGDALALAAAVQRASAHPLARAVVAAFDADADARRSSLAAAHADTPRAVAGRGVEARVDARLLALGSTRWRDELGIAVPDGVARRAAALEAAGNTVSWLMRADAPREALALVAFGDTVKPNARRAIERLAARGIRSALVTGDNRGSATAVAASLGIDEVHAQVLPDDKARVVAQLKATAGDGAVAMVGDGINDAPALAAADLGIAMATGTDVAMHTAGITLMRGDPALVADAVDISRRTYRKIQQNLFWAFVYNLVGIPLAALGWLNPMIAGAAMAFSSVSVVTNALLLRRWKGDAR", "md5" : "ff8676b457deaac060907e6e96b1fd07", "matches" : [ { "signature" : { "accession" : "PS01047", "name" : "HMA_1", "description" : "Heavy-metal-associated domain.", "type" : null, "signatureLibraryRelease" : { "library" : "PROSITE_PATTERNS", "version" : "20.132" }, "models" : { "PS01047" : { "accession" : "PS01047", "name" : "HMA_1", "description" : "Heavy-metal-associated domain.", "key" : "PS01047" } }, "entry" : { "accession" : "IPR017969", "name" : "Heavy-metal-associated_CS", "description" : "Heavy-metal-associated, conserved site", "type" : "CONSERVED_SITE", "goXRefs" : [ { "identifier" : "GO:0030001", "name" : "metal ion transport", "databaseName" : "GO", "category" : "BIOLOGICAL_PROCESS" }, { "identifier" : "GO:0046872", "name" : "metal ion binding", "databaseName" : "GO", "category" : "MOLECULAR_FUNCTION" } ], "pathwayXRefs" : [ ] } }, "locations" : [ { "start" : 114, "end" : 143, "level" 
: "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 17, "end" : 46, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 313, "end" : 342, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 226, "end" : 255, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" } ] } ], "crossReferences" : [ { "identifier" : "ERR341275_01529", "name" : "copper-translocating P-type ATPase", "databaseName" : null, "description" : null } ], "sequenceLength" : 1061 }
If you look at the JSON example above, locations is a JSON array within the match object (I am assuming that in such a case the domain sequence is not present as one contiguous sequence but in chunks distributed over the protein). My question is: in this case, how do I extract the domain sequence? Specifically, whether the domain sequence would be:
I tried to look at the source database as well, but since the domain sequence is expressed as a regex, it's difficult to be sure. I also looked through the tutorials, but such an example is not discussed there.
Again, @gsn7 @anadon, thanks for helping me out with this.
But that is what we are getting at. That ordering you are asking about isn't guaranteed.
What is the thing you are trying to use this with? It must be for some kind of script, right?
Yes, the script simply aims to extract the domain sequences for a particular protein sequence. In cases where locations contains multiple JSON objects, I am unsure whether to extract all the substrings and combine them to form the domain sequence, or whether each substring is a domain by itself. And if I need to combine them, should I sort the JSON objects within locations in ascending order and then combine?
Let me preface, @gsn7 knows this stuff way better than I do.
Based on the output format, the locations only make sense if each grouping is independent, so each would need to be extracted and operated on separately. If you can, it might help us if you posted your script somewhere. Just make sure it is permissible to do so.
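A rough sketch of treating each location as an independent hit might look like the following. The record is a made-up toy with the same shape as the JSON posted in this thread (only the `sequence`, `matches`, `signature`, `accession`, `locations`, `start` and `end` keys are used), the function name `extract_hits` is hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
import json

# Toy record shaped like the JSON in this thread; the sequence and
# coordinates are invented for illustration only.
raw = """
{
  "sequence": "MGNSGVIVLACDEFGHIK",
  "matches": [
    {
      "signature": {"accession": "PS01047"},
      "locations": [
        {"start": 10, "end": 13},
        {"start": 3, "end": 6}
      ]
    }
  ]
}
"""

def extract_hits(record):
    """Yield (accession, start, end, subsequence), one tuple per location."""
    sequence = record["sequence"]
    for match in record["matches"]:
        accession = match["signature"]["accession"]
        # Sorting by start only makes the report easier to read;
        # each location is still handled on its own.
        for loc in sorted(match["locations"], key=lambda l: l["start"]):
            start, end = loc["start"], loc["end"]
            # Assumption: coordinates are 1-based and inclusive.
            yield accession, start, end, sequence[start - 1:end]

hits = list(extract_hits(json.loads(raw)))
for hit in hits:
    print(hit)
```

Each tuple stands for a separate occurrence of the signature; nothing is concatenated across locations.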
@gsn7 Can you comment on whether groupings are independent or combine together to form the domain sequence? Please see the above four comments. Thanks in advance.
OK, I think the biological question you are trying to ask is whether InterProScan can distinguish between continuous and discontinuous domains. The answer, currently, is no. In a few months' time we should be able to partially model discontinuous domains, but this will be explicit in the output.
Thank you @anadon and @gsn7 for your answers above. I'm working with @akshay0193 on this. Let me see if I can summarize what we think the truth to be.
Does that sound correct? And for our understanding, for discontinuous domains, how is the order of the locations determined: just based on whatever matches first?
How domain boundaries are determined for discontinuous domains varies. In general, this is done by the InterPro member databases as part of building the profile HMM, using MSAs, 3D structures, etc. In a straightforward case, say domain B (10-14) is an insertion into domain A (4-20), InterProScan will in future report two domains: domain A (4-9)+(15-20) and domain B (10-14).
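To illustrate the domain A / domain B case, here is a sketch of how the fragments of one discontinuous domain could be joined once they are reported explicitly. How such fragments will appear in the output is not known from this thread, so the list of `(start, end)` tuples, the helper `join_fragments`, and the 20-residue sequence are all hypothetical; coordinates are assumed to be 1-based and inclusive.

```python
def join_fragments(sequence, fragments):
    """Concatenate the fragments of a single domain in sequence order."""
    parts = []
    for start, end in sorted(fragments):
        # Assumption: 1-based, inclusive coordinates.
        parts.append(sequence[start - 1:end])
    return "".join(parts)

# Made-up 20-residue sequence, used only to show the slicing.
sequence = "ACDEFGHIKLMNPQRSTVWY"

domain_a = join_fragments(sequence, [(15, 20), (4, 9)])  # two fragments
domain_b = join_fragments(sequence, [(10, 14)])          # the inserted domain

print(domain_a)  # EFGHIKRSTVWY
print(domain_b)  # LMNPQ
```

Joining is only meaningful for fragments that the output marks as belonging to the same domain; multiple locations in the current output should not be joined this way.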
This is very helpful. Thank you for the additional clarifications. We look forward to the future updates.
When locations contains multiple JSON objects, should the order in which these objects appear be used to generate the domain sequence? Or should the objects be arranged in ascending order first and then used to generate the domain sequence?
"locations" : [ { "start" : 114, "end" : 143, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 17, "end" : 46, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 313, "end" : 342, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" }, { "start" : 226, "end" : 255, "level" : "NONE", "cigarAlignment" : "Not available", "alignment" : "Not available" } ]