PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License
6 stars 5 forks source link

Truncated entity names in extended SIF (mostly KEGG pathways) #223

Closed cannin closed 8 years ago

cannin commented 9 years ago

Entity names are truncated with ... in KEGG data. Example:

(5Z,8Z,11Z,14Z... consumption-controlled-by ACOT1 KEGG Biosynthesis of unsaturated fatty acids

IgorRodchenkov commented 8 years ago

In fact, these names were not truncated; it's a different story from the already fixed similar issue about pathway names. Values like "5Z,8Z,11Z,14Z..." occur in the standardName property. So, we can simply remove those (set null) or replace with the displayName value.

cannin commented 8 years ago

Please replace with the entire, complete name if that's is the displayName.

IgorRodchenkov commented 8 years ago

What do you mean by entire, complete (if the data provider did not pick which one is the standard name, we can only pick the displayName, a random name from those available, or none)?

That’s what I was thinking of. Usually, there’s like: displayName=“PANK” (because it’s kinda pseudo-generic PR, not a specific one, such as PANK2); standardName="PANK1, MGC24596, PANK, PANK1a, PANK1b…”, and the listed names are also separate values of the ‘name’ property (and there are more names usually). I am definitely not going to list all the names in the standardName; will use either “PANK”, like in the above case, or set null (in this case, it will be later, in cPath2 Premerge stage, auto-replaced with the shortest name) if no displayName there present. Ok?

cannin commented 8 years ago

If the displayName completes: 5Z,8Z,11Z,14Z... then we should use it. If it does not, then we have to go to KEGGTranslator and have them not truncate the name.

IgorRodchenkov commented 8 years ago

Hey, Augustin... "5Z,8Z,11Z,14Z..." should not be a name, especially not a standard name at all. I haven't seen yet ER or PE objects there (translated KEGG) that have displayName like "5Z,8Z,11Z,14Z...". As I mentioned earlier, it's like displayName=“PANK” (a ProteinReference's).

Also, perhaps, we just waste time on this minor data issue... Don't you agree?

cannin commented 8 years ago

SIF entries that include "SOMETHING..." are useless. They need to be fixed. They can be replaced with PubChem/CHEBI IDs or the some more complete name that fills in the "...". This in not a trivial issue in that it make ~10% of the KEGG SIF entries hard to use. If (like the pathway names) there are unintended duplicates because of the truncation then it is not a waste of time to resolve.

cannin commented 8 years ago

Maybe we're misunderstanding each other. Here's another example: (25S)-3alpha,7alpha... By "complete" name I mean this: (25S)-3alpha,7alpha,12alpha-Trihydroxy-5beta-cholestan-26-oyl-CoA from http://www.genome.jp/dbget-bin/www_bget?cpd:C17343 I'm not sure if there is the belief that "(25S)-3alpha" is 1 possible name because of the commas, but this is not the case. I want to be sure there are no unintended duplicates because of the truncation down to "(25S)-3alpha,7alpha...". This issue is happening on almost all the lines that have chemical names in the KEGG SIF. Given a "complete" chemical name, it should be possible to draw the chemical or at least get a sense of the type of chemical involved, but "(25S)-3alpha,7alpha..." is insufficient for any use.

You could replace with the “complete” name or one of the identifiers: PubChem: 96023700 ChEBI: 81017 KEGG Compound: C17343

IgorRodchenkov commented 8 years ago

Ok, got it. It was unclear in the first message that we're talking about SmallMoleculeReferences, and SIF output; with examples...

So we're trying to fix examples like this:

Apo-[acyl-carrier-protein]/bp:name Apo-[acyl-carrier-pr.../bp:displayName C03688/bp:standardName /bp:SmallMoleculeReference And as I can see from other examples, their chemical 'displayName' is usually the truncated comma-separated list of the values from the 'name' property, and 'standardName' is a KEGG Compound ID. And it's vice versa for proteins (PRs), where displayName is ok, but standardName is build using names and "..." Besides, current pattern/SIF framework (using default CommonIDFetcher) gets displayNames of small molecule references. It was probably a quick-fix by Ozgun (probably because displayName is there usually present originally or auto-set by the Normalizer from all names). I am trying to implement another IDFetcher, more configurable, where one could e.g., define preferred ID types and their order for writing to SIF... Unfortunately, unless you always use absolute URIs as IDs for SIF, you gonna lost some of inferred binary interactions, depending on which ID type you use, if the data isn't 100% perfect... We should probably always use URIs instead of unreliable names and IDs for our SIF output (names and IDs can be found from the extended SIF then.) Considering all the said above, I'd quick-fix the 'displayName', for now, by updating the KEGG chemical's displayName to either a) one of 'name' property values there available (i.e., use the shortest, or partially matching name); or b) concatenate all the names (though displayName is supposed to be short...); or c) use the existing standardName, which seems to be a KEGG Compound ID and is always there). Similarly, for protein etc. sequence entity references, but we are going to be fixing 'standardName' instead (actually, we can simply remove those standardName that end with "..." from sequence ERs; it's ok).
IgorRodchenkov commented 8 years ago

BTW, interestingly, now that you've noticed: "...issue is happening on almost all the lines that have chemical names in the KEGG SIF." - despite there're valid ChEBI IDs - helped me catch the issue #225. OMG.

IgorRodchenkov commented 8 years ago

Despite the latest commit (KeggHsaCleanerImpl), there're still some cases where display names like "blah..." slipped through and made it to the final BioPAX model (these were probably recovered by normalizer.fixDisplayNames, which, if the display and standard name are null, uses the shortest available name)...

IgorRodchenkov commented 8 years ago

The latest commit in the KeggHsaCleanerImpl is for future releases... In mean time, I also addressed this issue in Paxtools's pattern - CommonIDFetcher (will avoid names having "..." in the SIF output), and will simply re-generate 'All', 'Detailed' and 'kegg_hsa' SIF output archives shortly...

cannin commented 8 years ago

I'll be happy to look at the output when you've prepared it. I think the solution I would favor would be to get Andreas to remove the truncation. It is a problem that we brought on ourselves with the recommendation to truncate. Is the KEGGTranslator code available somewhere? It seems like a pretty modification to make. Have you contacted Andreas? We are coming up with clever solutions without solving the trivial source of the problem.

IgorRodchenkov commented 8 years ago

I am writing to Andreas right now, and will ask him to share the code on Github if possible...

On Thu, Nov 12, 2015 at 2:33 PM, Augustin Luna notifications@github.com wrote:

I'll be happy to look at the output when you've prepared it. I think the solution I would favor would be to get Andreas to remove the truncation. It is a problem that we brought on ourselves with the recommendation to truncate. Is the KEGGTranslator code available somewhere? It seems like a pretty modification to make. Have you contacted Andreas? We are coming up with clever solutions without solving the trivial source of the problem.

— Reply to this email directly or view it on GitHub https://github.com/PathwayCommons/cpath2/issues/223#issuecomment-156210344 .