Closed cannin closed 8 years ago
In fact, these names were not truncated; it's a different story from the already fixed similar issue about pathway names. Values like "5Z,8Z,11Z,14Z..." occur in the standardName property. So, we can simply remove those (set null) or replace with the displayName value.
Please replace with the entire, complete name if that's is the displayName.
What do you mean by entire, complete (if the data provider did not pick which one is the standard name, we can only pick the displayName, a random name from those available, or none)?
That’s what I was thinking of. Usually, there’s like: displayName=“PANK” (because it’s kinda pseudo-generic PR, not a specific one, such as PANK2); standardName="PANK1, MGC24596, PANK, PANK1a, PANK1b…”, and the listed names are also separate values of the ‘name’ property (and there are more names usually). I am definitely not going to list all the names in the standardName; will use either “PANK”, like in the above case, or set null (in this case, it will be later, in cPath2 Premerge stage, auto-replaced with the shortest name) if no displayName there present. Ok?
If the displayName completes: 5Z,8Z,11Z,14Z... then we should use it. If it does not, then we have to go to KEGGTranslator and have them not truncate the name.
Hey, Augustin... "5Z,8Z,11Z,14Z..." should not be a name, especially not a standard name at all. I haven't seen yet ER or PE objects there (translated KEGG) that have displayName like "5Z,8Z,11Z,14Z...". As I mentioned earlier, it's like displayName=“PANK” (a ProteinReference's).
Also, perhaps, we just waste time on this minor data issue... Don't you agree?
SIF entries that include "SOMETHING..." are useless. They need to be fixed. They can be replaced with PubChem/CHEBI IDs or the some more complete name that fills in the "...". This in not a trivial issue in that it make ~10% of the KEGG SIF entries hard to use. If (like the pathway names) there are unintended duplicates because of the truncation then it is not a waste of time to resolve.
Maybe we're misunderstanding each other. Here's another example: (25S)-3alpha,7alpha... By "complete" name I mean this: (25S)-3alpha,7alpha,12alpha-Trihydroxy-5beta-cholestan-26-oyl-CoA from http://www.genome.jp/dbget-bin/www_bget?cpd:C17343 I'm not sure if there is the belief that "(25S)-3alpha" is 1 possible name because of the commas, but this is not the case. I want to be sure there are no unintended duplicates because of the truncation down to "(25S)-3alpha,7alpha...". This issue is happening on almost all the lines that have chemical names in the KEGG SIF. Given a "complete" chemical name, it should be possible to draw the chemical or at least get a sense of the type of chemical involved, but "(25S)-3alpha,7alpha..." is insufficient for any use.
You could replace with the “complete” name or one of the identifiers: PubChem: 96023700 ChEBI: 81017 KEGG Compound: C17343
Ok, got it. It was unclear in the first message that we're talking about SmallMoleculeReferences, and SIF output; with examples...
So we're trying to fix examples like this:
BTW, interestingly, now that you've noticed: "...issue is happening on almost all the lines that have chemical names in the KEGG SIF." - despite there're valid ChEBI IDs - helped me catch the issue #225. OMG.
Despite the latest commit (KeggHsaCleanerImpl), there're still some cases where display names like "blah..." slipped through and made it to the final BioPAX model (these were probably recovered by normalizer.fixDisplayNames, which, if the display and standard name are null, uses the shortest available name)...
The latest commit in the KeggHsaCleanerImpl is for future releases... In mean time, I also addressed this issue in Paxtools's pattern - CommonIDFetcher (will avoid names having "..." in the SIF output), and will simply re-generate 'All', 'Detailed' and 'kegg_hsa' SIF output archives shortly...
I'll be happy to look at the output when you've prepared it. I think the solution I would favor would be to get Andreas to remove the truncation. It is a problem that we brought on ourselves with the recommendation to truncate. Is the KEGGTranslator code available somewhere? It seems like a pretty modification to make. Have you contacted Andreas? We are coming up with clever solutions without solving the trivial source of the problem.
I am writing to Andreas right now, and will ask him to share the code on Github if possible...
On Thu, Nov 12, 2015 at 2:33 PM, Augustin Luna notifications@github.com wrote:
I'll be happy to look at the output when you've prepared it. I think the solution I would favor would be to get Andreas to remove the truncation. It is a problem that we brought on ourselves with the recommendation to truncate. Is the KEGGTranslator code available somewhere? It seems like a pretty modification to make. Have you contacted Andreas? We are coming up with clever solutions without solving the trivial source of the problem.
— Reply to this email directly or view it on GitHub https://github.com/PathwayCommons/cpath2/issues/223#issuecomment-156210344 .
Entity names are truncated with ... in KEGG data. Example:
(5Z,8Z,11Z,14Z... consumption-controlled-by ACOT1 KEGG Biosynthesis of unsaturated fatty acids