FlyBase / GO-curation

For projects related to GO curation in FlyBase
MIT License

PIPES & COMMAS #58

Closed hattrill closed 1 year ago

hattrill commented 1 year ago

Just adding some notes to look at this issue before Xmas break: Issue in https://flybase.atlassian.net/browse/WEB-2095

The 05 XML has this entry:

inferred from electronic annotation with InterPro:IPR001154, InterPro:IPR001241, InterPro:IPR002205, InterPro:IPR013506, InterPro:IPR013757, InterPro:IPR013758, InterPro:IPR013759, InterPro:IPR013760, InterPro:IPR018522

The 06 XML is slightly different in format:

inferred from electronic annotation with InterPro:IPR001154,InterPro:IPR001241,InterPro:IPR002205,InterPro:IPR013506,InterPro:IPR013758,InterPro:IPR013759,InterPro:IPR018522

This resulted in wrapping issues that Jim has fixed.

Need to have a look at the Input and Output of these lines:

I’ve just found an example with UniProt IDs in strings that is on FB2022_05, so this may be something that’s been around for a while but not picked up, as we don’t have many annotations with long strings of IDs and it only shows if you are on a small screen.

As you’ve pointed out, for the InterPro IDs, the reason that this is showing up now seems to be the change from “comma-space” to “comma”.

In the past, it looks like we’ve used a “comma” for InterPro in the GAF output, but I think that a pipe separator would be more appropriate, as comma = “AND” and pipe = “OR”. As I understand it, this is not the way it is stored in chado, so I need to understand a bit more about how this “transformation” is handled in the pipeline.
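To make the AND/OR reading concrete, here is a minimal sketch of how a with/from string could be split into OR-alternatives, each holding a set of AND-ed IDs. The function names `parse_with_field` and `format_with_field` are hypothetical and are not part of the FlyBase pipeline; this only illustrates the comma = AND, pipe = OR convention described above.

```python
def parse_with_field(value):
    """Split a with/from string into OR-groups (pipe) of AND-ed IDs (comma).

    Hypothetical helper for illustration; tolerates "comma-space" as well
    as bare "comma" separators.
    """
    groups = []
    for alternative in value.split("|"):
        ids = [part.strip() for part in alternative.split(",") if part.strip()]
        if ids:
            groups.append(ids)
    return groups


def format_with_field(groups):
    """Write the groups back out using bare commas and pipes."""
    return "|".join(",".join(ids) for ids in groups)


if __name__ == "__main__":
    # Pipe: two alternatives, one ID each.
    print(parse_with_field("FB:FBgn0037728|FB:FBgn0266599"))
    # Comma: one alternative made up of two IDs that apply together.
    print(parse_with_field("InterPro:IPR001154, InterPro:IPR001241"))
```

Keeping the parsed form as a list of lists means the separator choice is only made when the value is written out, which is where the comma/pipe difference seems to be getting lost.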

Will just add these IDs for myself, as they give a range of different cases to think about:

Q9W0H3 ; FBgn0035206 ; sturkopf

Q9VM50 ; FBgn0031882 ; Rab30

Q7KVP9 ; FBgn0261850 ; Xpd

hattrill commented 1 year ago

|   | COMMA (AND) | PIPE (OR) | TOTAL | PIPES that are InterPro | PIPES that are not InterPro |
| --- | --- | --- | --- | --- | --- |
| GAF06 | 2544 | 8 | 2552 | 2 | 6 |
| GAF05 | 543 | 722 | 1265 | 716 | 6 |
| GAF04 | 544 | 691 | 1235 | 685 | 6 |

So, looks like we need to make sure that syntax in chado for InterPro annotations is the same as when the gene2GO pipeline was in place.
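For reference, counts like the ones in the table above could be reproduced with something along these lines. This is a sketch only: it assumes a standard tab-separated GAF 2.x file in which With/From is column 8, and the function name `count_separators` and the counter keys are made up for illustration. How the original counts were actually produced is not recorded here.

```python
from collections import Counter

# Assumption: standard GAF 2.x layout, With/From is column 8 (index 7).
WITH_COLUMN = 7


def count_separators(gaf_lines):
    """Count annotation lines whose With/From value contains a comma or a pipe."""
    counts = Counter()
    for line in gaf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) <= WITH_COLUMN:
            continue
        with_value = columns[WITH_COLUMN]
        if "," in with_value:
            counts["comma"] += 1
        if "|" in with_value:
            counts["pipe"] += 1
            if "InterPro:" in with_value:
                counts["pipe_interpro"] += 1
            else:
                counts["pipe_other"] += 1
    return counts
```

A line containing both separators would be counted under both `comma` and `pipe` here; pass an open file handle (or any iterable of lines) to run it over a whole GAF.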

hattrill commented 1 year ago

There are worryingly few pipe-separated lines that are non-InterPro in the withs:

- FB:FBgn0025583|FB:FBgn0034329
- FB:FBgn0036752|FB:FBgn0038172
- FB:FBgn0037690|FB:FBgn0038165
- FB:FBgn0038348|FB:FBgn0038349|FB:FBgn0267408
- FB:FBgn0038965|FB:FBgn0259483
- FB:FBgn0039132|FB:FBgn0043012

hattrill commented 1 year ago

For the gene_association file from P2GO:

|   | COMMA (AND) | PIPE (OR) |
| --- | --- | --- |
| GPAD_INTERPRO only | 0 | 5508 |
| GPAD_FlyBase_only | 397 | 93 |

hattrill commented 1 year ago

For non-InterPro annotations, it looks like pipes are being 'translated' into commas in the chado load:

GAF:

FBgn0001233 Hsp83 involved_in GO:0070922 FB:FBrf0238945|PMID:29775584 IGI FB:FBgn0037728,FB:FBgn0266599

FBgn0001233 Hsp83 involved_in GO:0070922 FB:FBrf0247227|PMID:33176138 IPI FB:FBgn0036020,FB:FBgn0087035,FB:FBgn0262739

gp_ass from P2GO:

UniProtKB:P02828 RO:0002331 GO:0070922 PMID:29775584 ECO:0000316 FB:FBgn0037728|FB:FBgn0266599 2022-02-07 FlyBase

UniProtKB:P02828 RO:0002331 GO:0070922 PMID:33176138 ECO:0000353 FB:FBgn0036020,FB:FBgn0087035,FB:FBgn0262739 2022-02-07 FlyBase
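A quick way to flag this kind of change is to compare the with value from the incoming file against the value written back out, and report whether the same IDs came through with different grouping. The sketch below is illustrative only; `split_groups` and `compare_with_values` are hypothetical names, and pairing up the two lines for the same annotation is assumed to have been done already.

```python
def split_groups(value):
    """Pipe separates alternatives (OR); comma joins IDs within one (AND)."""
    return [
        sorted(part.strip() for part in alternative.split(",") if part.strip())
        for alternative in value.split("|")
        if alternative.strip()
    ]


def compare_with_values(source_value, exported_value):
    """Classify how two with values for the same annotation differ."""
    source = split_groups(source_value)
    exported = split_groups(exported_value)
    source_ids = sorted(i for group in source for i in group)
    exported_ids = sorted(i for group in exported for i in group)
    if source_ids != exported_ids:
        return "different IDs"
    if sorted(source) == sorted(exported):
        return "match"
    return "separator changed"


if __name__ == "__main__":
    # The IGI example above: pipe in the P2GO file, comma in the GAF.
    print(compare_with_values("FB:FBgn0037728|FB:FBgn0266599",
                              "FB:FBgn0037728,FB:FBgn0266599"))
```

On the IGI pair above this reports a separator change, while the IPI pair, which uses commas in both files, comes back as a match.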

hattrill commented 1 year ago

So, in all instances, we need to make sure that pipes and commas are correctly translated into and out of chado. For display on the website, both can be shown as comma-separated strings with no distinction between them, as the AND/OR difference is not something the display needs to resolve.
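On the display side, that could be as simple as the sketch below: split on either separator and re-join with “comma-space”, which also gives the browser somewhere to wrap long ID strings. The function name `with_field_for_display` is hypothetical and this is not the website code.

```python
import re


def with_field_for_display(value):
    """Render a with/from string as a comma-space list, ignoring AND/OR."""
    ids = [part.strip() for part in re.split(r"[|,]", value) if part.strip()]
    return ", ".join(ids)


if __name__ == "__main__":
    print(with_field_for_display("InterPro:IPR001154,InterPro:IPR001241"))
    print(with_field_for_display("FB:FBgn0037728|FB:FBgn0266599"))
```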

hattrill commented 1 year ago

From test load:

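|   | COMMA (AND) | PIPE (OR) | TOTAL | PIPES that are InterPro | PIPES that are not InterPro |
| --- | --- | --- | --- | --- | --- |
| TEST | 514 | 2010 | 2524 | 1983 | 27 |
| GAF06 | 2544 | 8 | 2552 | 2 | 6 |
| GAF05 | 543 | 722 | 1265 | 716 | 6 |
| GAF04 | 544 | 691 | 1235 | 685 | 6 |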