Change which data is displayed in the Pathogen and Host sections

PHI-base / PHI5_web_display

PHI5_web_display will allow to display PHI-Canto data

Change which data is displayed in the Pathogen and Host sections #56

Closed jseager7 closed 10 months ago

jseager7 commented 2 years ago

(Follow-up from #51)

The PHI-base team has recently reviewed the Pathogen and Host sections of the gene page and identified a number of problems. We've decided to clarify the requirements for these sections.

For pathogen gene pages:
- The Pathogen section should show a list of all pathogen strains that have annotations involving the gene of the gene page.
- The Host section should display a list of genes, plus the scientific name and NCBI Taxonomy ID, of any host that interacts with the pathogen (as part of a metagenotype). Wild type host genotypes (genotypes with no genes) will presumably not be included now.
For host gene pages:
- The Host section should show a list of all host strains that have annotations involving the gene of the gene page.
- The Pathogen section should display a list of genes, plus the scientific name and NCBI Taxonomy ID, of any pathogen that interacts with the host (as part of a metagenotype).
We also decided that we shouldn't show the Reference column in either the Pathogen or Host section, because the reference is included with the annotations in other tables, and the reference is not likely to work well when data is being aggregated like this.

Pathogen gene page

Below is a mockup of how the Pathogen and Host sections should appear for a pathogen gene, specifically RALF of Fusarium graminearum (FGRAMPH1_01T16205; PHIG:278).

Screenshot_20220808_164739

Note that UniProtKB has no names for the genes in the image above, and we don't export the gene names we have recorded (FER1) in the PHI-Canto JSON export independently of the allele names. So for now, we'll probably just have to display the UniProtKB accession number in the gene column when there is no gene name in UniProt.

Host gene page

Below is a mockup of how the Pathogen and Host sections should appear for a host gene, specifically Cf-4A of Solanum lycopersicum (PHIG:311).

Screenshot_20220808_165948

Note that in this case there are two strains listed in the Host section, because the Cf-4A gene has been annotated as part of two strains. The current interface only displays "cv. Moneymaker", which is incorrect. The pathogen gene also has a name in this case because the name exists in UniProtKB.

The mockup above shows row grouping in the Pathogen section so that the host name and taxon ID are not repeated in every row: this would be nice to have, but is not absolutely required.

@Molecular-Connections Since the logic to extract the correct data from the export could be quite difficult, I could include summary lists of strains and interacting genes for each gene in the new JSON export format, so for these sections you would only have to display data that is already in the export.

Alternatively, I could provide instructions (pseudocode) for how to extract the data from the current export format.

Please let me know what you'd prefer.

jashobanta-mcpl commented 2 years ago

Thanks @jseager7. Please let us know the steps for implementing it with the current JSON format. We will implement it in the new export format later.

jseager7 commented 2 years ago

@jashobanta-mcpl See below for a description of how to populate the Pathogen and Host sections with the current export format.

Get all strains for a gene

When the organism of the gene page is a pathogen, then the Pathogen section should contain all the strains linked to the genotypes that contain the page's pathogen gene. When the organism of the gene page is a host, then the Host section should contain all the strains linked to the genotypes that contain the page's host gene.

The strains can be found by searching all the genotypes in the export and checking whether or not they contain the page's gene. Alternatively, the strains of the genotypes already linked to the gene page could be searched.

Pseudocode

set page_gene to the UniProtKB accession number of the gene page (e.g. "Q00909" for PHIG:253)
for each session in curation_sessions:
  - for each genotype in session.genotypes:
    - for each locus in genotype.loci:
      - for each locus_allele in locus:
        - set strain_name to the value of the organism_strain property of the genotype object
        - set allele_id to the value of the id property of the locus_allele object
        - look up allele_id in the property names of the alleles object (which is in the session object)
        - set gene_id to the value of the gene property of the matching allele object
        - look up gene_id in the property names of the genes object (which is in the session object)
        - set uniprot_id to the value of the uniquename property of the matching gene object (this is the UniProtKB accession number)
        - if uniprot_id equals page_gene:
          - if the organism of the gene page is a pathogen:
            - display the strain_name in the "Experimental strain" column of the Pathogen section
          - else:
            - display the strain_name in the "Host strain" column of the Host section
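
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration, not the site's actual parsing code: the function name `strains_for_gene` is made up, and it assumes that `curation_sessions` is a mapping of session ID to session object and that every locus allele ID resolves in the session's `alleles` and `genes` objects.

```python
def strains_for_gene(curation_sessions, page_gene):
    """Return the sorted, de-duplicated strain names of every genotype
    that contains the gene with UniProtKB accession `page_gene`.

    Assumption: `curation_sessions` maps session IDs to session objects.
    """
    strains = set()
    for session in curation_sessions.values():
        alleles = session.get("alleles", {})
        genes = session.get("genes", {})
        for genotype in session.get("genotypes", {}).values():
            # Each locus is a list of locus allele objects.
            for locus in genotype.get("loci", []):
                for locus_allele in locus:
                    allele = alleles[locus_allele["id"]]
                    gene = genes[allele["gene"]]
                    if gene["uniquename"] == page_gene:
                        strains.add(genotype["organism_strain"])
    return sorted(strains)
```

The caller would then write the returned names into the "Experimental strain" column (pathogen gene page) or the "Host strain" column (host gene page). Wild type genotypes have an empty loci array, so they never contribute a strain here.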

Get all interacting genes for a gene

For example, when the organism of the gene page is a pathogen, then the Host section should contain all the host genes that interact with the pathogen gene. For pathogen genes, the "interacting genes" are any host genes in a metagenotype with the pathogen gene, or any host genes in a physical interaction with the pathogen gene (vice versa for host genes).

The interacting genes can be found by searching all annotations, filtering for metagenotype annotations and physical interactions, checking whether the interaction contains the pathogen (or host) gene, then extracting all the host (or pathogen) genes from the interaction.

The interacting genes in the Pathogen or Host gene section should be listed by primary gene name, but since the gene name is not included in the JSON export (only the allele name is included), the primary gene name must be retrieved from UniProtKB, or queried from the PHI-base 5 database if the gene name is already stored there.

Pseudocode

The first step is to get the interacting genes for the page's gene:

set interacting_genes to an empty list
set page_gene to the UniProtKB accession number of the gene page (e.g. "Q00909" for PHIG:253)
for each session in curation_sessions:
  - for each annotation in session.annotations:
    - if the annotation object has a "metagenotype" property:
      - set metagenotype to the metagenotype referenced by the annotation
      - if the organism of the gene page is a pathogen:
        - look up metagenotype.pathogen_genotype in the property names of the genotypes object (which is in the session object)
        - set pathogen_genotype to the matching pathogen genotype
        - search through the genes linked to pathogen_genotype (loci → alleles → genes) and check if any of the genes match the page_gene (based on the UniProtKB ID)
        - if no matching gene is found:
          - go to the next annotation
        - else:
          - set interacting_genotype_id to metagenotype.host_genotype
      - else: (if the organism of the gene page is a host)
        - look up metagenotype.host_genotype in the property names of the genotypes object (which is in the session object)
        - set host_genotype to the matching host genotype
        - search through the genes linked to host_genotype (loci → alleles → genes) and check if any of the genes match the page_gene (based on the UniProtKB ID)
        - if no matching gene is found:
          - go to the next annotation
        - else:
          - set interacting_genotype_id to metagenotype.pathogen_genotype
      - look up interacting_genotype_id in the property names of the genotypes object and set interacting_genotype to the matching genotype
      - for each locus in interacting_genotype.loci:
        - for each locus_allele in locus:
          - set allele_id to the value of the id property of the locus_allele object
          - look up allele_id in the property names of the alleles object (which is in the session object)
          - set gene_id to the value of the gene property of the matching allele object
          - look up gene_id in the property names of the genes object (which is in the session object)
          - add the value of the uniquename property of the matching gene object to the interacting_genes list
    - else if annotation.type equals "physical_interaction":
      - set gene_id to annotation.gene
      - set partner_id to the first item of annotation.interacting_genes
      - look up gene_id and partner_id in the property names of the genes object
      - set gene_uniprot_id and partner_uniprot_id to the values of the uniquename property of the matching gene objects
      - if gene_uniprot_id equals page_gene:
        - add partner_uniprot_id to the interacting_genes list
      - else if partner_uniprot_id equals page_gene: (the interacting gene matches instead)
        - add gene_uniprot_id to the interacting_genes list
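
As a rough Python sketch of the same search (an illustration under assumptions, not the actual implementation): the function names are made up, `curation_sessions` is assumed to map session IDs to session objects, `annotations` is assumed to be a list, and the annotation's `metagenotype` property is assumed to hold an ID that resolves in the session's `metagenotypes` object. For Physical Interactions the sketch records the partner of the page's gene, and it already drops duplicates by collecting into a set.

```python
def genotype_accessions(session, genotype_id):
    """Return the UniProtKB accessions of all genes in one genotype."""
    accessions = set()
    for locus in session["genotypes"][genotype_id].get("loci", []):
        for locus_allele in locus:
            gene_id = session["alleles"][locus_allele["id"]]["gene"]
            accessions.add(session["genes"][gene_id]["uniquename"])
    return accessions


def interacting_genes(curation_sessions, page_gene, page_is_pathogen):
    """Return sorted UniProtKB accessions of genes interacting with page_gene."""
    if page_is_pathogen:
        own, other = "pathogen_genotype", "host_genotype"
    else:
        own, other = "host_genotype", "pathogen_genotype"
    found = set()
    for session in curation_sessions.values():
        genes = session.get("genes", {})
        metagenotypes = session.get("metagenotypes", {})
        for annotation in session.get("annotations", []):
            if "metagenotype" in annotation:
                # Assumption: the property holds a metagenotype ID.
                metagenotype = metagenotypes[annotation["metagenotype"]]
                if page_gene in genotype_accessions(session, metagenotype[own]):
                    found |= genotype_accessions(session, metagenotype[other])
            elif annotation.get("type") == "physical_interaction":
                gene_id = annotation["gene"]
                partner_id = annotation["interacting_genes"][0]
                gene_acc = genes[gene_id]["uniquename"]
                partner_acc = genes[partner_id]["uniquename"]
                if gene_acc == page_gene:
                    found.add(partner_acc)
                elif partner_acc == page_gene:
                    found.add(gene_acc)
    return sorted(found)
```

This sketch does not filter out same-species or same-role Physical Interactions; those checks would have to be layered on top.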

Then the list of interacting genes must be filtered to remove duplicates, and the primary gene names must be retrieved and displayed:

set unique_interacting_genes to a unique list of values from interacting_genes (drop any duplicates)
for each uniprot_id in unique_interacting_genes:
  - look up the uniprot_id in UniProtKB and get the primary gene name for the accession, or get the gene name from the PHI-base 5 database
  - if the organism of the gene page is a pathogen:
    - add the primary gene name to the "Host gene" column of the Host section
    - add a link from the primary gene name to the gene page of the corresponding host gene
  - else:
    - add the primary gene name to the "Pathogen gene" column of the Pathogen section
    - add a link from the primary gene name to the gene page of the corresponding pathogen gene

The primary gene name of the gene can be retrieved from the XML format of the UniProtKB accession:

<gene>
  <name type="primary">TRI5</name> <!-- primary name -->
  <name type="ORF">FGRRES_03537</name>
  <name type="ORF">FGSG_03537</name>
</gene>

If there is no primary name, then an ORF name should be displayed. If there are no gene names, then the UniProtKB accession number should be displayed.
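
The fallback order above (primary name, then ORF name, then accession) might be sketched like this in Python. The function name `display_gene_name` is hypothetical, and the sketch assumes the caller has already extracted the `<gene>` element from the UniProtKB XML entry as a string; fetching the entry is left out. Tags are compared by local name so that the sketch works whether or not the element carries an XML namespace.

```python
import xml.etree.ElementTree as ET


def display_gene_name(gene_xml, accession):
    """Pick the name to display for a gene.

    gene_xml: string containing one <gene> element (assumed input).
    accession: UniProtKB accession number, used as the last resort.
    """
    names = {}
    for element in ET.fromstring(gene_xml).iter():
        if not isinstance(element.tag, str):
            continue  # skip comments or processing instructions
        # Drop any "{namespace}" prefix from the tag name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "name" and element.text:
            # Keep only the first name seen for each type.
            names.setdefault(element.get("type"), element.text.strip())
    return names.get("primary") or names.get("ORF") or accession
```

When several ORF names are present, this sketch simply takes the first one; the comment above does not say which ORF name is preferred.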

jseager7 commented 2 years ago

Also, to find out whether a gene is a pathogen or host gene, you have to search each metagenotype in the export to check whether the gene is contained in the pathogen_genotype or the host_genotype.

You have to dereference the identifiers in the following sequence to get back to the gene:

- pathogen_genotype and host_genotype dereference to a genotype in the genotypes object,
- the id property of each locus allele object in a genotype dereferences to an allele in the alleles object,
- the gene property of an allele object dereferences to a gene in the genes object.

Once you have found the gene in one metagenotype, you can stop searching and continue to the next gene.
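
A minimal Python sketch of this role check, assuming a single session object is passed in and using a made-up function name (`gene_role`). It returns as soon as the gene is found in one metagenotype, and returns None when the gene is in no metagenotype at all, in which case the role cannot be decided this way.

```python
def gene_role(session, accession):
    """Return "pathogen", "host", or None for a UniProtKB accession."""

    def accessions(genotype_id):
        # genotype -> locus alleles -> alleles -> genes
        result = set()
        for locus in session["genotypes"][genotype_id].get("loci", []):
            for locus_allele in locus:
                gene_id = session["alleles"][locus_allele["id"]]["gene"]
                result.add(session["genes"][gene_id]["uniquename"])
        return result

    for metagenotype in session.get("metagenotypes", {}).values():
        if accession in accessions(metagenotype["pathogen_genotype"]):
            return "pathogen"
        if accession in accessions(metagenotype["host_genotype"]):
            return "host"
    return None
```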

I should really add the pathogen or host status of the gene to the new export format to make it easier to query.

jashobanta-mcpl commented 1 year ago

@jseager7, we are a bit confused by the data capture flow for populating multiple PHIG IDs in the Pathogen and Host blocks. Could you provide a flow with an actual JSON snippet showing the traversal for Q00909 and for any host gene? Thanks.

jseager7 commented 1 year ago

@jashobanta-mcpl I just realised I made a mistake in my original comment, where I stated that the gene in the example was TRI5 of Fusarium graminearum (UniProtKB:Q00909, PHIG:253). The gene is actually RALF of Fusarium graminearum (UniProtKB:A0A0E0SJI5, PHIG:278).

This might be the cause of the confusion, since there are no host genes involved in any interaction with Q00909 (meaning only wild type hosts are involved). In this case, the Host section will presumably be empty and not displayed, but I will have to confirm this with the PHI-base team. It may be that we still want to display the host species names, but without a link to a corresponding PHIG ID.

I'll soon provide instructions (with JSON examples) for the case where both a pathogen gene and a host gene are involved.

jseager7 commented 1 year ago

@jashobanta-mcpl Please see below for more instructions on how to populate the Pathogen and Host sections.

Pathogen gene pages

The pathogen gene examples will use the Tox1 gene from Parastagonospora nodorum (UniProtKB:A9JX75), since that is the only example I can find of a pathogen gene with multiple strains that is also involved in interactions with host genes. Note that for some reason, Tox1 does not appear on the PHI-base 5 website, despite being provided in the latest JSON export.

Pathogen section

To populate the Pathogen section for Tox1, we must find all the genotypes that contain Tox1, so that we can find all the strains for Tox1.

We will start by finding all gene objects that contain the UniProtKB accession number for Tox1 (A9JX75) in the uniquename property. For example:

"genes": {
  "Parastagonospora nodorum A9JX75": {
    "organism": "Parastagonospora nodorum",
    "uniquename": "A9JX75"  // matches UniProtKB accession number
  },
}

Then we can find the allele objects by looking up the gene ID (the key of the gene object) in the alleles collection of the session object. Shown below is one example:

"A9JX75:bd02fdb6831712ca-36": {
  "allele_type": "wild_type",
  "gene": "Parastagonospora nodorum A9JX75",  // matches gene ID
  "name": "Tox1+",
  "primary_identifier": "A9JX75:bd02fdb6831712ca-36",
  "synonyms": []
},

(Note that there should be 7 allele objects in total containing Tox1.)

Next, we can look up each matching allele ID (the key of the matching allele object, or the primary_identifier property) in the genotypes collection of the session object. One allele ID may match more than one genotype, as in the example below:

"bd02fdb6831712ca-genotype-22": {
  "loci": [
    [
      {
        "expression": "Wild type product level",
        "id": "A9JX75:bd02fdb6831712ca-36"  // matches allele ID
      }
    ]
  ],
  "organism_strain": "Sn2000",
  "organism_taxonid": 13684
},
"bd02fdb6831712ca-genotype-10": {
  "comment": "A9JX75_PHANO expr level unknown",
  "loci": [
    [
      {
        "expression": "Wild type product level",
        "id": "A9JX75:bd02fdb6831712ca-36"  // matches allele ID
      }
    ]
  ],
  "organism_strain": "SN15",
  "organism_taxonid": 13684
},

(Note that there should be 10 genotype objects in total containing Tox1.)

Finally, the strain names must be extracted from the organism_strain property and displayed in the 'Experimental strain' column of the Pathogen section. The 'Pathogen ID' column can also be populated with values from the organism_taxonid property. The final list of pathogen strains would be as follows:

SN15
Sn2000
Sn79-1087

Here's what this would look like in the UI:

(Note that the rowspan on the table rows is optional, but recommended.)

Host section

For Tox1, the Host section will be populated with all the host genes that are involved in an interaction with the Tox1 gene.

The search process starts with finding all pathogen genotypes that reference Tox1, as shown in the instructions above.

From here, we look up each pathogen genotype ID (the key of the matching genotype object) in the metagenotypes collection of the session object. Specifically, we match on the pathogen_genotype property.

"metagenotypes": {
  "bd02fdb6831712ca-metagenotype-1": {
    "host_genotype": "bd02fdb6831712ca-genotype-3",
    "pathogen_genotype": "bd02fdb6831712ca-genotype-10",  // matches pathogen genotype ID
    "type": "pathogen-host"
  },
}

(There should be 22 metagenotype objects in total containing Tox1.)

We then extract the host genotype IDs from the host_genotype property of each metagenotype, and look up the host genotype IDs in the genotypes collection of the session object.

"bd02fdb6831712ca-genotype-3": {  // matches host genotype ID
  "comment": "SnTox1-sensitive",
  "loci": [
    [
      {
        "expression": "Wild type product level",
        "id": "W5AB81:bd02fdb6831712ca-14"
      }
    ]
  ],
  "organism_strain": "cv. Chinese Spring",
  "organism_taxonid": 4565
},

The id property in the loci array of the host genotype contains an allele identifier for the host gene. We can look this up in the alleles collection:

"W5AB81:bd02fdb6831712ca-14": {  // matches host allele ID
  "allele_type": "wild_type",
  "gene": "Triticum aestivum W5AB81",
  "name": "Snn1+",
  "primary_identifier": "W5AB81:bd02fdb6831712ca-14",
  "synonyms": []
},

The gene property in the allele object contains the gene identifier for the host gene. We can look this up in the genes collection:

"Triticum aestivum W5AB81": {  // matches host gene ID
  "organism": "Triticum aestivum",
  "uniquename": "W5AB81"
}

Finally, the UniProtKB accession number can be retrieved from the uniquename property, and this can be used to map to the PHIG ID for the host gene.
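
The forward lookup walked through above (gene → alleles → genotypes → metagenotypes → host genotypes → host alleles → host genes) could be sketched in Python as follows. The function name is made up, a single session object is assumed, and mapping the resulting accessions to PHIG IDs is not shown because that mapping lives outside the export.

```python
def host_genes_for_pathogen_gene(session, accession):
    """Return sorted UniProtKB accessions of host genes that share a
    metagenotype with the pathogen gene `accession`."""
    genes = session["genes"]
    alleles = session["alleles"]
    genotypes = session["genotypes"]

    # gene -> alleles
    gene_ids = {gid for gid, g in genes.items() if g["uniquename"] == accession}
    allele_ids = {aid for aid, a in alleles.items() if a["gene"] in gene_ids}

    # alleles -> pathogen genotypes
    pathogen_genotype_ids = {
        genotype_id
        for genotype_id, genotype in genotypes.items()
        if any(la["id"] in allele_ids for locus in genotype["loci"] for la in locus)
    }

    # pathogen genotypes -> metagenotypes -> host genotypes
    host_genotype_ids = {
        m["host_genotype"]
        for m in session.get("metagenotypes", {}).values()
        if m["pathogen_genotype"] in pathogen_genotype_ids
    }

    # host genotypes -> host alleles -> host genes
    host_accessions = set()
    for genotype_id in host_genotype_ids:
        for locus in genotypes[genotype_id]["loci"]:
            for la in locus:
                host_accessions.add(genes[alleles[la["id"]]["gene"]]["uniquename"])
    return sorted(host_accessions)
```

With the Tox1 excerpts shown above, this would yield only W5AB81 (Snn1). Host genotypes with an empty loci array contribute no genes.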

In the case of Tox1, there is only one host gene involved: Snn1, which has the UniProtKB accession number W5AB81. Here's what this would look like in the UI:

(Note that PHIG:339 is a placeholder, since Snn1 doesn't have any PHIG ID assigned yet.)

Other pathogen genes are involved in interactions with multiple host genes, such as RALF of Fusarium graminearum (PHIG:278), which interacts with the following host genes:

A0A3B6ITE6
A0A3B5YY12
A0A3B5ZUX1
A0A3B6HVQ9
A0A3B5Y122
A0A3B6JHD1

Here's what this would look like in the UI:

Host gene pages

For the host gene pages, the process is effectively the same as for the pathogen gene pages, the only differences being:

- the Host section is populated by searching the list of genotypes for the UniProtKB accession number of the host gene, and
- the Pathogen section is populated by first searching for the matching host genotype IDs in the host_genotype property of each metagenotype object, then finding all the pathogen genes referenced by the pathogen_genotype property of that metagenotype object.

I can provide a full example of the Pathogen and Host sections for a host gene, if required.

jseager7 commented 1 year ago

@jashobanta-mcpl In the last meeting we decided on some further requirements for the Pathogen and Host sections:

- Pathogen genes involved in interactions with wild type hosts (that is, host genotypes with no alleles) should have these host species listed in the Host section of the pathogen gene page.
- Genes that interact through Physical Interaction annotations should also be included in the Host section (for pathogen genes) or the Pathogen section (for host genes).

See below for instructions.

Wild type hosts

This logic only applies to pathogen gene pages. During the process of looking up host genotypes in metagenotypes (i.e. the metagenotypes that also contain the pathogen gene), you may find host genotypes that have no alleles.

Here's an example of a metagenotype that involves the TRI5 gene (PHIG:253) of Fusarium graminearum:

"metagenotypes": {
  "d7b3170ded99924f-metagenotype-1": {
    "host_genotype": "Triticum-aestivum-wild-type-genotypeBobwhite",  // wild type host genotype
    "pathogen_genotype": "d7b3170ded99924f-genotype-1",  // pathogen genotype containing TRI5
    "type": "pathogen-host"
  }
}

Here is what the wild type host genotype looks like:

"genotypes": {
  "Triticum-aestivum-wild-type-genotypeBobwhite": {
    "loci": [],
    "organism_strain": "cv. Bobwhite",
    "organism_taxonid": 4565
  }
}

Note that the loci array is empty, indicating that there are no alleles.

In these cases, the Host section on the gene page will contain the host species name and the NCBI Taxonomy ID, but the PHIG ID column will be left blank and the Host gene column will have a placeholder of "(no genes)".

The host taxon ID can be retrieved from the organism_taxonid property of the genotype object. The species name can be retrieved by looking up the taxon ID in the organisms object of the curation session, and getting the full_name property:

"organisms": {
  "4565": {
    "full_name": "Triticum aestivum"
  }
}
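
A small Python sketch of building the Host section row for a wild type host, under the assumption of a single session object. The function name and the row keys (`host_species`, `taxon_id`, `host_gene`, `phig_id`) are hypothetical. Note that in the excerpts above organism_taxonid is a number, while the keys of the organisms object are strings, so the taxon ID is converted before the lookup.

```python
NO_GENES_PLACEHOLDER = "(no genes)"


def wild_type_host_row(session, host_genotype_id):
    """Return a Host section row for a wild type host genotype,
    or None if the genotype has alleles (and so is not wild type)."""
    genotype = session["genotypes"][host_genotype_id]
    if genotype["loci"]:
        return None
    taxon_id = genotype["organism_taxonid"]
    # Keys of the organisms object are strings in the export excerpts.
    species = session["organisms"][str(taxon_id)]["full_name"]
    return {
        "host_species": species,
        "taxon_id": taxon_id,
        "host_gene": NO_GENES_PLACEHOLDER,
        "phig_id": "",  # left blank: there is no gene to link to
    }
```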

Physical interactions

Physical interaction annotations are not metagenotype annotations, so the logic for extracting genes from these interactions is different.

Using EPI1 (PHIG:268) of Phytophthora infestans as an example, the first step is to find all Physical Interaction annotations that contain the UniProtKB accession number for EPI1, which is D0MVC9.

Here is an example annotation:

{
  "checked": "no",
  "creation_date": "2019-10-16",
  "curator": {
    "community_curated": false
  },
  "evidence_code": "Affinity Capture-Western",
  "figure": "Figure 5",
  "gene": "Phytophthora infestans D0MVC9",  // pathogen gene is EPI1
  "interacting_genes": [
    "Solanum lycopersicum O04678"  // host gene is P69B
  ],
  "publication": "PMID:15096512",
  "status": "new",
  "submitter_comment": "",
  "type": "physical_interaction"
}

Physical Interaction annotations can be identified by the type property having a value of physical_interaction.

Finding the interacting gene ID

The pathogen gene ID can be contained in either the gene property, or as the first item in the interacting_genes array.

1. Get the gene ID from the gene property.
2. Look up the gene ID in the genes object of the curation session, and get the matching gene object.
3. If the uniquename property of the gene object matches the UniProtKB accession number:
   1. Set the host gene to the first item of the interacting_genes array.
4. Else:
   1. Get the gene ID from the first item of the interacting_genes array.
   2. Look up the gene ID in the genes object of the curation session, and get the matching gene object.
   3. If the uniquename property of the gene object matches the UniProtKB accession number:
      1. Set the host gene to the gene property of the annotation object.
   4. Else:
      1. Continue to the next annotation (because the current annotation does not contain the pathogen gene).

Finding the interacting gene information

Once the host gene ID is found (and confirmed to be valid, following the checks below), the following steps can be used to get the information for the Host section:

- Look up the host gene ID in the genes object of the curation session, and get the matching gene object.
- Get the UniProtKB accession number from the uniquename property of the gene object. This can be used to get the host gene name (from UniProtKB) and the PHIG ID for the host gene.
- The host species name is in the organism property of the gene object.
- The NCBI Taxonomy ID can be found by searching the organisms object in the curation session for an object with a full_name property that matches the host species name. The taxon ID will be the key of the object.

Additional Physical Interaction requirements

There are two complications with Physical Interaction annotations that must be handled:

- Physical Interaction annotations can occur within the same species, so we first need to confirm that the gene ID does not belong to the same species as the gene of the gene page. To do this, we merely need to check that the species name contained in the gene ID is different to the species name of the gene of the gene page.
- Physical Interaction annotations can be between two pathogens or two hosts (instead of one pathogen and one host), so an additional check is needed to confirm that the interacting organism is of a different role to the organism of the gene page. See the next section for instructions.
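
The gene information lookup and the same-species check could look roughly like this in Python. Both function names are made up, a single session object is assumed, and the species comparison uses the organism property of the gene objects (which is assumed to carry the same species name that prefixes the gene ID). The same-role check is not covered here.

```python
def gene_info(session, gene_id):
    """Collect the accession, species name and taxon ID for a gene ID."""
    gene = session["genes"][gene_id]
    species = gene["organism"]
    # The taxon ID is the key of the organism whose full_name matches.
    taxon_id = next(
        (
            key
            for key, organism in session["organisms"].items()
            if organism["full_name"] == species
        ),
        None,
    )
    return {
        "accession": gene["uniquename"],
        "species": species,
        "taxon_id": taxon_id,
    }


def is_cross_species(session, page_gene_id, partner_gene_id):
    """True if the two gene IDs belong to different species."""
    genes = session["genes"]
    return genes[page_gene_id]["organism"] != genes[partner_gene_id]["organism"]
```

The taxon ID is returned exactly as it appears as a key in the organisms object, so it is a string in this sketch.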

Excluding same-role Physical Interactions

Shown below is an example of a Physical Interaction annotation between two pathogens. This type of Physical Interaction should be ignored when populating the list of pathogen genes in the Pathogen section.

{
  "checked": "yes",
  "creation_date": "2019-08-12",
  "curator": {
    "community_curated": false
  },
  "evidence_code": "Two-hybrid",
  "figure": "Figure 6C",
  "gene": "Saccharomyces cerevisiae P22007",  // first pathogen
  "interacting_genes": [
    "Magnaporthe oryzae L7JC49"  // second pathogen
  ],
  "publication": "PMID:31250536",
  "status": "new",
  "submitter_comment": "RAM1 interacts with RAS1",
  "type": "physical_interaction"
}

In the current export format, the only way to confirm whether a gene belongs to a pathogen is to check whether a genotype containing the gene has been annotated as a pathogen_phenotype annotation, or whether a genotype containing the gene is in the pathogen_genotype property of a metagenotype.

With the example above, there are no metagenotypes, so we can only use pathogen_phenotype annotations to decide whether the gene is a pathogen gene. The process is as follows.

First, find an allele containing the gene ID "Saccharomyces cerevisiae P22007":

"P22007:ab02789a62331ecf-3": {
  "allele_type": "other",
  "description": "transformant",
  "gene": "Saccharomyces cerevisiae P22007",  // gene ID matches
  "name": "pYES2-MoRAM1+",
  "primary_identifier": "P22007:ab02789a62331ecf-3",
  "synonyms": []
}

Next, find a genotype containing the allele ID for this allele:

"ab02789a62331ecf-genotype-7": {
  "comment": "complementation MoRAM1+ complements ScRAM1-",
  "loci": [
    [
      {
        "id": "P22007:ab02789a62331ecf-1"
      }
    ],
    [
      {
        "expression": "Overexpression",
        "id": "P22007:ab02789a62331ecf-3"  // allele ID matches
      }
    ]
  ],
  "organism_strain": "Unknown strain",
  "organism_taxonid": 4932
}

Next, find a pathogen phenotype annotation that references this genotype ID:

{
  "checked": "no",
  "conditions": [
    "PECO:0000102",
    "PECO:0005224",
    "PECO:0005247",
    "PECO:0000004",
    "PECO:0005269"
  ],
  "creation_date": "2019-08-12",
  "curator": {
    "community_curated": false
  },
  "evidence_code": "Cell growth assay",
  "extension": [],
  "figure": "Figure 6a",
  "genotype": "ab02789a62331ecf-genotype-7",  // genotype ID matches
  "publication": "PMID:31250536",
  "status": "new",
  "submitter_comment": "...",
  "term": "PHIPO:0000405",
  "type": "pathogen_phenotype"  // annotation type matches
}

The same process can be repeated for the interacting gene ID, "Magnaporthe oryzae L7JC49", confirming that both the primary gene and the interacting gene are pathogen genes:

{
  "checked": "no",
  "conditions": [],
  "creation_date": "2019-08-17",
  "curator": {
    "community_curated": false
  },
  "evidence_code": "Western blot assay",
  "extension": [
    {
      "rangeDisplayName": "L7JC49_MAGOP",
      "rangeType": "Gene",
      "rangeValue": "L7JC49",
      "relation": "assayed_using"
    }
  ],
  "figure": "Figure 6d",
  "genotype": "ab02789a62331ecf-genotype-8",  // genotype contains Magnaporthe oryzae L7JC49
  "publication": "PMID:31250536",
  "status": "new",
  "submitter_comment": "...",
  "term": "PHIPO:0001027",
  "type": "pathogen_phenotype"  // annotation type matches
}
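
The check described above could be sketched in Python like this, assuming a single session object whose annotations property is a list. The function name `is_pathogen_gene` is hypothetical. It looks for a pathogen_phenotype annotation on any genotype containing the gene, and falls back to the pathogen_genotype property of the metagenotypes.

```python
def is_pathogen_gene(session, gene_id):
    """True if any genotype containing the gene is used as a pathogen."""
    # gene -> alleles
    allele_ids = {
        allele_id
        for allele_id, allele in session["alleles"].items()
        if allele["gene"] == gene_id
    }
    # alleles -> genotypes
    genotype_ids = {
        genotype_id
        for genotype_id, genotype in session["genotypes"].items()
        if any(la["id"] in allele_ids for locus in genotype["loci"] for la in locus)
    }
    # genotypes -> pathogen_phenotype annotations
    for annotation in session.get("annotations", []):
        if (
            annotation.get("type") == "pathogen_phenotype"
            and annotation.get("genotype") in genotype_ids
        ):
            return True
    # genotypes -> metagenotypes (pathogen side)
    return any(
        metagenotype["pathogen_genotype"] in genotype_ids
        for metagenotype in session.get("metagenotypes", {}).values()
    )
```

A False result only means that no evidence was found in the session, not that the gene is a host gene.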

Better ways to identify pathogen and host genes

Since the process described above is very convoluted, and not even guaranteed to work all the time, a much simpler solution would be to extend the PHI-Canto JSON export with an additional property in the organism objects, stating whether an organism is a pathogen or host in each curation session. Here's a mockup:

"organisms": {
  "318829": {
    "full_name": "Magnaporthe oryzae",
    "role": "pathogen"
  },
  "4513": {
    "full_name": "Hordeum vulgare",
    "role": "host"
  },
  "4530": {
    "full_name": "Oryza sativa",
    "role": "host"
  },
  "4932": {
    "full_name": "Saccharomyces cerevisiae",
    "role": "pathogen"
  }
}

Alternatively, the list of host and pathogen species that are stored on the PHI-base/data repository could be used to classify the species as pathogen or host.
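
If the export did carry a role property as in the mockup above, the same-role check for Physical Interactions might reduce to something like the following Python sketch. This is speculative: the role property is only a proposal at this point, the function name `is_cross_role` is made up, and a single session object is assumed.

```python
def is_cross_role(session, gene_id_a, gene_id_b):
    """True only if both genes have a known role and the roles differ
    (one pathogen, one host). Relies on the proposed "role" property."""
    roles_by_species = {
        organism["full_name"]: organism.get("role")
        for organism in session["organisms"].values()
    }
    genes = session["genes"]
    role_a = roles_by_species.get(genes[gene_id_a]["organism"])
    role_b = roles_by_species.get(genes[gene_id_b]["organism"])
    # Treat a missing role as "cannot confirm", so return False.
    return None not in (role_a, role_b) and role_a != role_b
```

With the mockup data, an interaction between a Saccharomyces cerevisiae gene and a Magnaporthe oryzae gene would be rejected, since both species have the role "pathogen".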

Please let me know which solution would be the easiest for you.

jashobanta-mcpl commented 1 year ago

Implemented. Indexing is in progress.

jseager7 commented 1 year ago

In the last meeting we decided on the following additional requirements:

1. If the pathogen gene is part of a metagenotype with a wild type host, and also part of a metagenotype with a specified host gene, then the Host section should show both of these cases. The metagenotype with a specified host gene should not override the metagenotype with a wild type host. Specifically, there should be:
   - one row for the wild type host, where the 'Host gene' column has the text "(no genes)"; and
   - one row for the host genotype, where the 'Host gene' column has the host gene name (or UniProtKB accession number).
2. Genes from Physical Interaction annotations should be included in the Pathogen and Host sections.
   - Specifically, this means that for a pathogen gene, the interacting host gene from a Physical Interaction should be shown in the Host section. For a host gene, the interacting pathogen gene from a Physical Interaction should be shown in the Pathogen section.
   - Note that Physical Interaction annotations can be between a pathogen and a pathogen or a host and a host: in these cases, the interacting gene should not be included in the Pathogen or Host sections.

It seems like requirement 1) is already implemented in PHIG:278, but this might need further checking.

For requirement 2), it seems that same-role physical interactions are being excluded as expected, but there are some cases where the Pathogen or Host section is not being populated with genes from Physical Interaction annotations.

PHIG:297 is an example of the problem: the Physical Interaction section lists interactions with the RAM1 gene of S. cerevisiae (PHIG:300), but this gene is not included in the Host section of the gene page (in fact, the Host section is not shown at all).

We expected to see the RAM1 gene in the Host section, as in the following mockup:

CuzickA commented 1 year ago

Hi @jseager7, I've just been looking at the curation session for the Physical interaction annotation above.

I believe there is a curation error. The Physical interaction annotations should be between pathogen proteins within the same pathogen species. So there should be no host.

- Magnaporthe oryzae RAM1 interacting with Magnaporthe oryzae RAS1
- Magnaporthe oryzae RAM1 interacting with Magnaporthe oryzae RAS2

I shall make the changes from ScRAM1 to MoRAM1 in the curation session.

Note to self: I think this curation error was made because both MoRAM1 and ScRAM1 were reported in figure 6 and the PHI-Canto user interface shows both genes with the same name, 'RAM1'.

Linking ticket to https://github.com/PHI-base/curation/issues/33

jseager7 commented 1 year ago

@CuzickA Thanks for clarifying this, but even if the annotation is a curation error, the fact remains that the webpage doesn't seem to be displaying this case correctly.

We can still use this incorrect annotation to verify that the logic used to display the host genes in the Host section is correct, since I think the incorrect annotation might be the only example of this case that we have at the moment.

We can resolve the curation error when PHI-base 5 loads the next JSON export.

jashobanta-mcpl commented 1 year ago

@jseager7: In PHIG:297, both genes are pathogen genes, hence the Host block is missing.

{
  "checked": "yes",
  "creation_date": "2019-08-12",
  "curator": {
    "community_curated": false
  },
  "evidence_code": "Affinity Capture-Western",
  "figure": "Figure 6B",
  "gene": "Saccharomyces cerevisiae P22007",  // ab02789a62331ecf-genotype-7
  "interacting_genes": [
    "Magnaporthe oryzae L7JGN0"  // ab02789a62331ecf-genotype-8
  ],
  "publication": "PMID:31250536",
  "status": "new",
  "submitter_comment": "RAM1 interacts with RAS2",
  "type": "physical_interaction"
},

Please confirm the implementation .

jseager7 commented 1 year ago

@jashobanta-mcpl Sorry, that's my mistake. PHI-Canto is classifying both of these species as pathogens, and the interaction described in the publication is not a pathogen-host interaction.

So, in this case, there is indeed no Host block to display.

(Just to note, the gene page for PHIG:297 is still missing a Pathogen block though, which should be displayed.)

jashobanta-mcpl commented 1 year ago

@jseager7: It's not picked up because there are two ways of checking whether a gene is a pathogen gene:

1. Going by allele -> genotype ID -> metagenotype block -> pathogen_phenotype (for pathogens).
2. In the case of Physical Interactions, going by allele -> genotype ID -> annotation block -> genotype key + type is 'pathogen_phenotype'.

In PHIG:297, option 1 is not applicable because the metagenotype block is missing for that genotype ID.

Option 2 needs to be added to the parsing code. We will implement it. Please confirm both approaches.

jseager7 commented 1 year ago

In the last meeting we decided it would be simpler to include the pathogen or host role for each species in the JSON export, so I'll work on adding that.

jseager7 commented 1 year ago

@jashobanta-mcpl

Here's my feedback on the display of the Pathogen and Host sections.

Pathogen gene page

The following text uses PHIG:278 as an example of a pathogen gene.

Pathogen section

The Pathogen section on a pathogen gene page is not displayed correctly. The section should display a list of strains for the pathogen gene. It should not have the columns 'Pathogen gene' or 'PHI ID'.

Compare this to the mockup in the original comment:

As a reminder, the pathogen strains need to be collected from the metagenotypes shown on the gene page.

Host section

The Host section on a pathogen gene page now looks as expected, the only problem being that PHIG IDs are not hyperlinked to their respective gene pages:

The mockup in the original comment had these IDs hyperlinked to the gene page. In the example above, there should be a link to the gene page for PHIG:281.

Host gene page

The following text uses PHIG:311 as an example of a host gene.

Pathogen section

The Pathogen section on the host gene page displays correctly, the only problem being the lack of hyperlinks on the PHIG IDs.

The text "PHIG:312" should be hyperlinked to the gene page for PHIG:312.

Host section

Unfortunately, the Host section still isn't displayed correctly on host gene pages, since the host strains are not shown.

wrong_host_section

The mockup in the original comment had all the host strains for the Cf-4A gene, but these are not shown in the current UI.

correct_host_section

I've also noticed that the metagenotype details pop-up is missing the strain and species for the host:

Maybe this is related to the data indexing issue affecting issue #69.

jashobanta-mcpl commented 1 year ago

It's fixed.

Hyperlink to PHIG ID is enabled.

jseager7 commented 1 year ago

@jashobanta-mcpl Thanks, the Pathogen section looks correct on all the host gene pages that I checked.

However, the Host section is still wrong on other host gene pages.

For example, PHIG:350 still has the 'Host gene' and 'PHI ID' columns, when it should have the 'Host strain' column. It also has a pathogen gene from Parastagonospora nodorum included in the Host section.

The image below shows what I expected to see for PHIG:350.

Some more examples of this problem are PHIG:276, PHIG:292, and PHIG:342.

PHIG:292 is a difficult case since there are no annotations (and therefore no strains). It may be better in this case to show a placeholder for no strains in the Host section:

jashobanta-mcpl commented 1 year ago

@jseager7: It looks like a cache issue for PHIG:350 and the others. Please check again.

jseager7 commented 1 year ago

Clearing the cache fixed the issue for all of the above gene pages.

The only change that is now needed is to ensure that host gene pages with no annotations display '(no strains)' in the Host strain column of the Host section.

jseager7 commented 1 year ago

@jashobanta-mcpl Just as a reminder, we still need a "(no strains)" placeholder in the Host section for host gene pages that have no annotations.

Currently, PHIG:292 appears like this:

which is almost correct, but the "(no strains)" placeholder should be shown in the Host strain column.

jseager7 commented 10 months ago

The no strains placeholder is fixed for PHIG:292 now, though it only appeared after a cache reload.