biothings / mygeneset.info

Apache License 2.0

Geneset friendly name, description, date, author? #28

Closed vincerubinetti closed 1 year ago

vincerubinetti commented 3 years ago

Eventually users will be generating their own name, description, date, and author fields.

But for the built-in genesets, we currently don't have these fields. Can we possibly add them? Are description and date metadata available for KEGG and the like? As for the author, it could just be e.g. "KEGG", so it would be more of a generic creator/source field.

vincerubinetti commented 3 years ago

So we'd end up with this:

{
      "_id",
      "_score",
      "name", // NEW
      "description", // NEW
      "date", // NEW
      "author" / "source", // NEW
      "genes": [],
      "is_public",
      "taxid"
}

ravila4 commented 3 years ago

@dongbohu Here are the values we will use as name and description for each data source:

source: ctd
name: "${ctd.chemical_name} interactions"
description: "Chemical-gene interactions of ${ctd.chemical_name} in ${species_name}."

source: do
_id: ${do.id}
name: (The value after ":" in the current _id field.)
description: ${do.abstract}

source: go
name: 
description: ${definition}

source: kegg
name: ${kegg.name}[0]
description:

source: msigdb
name:
description:

source: reactome
name: ${reactome.geneset_name}
description:

source: wikipathways
name: ${wikipathways.pathway_name}
description:

ravila4 commented 3 years ago

I think we can leave description blank if we cannot obtain or generate one for a datasource. We should also delete the duplicate fields.
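As a rough illustration of the mapping above, here is a minimal Python sketch, not the actual parser code. The function name `name_and_description` and the `species_name` argument are hypothetical; only the nested field names (`ctd.chemical_name`, `reactome.geneset_name`, `wikipathways.pathway_name`) come from the list. Sources whose mapping is still blank or ambiguous (do, go, kegg, msigdb) are left out and fall back to empty strings.

```python
from typing import Optional


def name_and_description(source: str, doc: dict,
                         species_name: Optional[str] = None) -> dict:
    """Build the proposed name/description fields for one built-in geneset.

    Hypothetical helper: `doc` is assumed to hold source-specific fields
    nested under the source key, e.g. doc["ctd"]["chemical_name"].
    """
    fields = doc.get(source, {})
    name = ""
    description = ""  # left blank when no description can be generated

    if source == "ctd":
        chemical = fields["chemical_name"]
        name = f"{chemical} interactions"
        description = (
            f"Chemical-gene interactions of {chemical} in {species_name}."
        )
    elif source == "reactome":
        name = fields["geneset_name"]
    elif source == "wikipathways":
        name = fields["pathway_name"]

    return {"name": name, "description": description}
```

For example, `name_and_description("reactome", {"reactome": {"geneset_name": "Apoptosis"}})` gives a name and an empty description, matching the "leave it blank" rule.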

dongbohu commented 3 years ago

@ravila4 Thanks. I will add the new fields for kegg, do and ctd, and you can take care of the others.

ravila4 commented 3 years ago

Great. I'm still undecided about the date field. I'm not sure if we have a good value to add to it.

dongbohu commented 3 years ago

date can be either the date when the geneset is generated by the parser, or the date when the data source is updated. I prefer the former, because if we change the parser code, the geneset may change even if the data source has not been updated at all.

ravila4 commented 3 years ago

I brought up the issue with Chunlei, and he still feels that a date field for public genesets is not that relevant.

Another reason is that for each release we get a release notes doc like this:

Build version: '20210302'
=========================
Previous build version: '20210211'
Generated on: 2021-03-02 at 10:25:00

+--------------------+------------------------------------+------------------------------------+-----------------+---------------+
| Updated datasource |           prev. release            |            new release             | prev. # of docs | new # of docs |
+--------------------+------------------------------------+------------------------------------+-----------------+---------------+
| ctd                |                 -                  |        March-2-2021-16438M         |               - |        22,959 |
| do                 | obo-2021-01-28_genemap2-2021-02-10 | obo-2021-02-24_genemap2-2021-03-01 |           4,254 |         4,258 |
| kegg               |    Release 97.0+/02-06, Feb 21     |    Release 97.0+/02-27, Feb 21     |           5,453 |         5,456 |
| wikipathways       |              20201210              |              20210210              |           1,693 |         1,723 |
+--------------------+------------------------------------+------------------------------------+-----------------+---------------+
New datasource(s): ctd

+----------------+----------+---------+
| Updated stats. | previous |     new |
+----------------+----------+---------+
| total          |  120,332 | 143,328 |
+----------------+----------+---------+

New field(s): ctd

Overall, 143,328 documents in this release
23,002 document(s) added, 6 document(s) deleted, 1,670 document(s) updated

Modifying a field across all documents each time the uploader runs will obfuscate the statistics for meaningful data changes.

vincerubinetti commented 3 years ago

> I brought up the issue with Chunlei, and he still feels that a date field for public genesets is not that relevant.

The user of the web app will definitely want to know the last update of the built-in genesets.

Also we should stop saying "public" when referring to the built-in genesets because user genesets will be able to be public or private. "built-in" or "curated" would be a better term to avoid confusion.

> Modifying a field across all documents each time the uploader runs will obfuscate the statistics for meaningful data changes.

The intention was not to update all the genesets with a hardcoded date field. Our thought was that either the backend or the frontend could dynamically look up the metadata for the built-in geneset (stored only in one place) and return it with the result.

ravila4 commented 3 years ago

@vincerubinetti I agree on the wording, it does get confusing.

If you need the field for the user interface, I think it would be easier for the frontend to query the field from the metadata endpoint, rather than dynamically adding the field to the returned JSON for each geneset.

Nevertheless, there is another issue: in http://mygeneset.info/v1/metadata we don't have a good date string to use for individual data sources. The closest we have is the build date, which applies to the entire index, or the version strings, which have inconsistent formats:

wikipathways: "20210210", ctd: "March-2-2021-16438M", reactome: "75", msigdb: "7.2", go: "20210201", kegg: "Release 97.0+/02-27, Feb 21", do: "obo-2021-02-24_genemap2-2021-03-01"
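To make the inconsistency concrete, here is a small Python sketch that tries to read each version string above as a compact `YYYYMMDD` date. The helper name `parse_compact_date` is made up for illustration. Only the wikipathways and go strings parse, and the rest would each need their own handling (or carry no date at all, like reactome and msigdb).

```python
from datetime import datetime

# Version strings as listed in the metadata at the time of this comment.
VERSIONS = {
    "wikipathways": "20210210",
    "ctd": "March-2-2021-16438M",
    "reactome": "75",
    "msigdb": "7.2",
    "go": "20210201",
    "kegg": "Release 97.0+/02-27, Feb 21",
    "do": "obo-2021-02-24_genemap2-2021-03-01",
}


def parse_compact_date(version):
    """Return a date if the string is exactly YYYYMMDD, else None."""
    try:
        return datetime.strptime(version, "%Y%m%d").date()
    except ValueError:
        return None


parsed = {source: parse_compact_date(v) for source, v in VERSIONS.items()}

for source, value in parsed.items():
    print(f"{source}: {value if value else 'no usable date'}")
```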

ravila4 commented 3 years ago

On the other hand, just because we re-run the uploader for a particular built-in datasource, it doesn't mean that all the genesets within it were modified, so it's misleading to conflate document datestamps with datasource datestamps. The front-end would need to be clear about the distinction.

ravila4 commented 1 year ago

The last missing action item in this issue was adding a way to query the last updated date for curated genesets. This has been implemented as new upload_date and download_date fields in the https://mygeneset.info/v1/metadata/ endpoint.

    "kegg": {
      "license": "Academic service provider license",
      "code": {
        "folder": "src/plugins/kegg",
        "repo": "https://github.com/biothings/mygeneset.info.git",
        "commit": "02b2a6e",
        "branch": "master",
        "url": "https://github.com/biothings/mygeneset.info/tree/02b2a6e17aaf90b1129fb22afdf403049427aba5/src/plugins/kegg"
      },
      "stats": {
        "kegg": 5618
      },
      "download_date": "2022-08-29T23:15:28.967000",
      "version": "Release 103.0+/08-29, Aug 22",
      "license_url": "https://www.kegg.jp/kegg/legal.html",
      "url": "https://www.kegg.jp",
      "upload_date": "2022-08-30T05:33:06.427000"
    },
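For anyone consuming this from the frontend or a script, here is a minimal Python sketch of reading the two new fields. It assumes the per-source entries (like the `kegg` block above) sit under a top-level `src` key in the metadata response; that nesting is an assumption, so the helper falls back to the top level if `src` is absent. The function names `fetch_metadata` and `source_dates` are hypothetical.

```python
import json
from urllib.request import urlopen

METADATA_URL = "https://mygeneset.info/v1/metadata"


def fetch_metadata(url=METADATA_URL):
    """Download and decode the metadata document (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def source_dates(metadata, source):
    """Return the download/upload dates for one data source.

    Assumption: per-source entries live under metadata["src"]; if that key
    is missing, the top-level dict is searched instead. Missing values come
    back as None.
    """
    sources = metadata.get("src", metadata)
    entry = sources.get(source, {})
    return {
        "download_date": entry.get("download_date"),
        "upload_date": entry.get("upload_date"),
    }
```

Usage would look like `source_dates(fetch_metadata(), "kegg")`. Note that these are datasource-level dates, so they say when the source was last downloaded and uploaded, not whether any individual geneset changed.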