Closed vincerubinetti closed 1 year ago
So we'd end up with this:
{
"_id",
"_score",
"name", // NEW
"description", // NEW
"date", // NEW
"author" / "source", // NEW
"genes": [],
"is_public",
"taxid"
}
@dongbohu Here are the values we will use as name and description for each data source:
source: ctd
name: "${ctd.chemical_name} interactions"
description: "Chemical-gene interactions of ${ctd.chemical_name} in ${species_name}."
source: do
_id: ${do.id}
name: (The value after ":" in the current _id field.)
description: ${do.abstract}
source: go
name:
description: ${definition}
source: kegg
name: ${kegg.name}[0]
description:
source: msigdb
name:
description:
source: reactome
name: ${reactome.geneset_name}
description:
source: wikipathways
name: ${wikipathways.pathway_name}
description:
I think we can leave description blank if we cannot obtain or generate one for a datasource. We should also delete the duplicate fields.
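A minimal sketch of how the mapping above could be applied in a parser, covering only the sources whose values are spelled out unambiguously in the list. The function name build_name_description and the species_name argument are hypothetical; the document fields (ctd.chemical_name, do.abstract, reactome.geneset_name, wikipathways.pathway_name) are the ones named above, and the description is left blank where none is listed.

```python
def build_name_description(source, doc, species_name=""):
    """Return (name, description) for a built-in geneset document.

    Hypothetical helper: only the sources with explicit values in the
    list above are handled; a missing description becomes "".
    """
    fields = doc.get(source, {})
    if source == "ctd":
        chemical = fields["chemical_name"]
        name = f"{chemical} interactions"
        description = (
            f"Chemical-gene interactions of {chemical} in {species_name}."
        )
        return name, description
    if source == "do":
        # The name is the value after ":" in the current _id field.
        _, _, name = doc["_id"].partition(":")
        return name, fields.get("abstract", "")
    if source == "reactome":
        return fields["geneset_name"], ""
    if source == "wikipathways":
        return fields["pathway_name"], ""
    raise ValueError(f"no name/description rule sketched for {source!r}")


# Usage with a made-up document (values are illustrative only).
doc = {"_id": "example", "ctd": {"chemical_name": "aspirin"}}
name, description = build_name_description("ctd", doc, "Homo sapiens")
print(name)
print(description)
```

The sources with an empty name in the list (go, msigdb) and the kegg rule are deliberately not sketched here, since their values are still open.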
@ravila4 Thanks. I will add the new fields for kegg, do and ctd, and you can take care of the others.
Great. I'm still undecided about the date field. I'm not sure if we have a good value to add to it.
date can be either the date when the geneset is generated by the parser, or the date when the data source is updated. I prefer the former, because if we change the parser code, the geneset may change even if the data source has not been updated at all.
I brought up the issue with Chunlei, and he still feels that a date field for public genesets is not that relevant.
Another reason is that for each release we get a release notes doc like this:
Build version: '20210302'
=========================
Previous build version: '20210211'
Generated on: 2021-03-02 at 10:25:00
+--------------------+------------------------------------+------------------------------------+-----------------+---------------+
| Updated datasource | prev. release | new release | prev. # of docs | new # of docs |
+--------------------+------------------------------------+------------------------------------+-----------------+---------------+
| ctd | - | March-2-2021-16438M | - | 22,959 |
| do | obo-2021-01-28_genemap2-2021-02-10 | obo-2021-02-24_genemap2-2021-03-01 | 4,254 | 4,258 |
| kegg | Release 97.0+/02-06, Feb 21 | Release 97.0+/02-27, Feb 21 | 5,453 | 5,456 |
| wikipathways | 20201210 | 20210210 | 1,693 | 1,723 |
+--------------------+------------------------------------+------------------------------------+-----------------+---------------+
New datasource(s): ctd
+----------------+----------+---------+
| Updated stats. | previous | new |
+----------------+----------+---------+
| total | 120,332 | 143,328 |
+----------------+----------+---------+
New field(s): ctd
Overall, 143,328 documents in this release
23,002 document(s) added, 6 document(s) deleted, 1,670 document(s) updated
Modifying a field across all documents each time the uploader runs will obfuscate the statistics for meaningful data changes.
> I brought up the issue with Chunlei, and he still feels that a date field for public genesets is not that relevant.
The user of the web app will definitely want to know the last update of the built-in genesets.
Also we should stop saying "public" when referring to the built-in genesets because user genesets will be able to be public or private. "built-in" or "curated" would be a better term to avoid confusion.
> Modifying a field across all documents each time the uploader runs will obfuscate the statistics for meaningful data changes.
The intention was not to update all the genesets with a hardcoded date field. Our thought was that either the backend or the frontend could dynamically look up the metadata for the built-in geneset (stored only in one place) and return it with the result.
@vincerubinetti I agree on the wording, it does get confusing.
If you need the field for the user interface, I think it would be easier for the frontend to query the field from the metadata endpoint, rather than dynamically adding the field to the returned JSON for each geneset.
Nevertheless, there is another issue: in http://mygeneset.info/v1/metadata we don't have a good date string to use for individual data sources. The closest we have is either the build date, which applies to the entire index, or the version strings, which have inconsistent formats:
wikipathways: "20210210", ctd: "March-2-2021-16438M", reactome: "75", msigdb: "7.2", go: "20210201", kegg: "Release 97.0+/02-27, Feb 21", do: "obo-2021-02-24_genemap2-2021-03-01"
On the other hand, just because we re-run the uploader for a particular built-in datasource, it doesn't mean that all the genesets within it were modified, so it's misleading to conflate document datestamps with datasource datestamps. The front-end would need to be clear about the distinction.
The last missing action item in this issue was adding a way to query the last updated date for curated genesets. This has been implemented as new upload_date and download_date fields in the https://mygeneset.info/v1/metadata/ endpoint.
"kegg": {
"license": "Academic service provider license",
"code": {
"folder": "src/plugins/kegg",
"repo": "https://github.com/biothings/mygeneset.info.git",
"commit": "02b2a6e",
"branch": "master",
"url": "https://github.com/biothings/mygeneset.info/tree/02b2a6e17aaf90b1129fb22afdf403049427aba5/src/plugins/kegg"
},
"stats": {
"kegg": 5618
},
"download_date": "2022-08-29T23:15:28.967000",
"version": "Release 103.0+/08-29, Aug 22",
"license_url": "https://www.kegg.jp/kegg/legal.html",
"url": "https://www.kegg.jp",
"upload_date": "2022-08-30T05:33:06.427000"
},
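For anyone consuming these fields, here is a rough sketch of reading the two dates for one data source. It assumes you have already fetched the metadata JSON and extracted the object that maps source names to entries like the kegg one above; where that object sits inside the full response is not shown here, so that step is left out. The function name source_dates is hypothetical.

```python
from datetime import datetime


def source_dates(sources, name):
    """Return the parsed download/upload dates for one data source.

    `sources` is assumed to be the mapping of source name -> entry,
    shaped like the "kegg" entry above. Keys that are absent are skipped.
    """
    entry = sources[name]
    return {
        key: datetime.fromisoformat(entry[key])
        for key in ("download_date", "upload_date")
        if key in entry
    }


# Sample data copied from the kegg entry above.
sample = {
    "kegg": {
        "version": "Release 103.0+/08-29, Aug 22",
        "download_date": "2022-08-29T23:15:28.967000",
        "upload_date": "2022-08-30T05:33:06.427000",
    }
}

dates = source_dates(sample, "kegg")
print(dates["upload_date"].date())  # 2022-08-30
```

Unlike the version strings, both fields are ISO 8601 timestamps, so they parse the same way for every source.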
Eventually users will be generating their own name, description, date, and author fields.
But for the built-in genesets, we currently don't have these fields. Can we possibly add them? Are description and date metadata available for KEGG and the like? As for the author, it could just be e.g. "KEGG", making it more of a generic creator/source field.