SuLab / GeneWikiCentral

GeneWiki Organization
MIT License

Add exrna-disease Links on Wikidata #114

Open thistlew opened 5 years ago

thistlew commented 5 years ago

We can leverage the exRNA Atlas JSON-LD API to make some statements about potential exRNA-disease links as indicated by exRNA Atlas RNA-seq data.

First, all documentation for the API can be found on Stoplight:

https://exrna-atlas.docs.stoplight.io/

The API already provides information about expression (and associated disease) for certain types of RNAs via the census endpoints:

https://exrna-atlas.org/exat/api/docs/census/miRNAs
https://exrna-atlas.org/exat/api/docs/census/piRNAs
https://exrna-atlas.org/exat/api/docs/census/tRNAs (may be less useful since tRNAs are lumped together into a handful of categories)
https://exrna-atlas.org/exat/api/docs/census/snRNAs
https://exrna-atlas.org/exat/api/docs/census/snoRNAs

Each of these routes contains expression data for individual RNAs within the specified RNA type (miRNAs, piRNAs, etc.). Expression is broken down by biofluid. Individual sample information is provided in comma-delimited lists for the following properties:

associatedBiosampleIDs: Atlas biosample ID associated with each sample
rpms: RPM expression in each sample (normalized by number of reads that aligned to reference genome for the sample)
conditions: Health condition associated with each sample

Each of these three lists is ordered in the same way, so the first entry in associatedBiosampleIDs matches up with the first entry in rpms and the first entry in conditions, etc.
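As a rough illustration of how the three parallel lists line up, here is a minimal Python sketch. The field names (`associatedBiosampleIDs`, `rpms`, `conditions`) come from the description above. The shape of the input (a dict whose values are comma-delimited strings), the `Sample` type, the `parse_samples` helper, and the sample IDs and condition names are all assumptions made up for the example, so check them against the actual API response. It also assumes that condition names never contain commas.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    biosample_id: str
    rpm: float
    condition: str


def parse_samples(entry: dict) -> list[Sample]:
    """Zip the three comma-delimited lists into per-sample records.

    Assumes `entry` maps each property name to one comma-delimited
    string and that no individual value contains a comma.
    """
    ids = [s.strip() for s in entry["associatedBiosampleIDs"].split(",")]
    rpms = [float(s) for s in entry["rpms"].split(",")]
    conditions = [s.strip() for s in entry["conditions"].split(",")]
    # The lists are expected to share one ordering, so they must
    # also share one length.
    if not (len(ids) == len(rpms) == len(conditions)):
        raise ValueError("parallel lists differ in length")
    return [Sample(i, r, c) for i, r, c in zip(ids, rpms, conditions)]


# Hypothetical entry; the IDs and condition names are made up.
example = {
    "associatedBiosampleIDs": "BS1,BS2,BS3",
    "rpms": "12.5,0.0,48.2",
    "conditions": "Healthy Control,Condition X,Condition X",
}
samples = parse_samples(example)
```

The length check is there because a silent mismatch would pair RPM values with the wrong condition.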

It should be pretty simple to:

1) Go through all pages for a particular type of RNA (miRNAs, piRNAs, etc.) and record information about expression of each RNA for each condition (probably important to keep track of biofluid as well)
2) Set some kind of threshold for expression (say, 10 RPM in at least 50% of samples associated with the disease)
3) Record any exRNA-disease connections according to this threshold
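The thresholding in steps 2 and 3 could be sketched roughly as follows. This is only one reading of the proposal: it assumes the per-sample data has already been collected into `(rna, biofluid, condition, rpm)` tuples, and it treats "10 RPM" as "at least 10 RPM". The `find_links` function, its parameter names, and the RNA and condition names in the example are hypothetical. Fetching and paging through the census endpoints is left out because it depends on details of the API not covered here.

```python
from collections import defaultdict


def find_links(records, min_rpm=10.0, min_fraction=0.5):
    """Return (rna, biofluid, condition, fraction) for groups passing the threshold.

    `records` is an iterable of (rna, biofluid, condition, rpm) tuples,
    one per sample. A group passes when at least `min_fraction` of its
    samples have rpm >= `min_rpm`.
    """
    groups = defaultdict(list)
    for rna, biofluid, condition, rpm in records:
        groups[(rna, biofluid, condition)].append(rpm)

    links = []
    for key, rpms in groups.items():
        fraction = sum(r >= min_rpm for r in rpms) / len(rpms)
        if fraction >= min_fraction:
            links.append((*key, fraction))
    return sorted(links)


# Made-up records for illustration only.
records = [
    ("miR-A", "plasma", "Condition X", 25.0),
    ("miR-A", "plasma", "Condition X", 3.0),
    ("miR-A", "plasma", "Condition Y", 1.0),
    ("miR-A", "plasma", "Condition Y", 2.0),
    ("miR-B", "saliva", "Condition X", 50.0),
]
links = find_links(records)
```

Grouping by biofluid as well as condition keeps track of the biofluid, as step 1 suggests, so an RNA that is abundant in one biofluid does not produce a link for another.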

Importantly, the ncRNA information provided by the census endpoints is a bit outdated (the census was last run on 2017/10/20, so data from newer samples are missing), but I think there's still a lot of valuable information to mine from the endpoints.