AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Link alternate accession IDs with experiments #2186

Closed dvenprasad closed 4 years ago

dvenprasad commented 4 years ago

Context

Someone requested experiment GSE130226, which exists on refine.bio under a different accession, SRP193559. It did not show up in a search for GSE130226 because the two accessions were not linked.

Problem or idea

We need to link experiments with any other accession IDs they may have so that search can find them.

Solution or next step

Tagging @kurtwheeler for his thoughts.

Also, if we do link alternate accessions, it would be good to show them on the experiment page and/or the result cards.

kurtwheeler commented 4 years ago

OK, so I think there are several things that need to be done here:

  1. We need to extend the SRA surveyor to get alternate accessions for RNASeq experiments. It looks like ENA, which we actually use for SRA metadata, doesn't have the alternate accessions. However, SRA does: https://www.ncbi.nlm.nih.gov/sra/SRX5727074[accn], as does GEO itself.
  2. Extending the surveyor will fix this for all future experiments, but we don't want to re-survey all our existing experiments. Therefore we should make a script to just get the alternate accession and set it for all RNASeq experiments.
  3. As Deepa mentioned we should show them on the experiment page.
  4. We should make sure that our search is using alternate accessions. GSE44094's alternate_accession is E-GEOD-44094 but it doesn't turn up if I search for it: https://www.refine.bio/search?q=E-GEOD-44094. (We already store alternate accessions for microarray data so we can do this.)
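Item 4 could be sketched roughly as follows. This is only an in-memory illustration, not refine.bio's actual search code: the `Experiment` class, its field names, and `search_by_accession` are all hypothetical stand-ins, and the real search backend would need the equivalent change in its index.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Experiment:
    # Hypothetical, simplified stand-in for an experiment record.
    accession_code: str
    alternate_accession_code: Optional[str] = None


def search_by_accession(experiments: List[Experiment], query: str) -> List[Experiment]:
    """Return experiments whose primary OR alternate accession equals the query."""
    needle = query.strip().upper()
    if not needle:
        return []
    matches = []
    for experiment in experiments:
        candidates = {experiment.accession_code.upper()}
        if experiment.alternate_accession_code:
            candidates.add(experiment.alternate_accession_code.upper())
        if needle in candidates:
            matches.append(experiment)
    return matches


# Illustrative records based on the accessions mentioned in this issue.
experiments = [
    Experiment("SRP193559", "GSE130226"),
    Experiment("GSE44094", "E-GEOD-44094"),
]
```

With both fields checked, a query for `GSE130226` would return the record stored under `SRP193559`, and a query for `E-GEOD-44094` would return `GSE44094`.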

I think that we should make this ticket deal with items 1 and 2. I'll make a frontend ticket for 3 and another backend ticket for 4.
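For item 2, a one-off backfill might look something like the sketch below. Everything here is an assumption for illustration: experiments are modeled as a plain dict of accession to alternate accession, and `lookup` is a hypothetical callable that would wrap whatever SRA query the extended surveyor ends up using. No real refine.bio or NCBI API is shown.

```python
from typing import Callable, Dict, Optional


def backfill_alternate_accessions(
    experiments: Dict[str, Optional[str]],
    lookup: Callable[[str], Optional[str]],
) -> int:
    """Fill in missing alternate accessions in place; return how many were set."""
    updated = 0
    # Copy the items so the dict can be updated safely while looping.
    for accession, alternate in list(experiments.items()):
        if alternate is not None:
            continue  # Already linked, so nothing to re-survey.
        found = lookup(accession)  # Hypothetical: would query SRA here.
        if found:
            experiments[accession] = found
            updated += 1
    return updated


# Stubbed lookup; "SRP000001" is a made-up placeholder with no known alternate.
known_links = {"SRP193559": "GSE130226"}
experiments = {
    "SRP193559": None,
    "SRP000001": None,
    "GSE44094": "E-GEOD-44094",
}
updated_count = backfill_alternate_accessions(experiments, known_links.get)
```

Only records without an alternate accession are touched, which avoids re-surveying everything else.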

cgreene commented 4 years ago

> Extending the surveyor will fix this for all future experiments, but we don't want to re-survey all our existing experiments. Therefore we should make a script to just get the alternate accession and set it for all RNASeq experiments.

We should probably have a mode of running a surveyor that can refresh metadata. We would want this anyway if we ever improve/adjust the metadata processing code.
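One possible shape for such a refresh mode, as a rough sketch only: the `survey` function, its `refresh` flag, the dict-based `store`, and `fetch_metadata` are all hypothetical and do not reflect how the existing surveyors are structured.

```python
from typing import Callable, Dict

Metadata = Dict[str, str]


def survey(
    accession: str,
    store: Dict[str, Metadata],
    fetch_metadata: Callable[[str], Metadata],
    refresh: bool = False,
) -> str:
    """Survey one accession; return "created", "refreshed", or "skipped"."""
    if accession not in store:
        # New experiment: normal surveying path.
        store[accession] = fetch_metadata(accession)
        return "created"
    if not refresh:
        # Existing experiment and no refresh requested: leave it alone.
        return "skipped"
    # Refresh mode: re-fetch metadata and merge it over the stored values,
    # without treating the experiment as new.
    store[accession].update(fetch_metadata(accession))
    return "refreshed"
```

Under this design, improving the metadata processing code would only require rerunning the surveyor with `refresh=True` rather than writing a separate script each time.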