AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Detect Affymetrix HGU133A-HGU133B "pairs" in an expression data-driven manner (automatically) #144

Open jaclyn-taroni opened 6 years ago

jaclyn-taroni commented 6 years ago

New Issue Checklist

Context

Explain the conditions that led you to write this issue. If you are proposing a new feature, the context should be your user story. If this relates to a certain page or API endpoint, provide a link.

Older Affymetrix chips -- hgu133a and hgu133b -- each measured a smaller number of probes/genes than later platforms. Sometimes samples were run on both hgu133a and hgu133b to get better coverage of the transcriptome (roughly equivalent to running a sample on the later hgu133plus2 platform), resulting in paired hgu133a and hgu133b samples.

Problem or idea

The context should lead to something, an idea or a problem that you’re facing.

It would be great to automatically detect these paired samples and combine the information from both chips. Until we do, these samples will have a considerable number of missing values compared to samples from their genome-wide successors. Some kind of imputation approach may be the path forward. If we use a neural network-based approach to reconstruct the other "half" of a sample, we may find that the true paired sample is the one most highly correlated with our reconstructed values.
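To make the matching step concrete, here is a rough sketch of how the correlation check might look once we have reconstructed values. It assumes the imputation model already exists and has produced a matrix of reconstructed hgu133b values (one row per hgu133a sample); the function names `row_correlations` and `match_pairs` and the matrix layout are hypothetical, not anything in refinebio, and the model itself is out of scope here.

```python
import numpy as np


def row_correlations(recon, observed):
    """Pearson correlation between every row of recon and every row of observed.

    Both inputs are (samples x probes) arrays over the same probes.
    Rows with zero variance would produce NaN and should be filtered first.
    """
    recon_c = recon - recon.mean(axis=1, keepdims=True)
    obs_c = observed - observed.mean(axis=1, keepdims=True)
    recon_c = recon_c / np.linalg.norm(recon_c, axis=1, keepdims=True)
    obs_c = obs_c / np.linalg.norm(obs_c, axis=1, keepdims=True)
    return recon_c @ obs_c.T


def match_pairs(recon, observed):
    """For each reconstructed sample, return the index of the most highly
    correlated observed sample and the correlation itself."""
    corr = row_correlations(recon, observed)
    best = corr.argmax(axis=1)
    return best, corr[np.arange(len(best)), best]


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): noisy "reconstructions" of
    # hgu133b profiles, with the observed hgu133b samples in shuffled order.
    rng = np.random.default_rng(0)
    true_b = rng.normal(size=(5, 200))
    recon_b = true_b + 0.3 * rng.normal(size=true_b.shape)
    order = rng.permutation(5)
    observed_b = true_b[order]

    best, scores = match_pairs(recon_b, observed_b)
    print(best, np.round(scores, 2))
```

Taking the per-row argmax is the simplest possible rule; it does not stop two hgu133a samples from claiming the same hgu133b sample, so a one-to-one assignment or a minimum-correlation cutoff may be needed on real data.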

Solution or next step

You can tag others or simply leave it for further investigation, but you must propose a next step towards solving the issue.

It's fine to run SCAN on these samples as planned. We should revisit this once we have processed enough samples to design experiments around it.

Here's an example experiment from ArrayExpress: https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-11908/

jaclyn-taroni commented 6 years ago

Relevant to imputation strategies more generally: https://github.com/greenelab/deep-review/issues/910