AlexsLemonade / refinebio-examples

Example workflows for refine.bio data
https://www.refine.bio

Question about batch effects in refine.bio datasets #455

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi!

I am trying to understand whether batch effects are corrected for in the refine.bio pipeline.

I downloaded the dataset GSE99039 (microarray) from refine.bio and then examined it using PCA. I noticed that the refine.bio version of the dataset seems to show a clear separation that does not match any of the metadata.

refine.bio PCA image
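For readers who want to reproduce this kind of check, here is a minimal sketch of how first-principal-component scores per sample can be computed. It is pure Python with power iteration so that it runs without dependencies. The function name `first_pc_scores` and the toy matrix are made up for illustration and are not taken from GSE99039 or from refine.bio. In practice one would more likely use a library routine such as scikit-learn's `PCA` or R's `prcomp`.

```python
import math


def first_pc_scores(samples, iterations=200):
    """Return the PC1 score of each sample (rows = samples, columns = genes)."""
    n, p = len(samples), len(samples[0])
    # Center each gene (column) so PCA describes variation around the mean.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Sample-by-sample Gram matrix; its top eigenvector gives the PC1 scores
    # up to a scale factor.
    gram = [
        [sum(a * b for a, b in zip(centered[i], centered[k])) for k in range(n)]
        for i in range(n)
    ]
    # Power iteration from a fixed (deterministic) starting vector.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        new = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            return [0.0] * n  # no variation in the data
        vec = [x / norm for x in new]
    eigenvalue = sum(
        vec[i] * sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)
    )
    return [math.sqrt(max(eigenvalue, 0.0)) * v for v in vec]


# Hypothetical toy data: two samples near 1 and two samples near 5.
toy = [
    [1.0, 1.1, 0.9],
    [1.1, 0.9, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.9, 5.0],
]
scores = first_pc_scores(toy)
print(scores)
```

The sign of a principal component is arbitrary, so only the grouping of samples along PC1 is meaningful, not which group lands on the positive side.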

Hence I would like to ask: (i) where in the pipeline is the (quantile?) normalization performed, and (ii) whether any batch correction is performed as part of the normalized data pipeline.

Thank you.

jaclyn-taroni commented 3 years ago

Hi @kengcher,

Thanks for your questions and for using refine.bio. The dataset you mention (GSE99039) is submitter-processed, which means we were unable to process the data from raw files and instead use whatever values the authors submitted to GEO (in this case, they are reported to be RMA-normalized values). We do quantile normalize submitter-processed data for delivery, but have less control over what happens prior to that step. We do not perform any batch correction (e.g., ComBat).
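As a rough illustration of what quantile normalization does, here is a minimal sketch of the textbook procedure: each sample's values are replaced by the mean, across samples, of the values at the same rank. This is an assumption-laden toy, not refine.bio's actual implementation, which is not described in this thread. The function name `quantile_normalize` and the example values are hypothetical, and ties are simply broken by position.

```python
def quantile_normalize(samples):
    """Quantile normalize a list of samples (each a list of equal length)."""
    n = len(samples[0])
    if any(len(sample) != n for sample in samples):
        raise ValueError("all samples must have the same number of values")
    sorted_samples = [sorted(sample) for sample in samples]
    # Reference distribution: mean across samples of the value at each rank.
    reference = [
        sum(sample[rank] for sample in sorted_samples) / len(samples)
        for rank in range(n)
    ]
    normalized = []
    for sample in samples:
        # Indices of this sample's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical values for two samples measured on three genes.
result = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(result)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step every sample shares the same set of values, so it removes differences in overall distribution between samples. It does not, by itself, remove batch effects that change the relative ordering of genes.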

Looking at the description for this particular experiment, I would want to know if that separation corresponds to idiopathic PD vs. controls, but you do mention that the separation does not match any of the metadata in your post.

Hope this helps! Let me know if you have additional questions.

ghost commented 3 years ago

Hi Jaclyn,

Thanks for getting back!

GEO indicates that the CEL files are available: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99039

How does refine.bio decide whether to process a dataset from CEL files or to use the submitter-processed values?

The two clusters do not match controls vs. disease:

[image: image.png]
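One simple way to make this kind of comparison explicit is to cross-tabulate cluster membership against a metadata column. The sketch below assumes you already have one cluster assignment and one metadata label per sample. The helper `cross_tabulate`, the cluster names, and the labels are all hypothetical and not taken from GSE99039.

```python
from collections import Counter


def cross_tabulate(clusters, labels):
    """Count samples for every (cluster, metadata label) combination."""
    if len(clusters) != len(labels):
        raise ValueError("need exactly one label per sample")
    return Counter(zip(clusters, labels))


# Hypothetical assignments for six samples.
clusters = ["A", "A", "B", "B", "A", "B"]
labels = ["control", "PD", "control", "PD", "PD", "control"]

table = cross_tabulate(clusters, labels)
for (cluster, label), count in sorted(table.items()):
    print(cluster, label, count)
```

If the clusters tracked the metadata, most of the counts would be concentrated in one label per cluster. Counts spread across labels, as in this made-up example, suggest the separation is driven by something else.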

jaclyn-taroni commented 3 years ago

We've looked into why this particular experiment was not processed from raw and believe we may have identified a fix, which we will now need to test. If the fix works, we expect to make the version of this experiment processed from raw available within the next few weeks. We're in the middle of some infrastructure changes for the project, so we appreciate your patience!