JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0
95 stars 31 forks

reference for gene expression #10

Closed elimereu closed 5 years ago

elimereu commented 5 years ago

Hi Jean,

I'm a new user of this amazing tool. I'm preparing data before starting the analysis of CNVs in my CLL sample. I have a couple of questions about the gene expression data I should use to infer the deviations between the normal and cancer samples:

I found a bulk RNA-seq dataset of memory B cells with only one sample that I could use as a reference. However, I'm not sure whether one sample is enough for this task.

As reference normal data, could I use single-cell RNA-seq data (for example, B cells) from a public repository rather than bulk RNA-seq?

What do you think?

Thank you in advance.

Bests, Elisabetta

JEFworks commented 5 years ago

Hi Elisabetta,

Thanks for trying out HoneyBADGER!

A bulk RNA-seq dataset of memory B cells with one sample could be used as a reference. If multiple reference samples are provided, they will be averaged anyway.
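To make the averaging point concrete, here is a minimal sketch in Python (HoneyBADGER itself is an R package, and this is not its code). The function names, the pseudocount, and the toy gene values are all assumptions chosen for illustration. It averages one or more reference samples gene by gene, then expresses a test profile as a log2 ratio against that average.

```python
import math

def average_reference(samples):
    """Average per-gene expression across one or more reference samples.

    `samples` is a list of dicts mapping gene name -> expression value.
    With a single sample, the average is just that sample.
    """
    genes = samples[0].keys()
    return {g: sum(s[g] for s in samples) / len(samples) for g in genes}

def log2_deviation(test, reference, pseudocount=1.0):
    """Log2 ratio of test to reference expression for genes present in both.

    The pseudocount (an assumed value) avoids division by zero.
    """
    return {
        g: math.log2((test[g] + pseudocount) / (reference[g] + pseudocount))
        for g in test
        if g in reference
    }

# Hypothetical toy data: two reference samples and one tumor profile.
ref_a = {"GENE1": 10.0, "GENE2": 4.0}
ref_b = {"GENE1": 14.0, "GENE2": 6.0}
tumor = {"GENE1": 12.0, "GENE2": 2.0}

reference = average_reference([ref_a, ref_b])
deviation = log2_deviation(tumor, reference)
print(reference)   # {'GENE1': 12.0, 'GENE2': 5.0}
print(deviation)   # {'GENE1': 0.0, 'GENE2': -1.0}
```

With only one reference sample, `average_reference([ref_a])` simply returns that sample's values, which is why a single bulk sample can still serve as a reference.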

However, as you’ve seen, there are many B cell single-cell RNA-seq datasets available. We generally recommend matching the reference gene expression dataset to your cancer sample as closely as possible so that deviations in expression in your cancer sample from the reference can be more confidently attributed to CNV changes rather than technical artifacts. In my experience, I obtain the best results if I match by platform - so for a single-cell CLL sample sequenced with 10X, I will generally try to match with a B cell sample sequenced with 10X. However, there are always other caveats, like reference dataset quality, that could make a bulk RNA-seq sample a better reference. For example, the public B cell sample may have been sequenced so shallowly that gene expression is essentially binary.
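One rough way to check whether a candidate single-cell reference is "essentially binary" is to look at how many of its detected counts are exactly 1. The sketch below is in Python, not part of HoneyBADGER, and makes several assumptions: the helper name `binary_fraction`, the genes-by-cells list-of-lists layout, and the toy matrices are all hypothetical, and any cutoff for "too shallow" would be a judgment call.

```python
def binary_fraction(counts):
    """Fraction of detected (nonzero) entries whose raw count is exactly 1.

    `counts` is a list of rows (genes), each a list of per-cell raw counts.
    A value near 1.0 suggests the data mostly records detected/not detected.
    """
    nonzero = [c for row in counts for c in row if c > 0]
    if not nonzero:
        return 0.0
    return sum(1 for c in nonzero if c == 1) / len(nonzero)

# Hypothetical toy matrices (2 genes x 4 cells).
shallow = [[0, 1, 0, 1], [1, 0, 0, 1]]
deeper = [[0, 5, 2, 9], [3, 0, 1, 7]]

print(binary_fraction(shallow))  # 1.0
print(binary_fraction(deeper))   # about 0.17
```

A reference that scores close to 1.0 on a check like this carries little quantitative expression information, which is the situation where a bulk sample might be the better choice.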

It’s also worth keeping in mind the size of the CNVs you’re trying to identify. Sensitivity to the reference will be less prominent for larger CNVs (chromosome arm or whole chromosome) that span more genes, so the reference choice may not matter as much. In contrast, common CLL deletions like del(11q), del(13q), and so forth are often quite small and will likely require leveraging allele information in addition to an appropriate expression reference for confident inference. Do note that the allele-based model is also much more sensitive at identifying deletions than the expression-based model alone.
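A back-of-the-envelope way to see why CNV size matters: if each gene's log2 expression ratio is treated as the true shift plus independent noise, the average over n genes has a standard error that shrinks as 1/sqrt(n). The Python sketch below illustrates only that simplification. It is not HoneyBADGER's HMM or Bayesian model, and the function name, the independence assumption, and the example numbers are all hypothetical.

```python
import math

def detection_z(shift, noise_sd, n_genes):
    """Approximate z-score for the mean log2 ratio over a region.

    Assumes each gene's log2 ratio is the true `shift` plus independent
    noise with standard deviation `noise_sd` (a strong simplification).
    """
    standard_error = noise_sd / math.sqrt(n_genes)
    return abs(shift) / standard_error

# Hypothetical numbers: an idealized single-copy loss (log2(1/2) = -1)
# with per-gene noise of 2 log2 units.
small_region = detection_z(shift=-1.0, noise_sd=2.0, n_genes=16)
arm_level = detection_z(shift=-1.0, noise_sd=2.0, n_genes=400)
print(small_region, arm_level)  # roughly 2 and 10
```

Under these assumptions, the same per-gene shift stands out about five times more clearly across 400 genes than across 16, so reference-driven noise matters less for arm-level events and more for focal deletions.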

Hope that helps!

Best, Jean

elimereu commented 5 years ago

Thanks a lot for all these helpful details, Jean! It’s much clearer now what problems I could face when using bulk or single-cell data as a reference.

Thanks,

Elisabetta
