al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html

[question] How to use a multisample VCF with MethylKit #126

Closed Camethyleabergen closed 6 years ago

Camethyleabergen commented 6 years ago

Hi,

This is not really an issue; I'm looking for a good practice.

I have a VCF file for some samples, and I would like to use the option methylKit offers to filter out C->T mutations. My problem is that my VCF is multisample (and, long story short, some samples in my VCF are not in the methylKit object).

I tried to use the VariantAnnotation package to convert the VCF file into a GRanges object, but it seems that the multisample information is not taken into account there.

Do you have a recommended practice for this?

Best,

al2na commented 6 years ago

You cannot filter C->T mutations from a VCF file using methylKit.


Camethyleabergen commented 6 years ago

Hi Altuna,

Thanks for your quick answer. I guess I indeed can't do it directly and have to convert the VCF to a GRanges object; my question was actually how to do so while keeping the multisample information. If your point is that you can't filter out potential C->T mutations, I'm a bit concerned, as that is the purpose of the "Filtering CpGs" paragraph in the methylKit tutorial you provided.

Best,

al2na commented 6 years ago

You can filter CpGs based on coverage, other quantitative features and location as shown in the tutorial.

You can’t read VCF files with methylKit; you can read them as GRanges and do whatever filtering GRanges objects allow. Your question seems to have nothing to do with methylKit; it is a general question about how to filter VCF files.


Camethyleabergen commented 6 years ago

Thanks for your answer. It seems I completely misunderstood this part of your tutorial, then:

Now, let’s assume we know the locations of C->T mutations. These locations should be removed from the analysis as they do not represent bisulfite treatment associated conversions. Mutation locations are stored in a GRanges object, and we can use that to remove CpGs overlapping with mutations. In order to do overlap operation, we will convert the methylKit object to a GRanges object and do the filtering with %over% function within [ ]. The returned object will still be a methylKit object.

How can I know the locations of the C->T mutations if they don't come from a VCF file in the first place?

al2na commented 6 years ago

Now I understand your question: you need to extract the locations of C->T mutations from the VCF and use those to filter methylKit objects as shown in the tutorial.

Check this thread: https://support.bioconductor.org/p/94451/

You need to use other packages to do what you want. The VariantAnnotation package could also help.
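
As a rough illustration of that extraction step, here is a minimal sketch using VariantAnnotation. It is not taken from the thread: the toy VCF, the sample names S1 to S3, the `methyl_samples` vector, and the positions are all made up, and the sketch only handles biallelic sites. It keeps the samples that also exist in the methylKit object, then keeps C->T sites (which appear as G->A for a cytosine on the reverse strand) carried by at least one of those samples.

```r
library(VariantAnnotation)

# Toy multisample VCF written to a temporary file so the sketch is
# self-contained. Sample names and positions are hypothetical.
row <- function(...) paste(c(...), collapse = "\t")
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
  row("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
      "FORMAT", "S1", "S2", "S3"),
  row("chr21", "9853296", ".", "C", "T", ".", "PASS", ".", "GT", "0/1", "0/0", "0/0"),
  row("chr21", "9853326", ".", "G", "A", ".", "PASS", ".", "GT", "0/0", "0/0", "1/1"),
  row("chr21", "9860126", ".", "A", "G", ".", "PASS", ".", "GT", "0/1", "0/1", "0/0")
)
vcf_file <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_file)

vcf <- readVcf(vcf_file, "hg18")

# Keep only the samples that are also in the methylKit object
# (hypothetical list; S3 is deliberately left out).
methyl_samples <- c("S1", "S2")
vcf <- vcf[, intersect(colnames(vcf), methyl_samples)]

# Restrict to biallelic sites so each row has exactly one ALT allele.
vcf <- vcf[elementNROWS(alt(vcf)) == 1L, ]
ref_base <- as.character(ref(vcf))
alt_base <- as.character(unlist(alt(vcf)))

# C->T on the forward strand, or G->A (a C->T on the reverse strand).
is_ct <- (ref_base == "C" & alt_base == "T") |
         (ref_base == "G" & alt_base == "A")

# A sample carries the variant if its genotype string contains allele 1.
gt <- geno(vcf)$GT
carried <- rowSums(matrix(grepl("1", gt), nrow = nrow(gt))) > 0

# Mutation positions as a plain GRanges, ready for overlap filtering.
mut <- rowRanges(vcf)[is_ct & carried]
mcols(mut) <- NULL
```

Chromosome naming (for example `chr21` versus `21`) has to match between the VCF and the methylKit object for the later overlap step to find anything.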


Camethyleabergen commented 6 years ago

OK :) All fine thanks !

Camethyleabergen commented 6 years ago

One more question, if I may: how can I apply that correction, from the generated GRanges list of mutation positions, to a united methylKit object containing different samples? Does that information have to go in the extra columns of my GRanges object? Or should it be done before the unite step?

Should I use a separate VCF per sample, or a specific format?

thanks ++

al2na commented 6 years ago

I would apply unite first and then drop every position that has a C-T mutation in any of the samples
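
A minimal sketch of that order of operations, assuming the example files shipped with methylKit and two made-up mutation positions (in practice `mut` would hold the C->T positions extracted from the VCF). The filtering line follows the `%over%` pattern described in the tutorial paragraph quoted earlier in the thread.

```r
library(methylKit)
library(GenomicRanges)

# Example files shipped with methylKit; sample IDs and treatment vector
# follow the package vignette.
file.list <- list(
  system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
  system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
  system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
  system.file("extdata", "control2.myCpG.txt", package = "methylKit")
)
myobj <- methRead(file.list,
                  sample.id = list("test1", "test2", "ctrl1", "ctrl2"),
                  assembly = "hg18",
                  treatment = c(1, 1, 0, 0),
                  context = "CpG")

# Unite first: by default only bases covered in all samples are kept.
meth <- unite(myobj, destrand = FALSE)

# Hypothetical positions with a C->T mutation in any of the samples.
mut <- GRanges(seqnames = "chr21",
               ranges = IRanges(start = c(9853296, 9853326), width = 1))

# Drop every united position that overlaps a mutation.
sub.meth <- meth[!as(meth, "GRanges") %over% mut, ]
```

Because `mut` has no strand set, the overlap test matches cytosines on either strand at those coordinates.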


Camethyleabergen commented 6 years ago

Thanks Altuna.

I have two cases in mind where this seems tricky to me. I'll use SNVs rather than SNPs to explain:

- In the mtDNA (I currently have a project on it, though I know the material itself is tricky), I have around 20 samples for now, and that number will grow soon. So far, when filtering the VCF to keep only the C->T mutations, I get about 280 C positions (both strands) that have an SNV. Is it really realistic to skip the analysis of the 19 other samples if only one has an SNV at a C->T position? I'm open to discussion and debate ;) but I would think that an analysis of 9 versus 10 samples (if I have 2 groups) is still relevant.

- In nuclear DNA, I don't have much experience so far, but I'm thinking ahead a bit. Let's say we have 50 samples in 2 groups. If an SNV or SNP has a frequency of 0.01 or more, I would have about one chance in two of discarding a position. How can I be confident in the analysis?

Would it be relevant to treat and filter methylKit objects before they are united?

Best,

al2na commented 6 years ago

With the default settings, if a CpG is not covered in all samples, you will not see that CpG in the methylBase object. You can change that behavior with the min.per.group argument (or something like that) in unite; then it might make sense to filter before unite.
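
One way the filter-before-unite variant might look is sketched below. It rests on several assumptions: the per-sample mutation list `mut.by.sample` and its single position are made up, the example files are the ones shipped with methylKit, and the filtered samples are reassembled with `new("methylRawList", ...)`, which relies on the class layout rather than on a documented constructor. `min.per.group` is passed as an integer.

```r
library(methylKit)
library(GenomicRanges)

ids <- c("test1", "test2", "ctrl1", "ctrl2")
file.list <- list(
  system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
  system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
  system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
  system.file("extdata", "control2.myCpG.txt", package = "methylKit")
)
myobj <- methRead(file.list, sample.id = as.list(ids), assembly = "hg18",
                  treatment = c(1, 1, 0, 0), context = "CpG")

# Hypothetical per-sample mutation positions; only test1 carries one here.
mut.by.sample <- list(
  test1 = GRanges("chr21", IRanges(9853296, width = 1))
)

# Remove mutated positions from each sample separately.
filtered <- lapply(seq_along(myobj), function(i) {
  x <- myobj[[i]]
  mut <- mut.by.sample[[ids[i]]]
  if (is.null(mut)) return(x)  # no known mutations for this sample
  x[!as(x, "GRanges") %over% mut, ]
})
myobj.filt <- new("methylRawList", filtered, treatment = myobj@treatment)

# Default: a base must be covered in every sample to be kept.
meth.strict <- unite(myobj.filt)

# Relaxed: a base is kept if at least one sample per group covers it,
# so a position removed from one sample can survive for the others.
meth.relaxed <- unite(myobj.filt, min.per.group = 1L)
```

With the relaxed setting, a position dropped from a single sample would show up with missing values for that sample instead of disappearing for everyone, which is closer to the 9-versus-10 scenario raised in the question.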

Camethyleabergen commented 6 years ago

Thanks Altuna :)