bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Feature request: alternative library prep kit support methylation pipeline #3203

Open lardenoije opened 4 years ago

lardenoije commented 4 years ago

Hi,

It is currently stated in the documentation that "Right now we only support the TruSeq Methyl Capture EPIC kit" for the methylation pipeline (https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#methylation). Would it be possible to add support for other library prep kits as well? We currently use the MGIEasy Whole Genome Bisulfite Sequencing Library Prep Kit (https://en.mgitech.cn/products/reagents_info/13/).

This will be my first WGBS dataset, so I am not sure how important it is to have support for this library prep kit, or whether I can safely run the pipeline as is. Although I am not sure how, I am happy to help in any way I can with adding support for this kit.

roryk commented 4 years ago

Hi Roy,

Thanks! We're happy to add support for new kits. Do you know the differences between the two kits? If we can figure out where they differ in terms of the NGS data, we can either add support for the kit or fix the docs to say we support both kits.

roryk commented 4 years ago

@jnhutchinson do these kits look different to you?

lardenoije commented 4 years ago

Hi Rory,

Thanks, I'm not sure what kit-specific information is used in the current pipeline. Maybe this concerns the adapter sequences? If I know that, I can try to figure out whether it differs between the kits.

roryk commented 4 years ago

Thanks Roy! There are a few things that could be different: the adapter sequences used, where and how much to trim the reads, and whether the kit is stranded or not, since the reads could come from 4 different strands. I tagged @jnhutchinson, who knows more about methylation than I do; hopefully he will have some thoughts as well. Thanks for helping out, teaming up makes everything so much easier.

jnhutchinson commented 4 years ago

Hi Rory, assuming the libraries are directional (a decent explanation of that is given by Gaurav Kadu on this ResearchGate post), the main issue is whether your data is paired-end and how deeply you would want to trim the reads past the adapters. The current setup is hard-coded for the TruSeq Methyl Capture method, with paired-end reads and 8 bp of extra trimming at the 5' and 3' ends of each read. @roryk, provided I'm interpreting the molecular biology of the MGIEasy kit correctly, the main changes would be to a) add parameters for paired versus single end, which would affect code for both trim_galore and bismark, and b) set a trimming depth parameter, which would only affect the trim_galore code. I have no idea how easy it would be to implement that, but I'm happy to help out if you would like to implement it. I think it would be a great step towards making the bisulfite pipeline more universally useful.
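As a rough illustration of the two proposed parameters, here is a minimal Python sketch that assembles the command lines. It is not bcbio code: the function names and the `paired` and `trim_depth` parameters are hypothetical. The flags are standard trim_galore and Bismark options (`--paired`, `--clip_R1`, `--clip_R2`, `--three_prime_clip_R1`, `--three_prime_clip_R2`, `--genome`, `-1`, `-2`). Applying the same depth to both ends of each read is an assumption that mirrors the 8 bp 5'/3' trimming described above.

```python
from typing import List, Sequence


def _check_inputs(fastqs: Sequence[str], paired: bool) -> None:
    """Require two FASTQ files for paired-end data and one for single-end."""
    expected = 2 if paired else 1
    if len(fastqs) != expected:
        raise ValueError(f"expected {expected} FASTQ file(s), got {len(fastqs)}")


def build_trim_galore_cmd(
    fastqs: Sequence[str], paired: bool, trim_depth: int = 0
) -> List[str]:
    """Assemble a trim_galore command (hypothetical helper, not bcbio's API)."""
    _check_inputs(fastqs, paired)
    cmd = ["trim_galore"]
    if paired:
        cmd.append("--paired")
    if trim_depth > 0:
        depth = str(trim_depth)
        # Extra trimming past the adapters at the 5' and 3' ends of read 1.
        cmd += ["--clip_R1", depth, "--three_prime_clip_R1", depth]
        if paired:
            # The read 2 options only apply to paired-end data.
            cmd += ["--clip_R2", depth, "--three_prime_clip_R2", depth]
    cmd += list(fastqs)
    return cmd


def build_bismark_cmd(
    genome_dir: str, fastqs: Sequence[str], paired: bool
) -> List[str]:
    """Assemble a Bismark alignment command (hypothetical helper)."""
    _check_inputs(fastqs, paired)
    cmd = ["bismark", "--genome", genome_dir]
    if paired:
        cmd += ["-1", fastqs[0], "-2", fastqs[1]]
    else:
        cmd.append(fastqs[0])
    return cmd
```

With `paired=True` and `trim_depth=8` the first helper reproduces the settings described as currently hard-coded; the trimming depth only touches the trim_galore command, while the paired/single switch touches both.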

lardenoije commented 4 years ago

@jnhutchinson thanks for your input! I have asked the company about the adapter sequences and whether the kit is stranded (= directional?) and will report back once I get an answer. From the description of the kit on the website I gather that the data generated must be paired-end (under product specification it says "Read length PE100/PE150"), and this is also reflected by having 2 fastq files per sample.

roryk commented 4 years ago

Thanks Roy for doing the legwork! It is really helpful.

roryk commented 4 years ago

If the company has an example analysis script of how they process the data that would be super helpful as well since we could look and see how to do it.

jnhutchinson commented 4 years ago

Agree with Rory, an example analysis (even descriptive) would be helpful

lardenoije commented 4 years ago

I got a reply from the company with the adapter sequences:

adapter3 = AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA
adapter5 = AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG

They also said that the kit is non-stranded. I will ask if they have an example analysis script.
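One way to collect the per-kit details gathered so far in this thread is a small preset table. The sketch below is hypothetical: the `KitPreset` class, its field names, and the kit keys are illustrative and not part of bcbio. Values not stated in the thread are left as `None` rather than guessed, and how `adapter3`/`adapter5` map onto trimmer options is deliberately left open.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class KitPreset:
    """Kit-specific settings (hypothetical structure, not bcbio's)."""
    paired: bool
    trim_depth: Optional[int]   # extra bp trimmed at each read end; None = not stated
    stranded: Optional[bool]    # None = not stated in the thread
    adapter3: Optional[str] = None
    adapter5: Optional[str] = None


KIT_PRESETS: Dict[str, KitPreset] = {
    # Currently hard-coded behaviour as described in the thread.
    "truseq_methyl_capture_epic": KitPreset(
        paired=True, trim_depth=8, stranded=None
    ),
    # Values reported by the vendor; trimming depth is still unknown.
    "mgieasy_wgbs": KitPreset(
        paired=True,
        trim_depth=None,
        stranded=False,
        adapter3="AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA",
        adapter5="AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG",
    ),
}


def get_kit(name: str) -> KitPreset:
    """Look up a preset by key, failing loudly on unknown kits."""
    key = name.strip().lower()
    if key not in KIT_PRESETS:
        known = ", ".join(sorted(KIT_PRESETS))
        raise ValueError(f"unknown kit {name!r}; known kits: {known}")
    return KIT_PRESETS[key]
```

Failing on an unknown kit name, instead of silently falling back to the TruSeq settings, avoids applying the wrong adapters or trimming depth to a new library type.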

lardenoije commented 4 years ago

The company does not have an example script or instructions for how to analyse the data, stating that the analysis should not be much different from other fastq WGBS datasets.

roryk commented 4 years ago

Thanks Roy! We'll be chatting about this on Monday and we should be able to put together support for this a few days after that. Thanks so much for doing the legwork.