FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

fastq file splitting #460

Closed VikArz02 closed 3 years ago

VikArz02 commented 3 years ago

Hey! When I analyze a whole human methylome with Bismark, I get a low mapping efficiency (18%). I checked the reads with FastQ Screen; everything looks fine and the quality is also good. Trimming was carried out according to the recommendations. I thought there might not be enough system capacity and that this could be causing an error, so I decided to split the R1 and R2 files into 3 GB files. I aligned them, ran the analysis on each, and then combined all the output data. Each file came out at about 16-18%. Is this a valid approach?

FelixKrueger commented 3 years ago

Hi @VikArz02

Splitting files with 18% mapping efficiency into smaller chunks should still give you an overall result with 18% mapping efficiency. If you could drop me an email with a few sample reads (e.g. 100K, gzipped, untrimmed reads), I can take a quick look for you. Please also include the genome of interest, and the sample prep you used. Best, Felix
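
To illustrate why splitting cannot change the overall figure, here is a minimal Python sketch. It assumes mapping efficiency is simply aligned read pairs divided by analysed read pairs, and the chunk counts are made-up numbers for illustration, not values from this dataset.

```python
def pooled_mapping_efficiency(chunks):
    """Return the overall mapping efficiency (in %) across all chunks.

    `chunks` is a list of (analysed, aligned) read-pair counts, one per chunk.
    """
    total_analysed = sum(analysed for analysed, _ in chunks)
    total_aligned = sum(aligned for _, aligned in chunks)
    if total_analysed == 0:
        raise ValueError("no reads analysed")
    return 100.0 * total_aligned / total_analysed


# Hypothetical chunks: (read pairs analysed, read pairs uniquely aligned)
chunks = [(1_000_000, 180_000), (1_000_000, 160_000), (500_000, 85_000)]

per_chunk = [100.0 * aligned / analysed for analysed, aligned in chunks]
pooled = pooled_mapping_efficiency(chunks)

print("per chunk:", per_chunk)
print("pooled:", pooled)
```

The pooled value is a read-count-weighted average of the per-chunk values, so it always lies between the lowest and highest chunk.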

VikArz02 commented 3 years ago

Thank you for your help! Best, Viktoriia

FelixKrueger commented 3 years ago

So, are you going to send over some data? I am sure we could rescue some of it!

VikArz02 commented 3 years ago

I have already sent it

FelixKrueger commented 3 years ago

After taking a quick look, the data seems to be non-directional human data; my guess would be that it was prepared with the Zymo Pico-methyl kit? Because of extensive bias at the 5' end, I ran Trim Galore like this:

trim_galore --paired --clip_r1 15 --clip_r2 15 sample_R1.fastq.gz sample_R2.fastq.gz

followed by:

bismark --genome ../GRCh38/ --non_directional --score_min L,0,-0.4 -1 sample_R1_val_1.fq.gz -2 sample_R2_val_2.fq.gz

This brought the mapping efficiency up to > 51% unique alignments, so quite a nice increase I'd say. Attached is the MultiQC report:

multiqc_report.zip https://github.com/FelixKrueger/Bismark/files/7203921/multiqc_report.zip

I hope this gives you something to work with?
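
For anyone wanting to script these two steps, a rough Python sketch follows. It only assembles the two command lines shown above as argument lists (suitable for subprocess.run) and does not execute anything. The function names are made up for this example, and the _val_1/_val_2 output names are inferred from the commands above for inputs ending in .fastq.gz.

```python
def trim_galore_cmd(r1, r2, clip=15):
    """Build the paired-end Trim Galore call with 5' clipping on both reads."""
    return ["trim_galore", "--paired",
            "--clip_r1", str(clip), "--clip_r2", str(clip), r1, r2]


def validated_name(fastq, mate):
    """Derive the trimmed file name, e.g. sample_R1.fastq.gz -> sample_R1_val_1.fq.gz."""
    suffix = ".fastq.gz"
    if not fastq.endswith(suffix):
        raise ValueError("this sketch only handles names ending in .fastq.gz")
    return f"{fastq[:-len(suffix)]}_val_{mate}.fq.gz"


def bismark_cmd(genome, r1, r2, score_min="L,0,-0.4", non_directional=True):
    """Build the Bismark call on the trimmed files produced by Trim Galore."""
    cmd = ["bismark", "--genome", genome]
    if non_directional:
        cmd.append("--non_directional")
    cmd += ["--score_min", score_min,
            "-1", validated_name(r1, 1), "-2", validated_name(r2, 2)]
    return cmd


r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"
print(" ".join(trim_galore_cmd(r1, r2)))
print(" ".join(bismark_cmd("../GRCh38/", r1, r2)))
```

Keeping non_directional as an explicit argument makes it harder to drop the flag by accident when re-running the alignment.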

VikArz02 commented 3 years ago

Yes, thank you very much for your help

VikArz02 commented 3 years ago

But can I ask, why do we have such a low mapping efficiency if it's the human genome?

FelixKrueger commented 3 years ago

That is a tricky question, and not something I can give you a perfect answer to. Let's phrase it this way: the best mapping efficiencies against the human genome I have seen were in the region of 85-88%, using end-to-end alignment and very good quality, standard, directional 2x100bp data.

Your data isn't that: it is non-directional, with weird biases at the start, and I have no idea how it was generated. PBAT-style data suffers from a range of issues such as chimaeric reads (https://sequencing.qcfail.com/articles/pbat-libraries-may-generate-chimaeric-read-pairs/), 5' biases (https://sequencing.qcfail.com/articles/mispriming-in-pbat-libraries-causes-methylation-bias-and-poor-mapping-efficiencies/), as well as standard paired-end issues (see e.g. here: https://github.com/FelixKrueger/Bismark/blob/master/Docs/FAQ.md#low-mapping-effiency-of-paired-end-bisulfite-seq-sample).

I have just tried a test with different stringencies on a single-end read, and this seems to have quite some impact on both the mapping efficiency and the average methylation levels:

--score_min L,0,-0.2: 49%
--score_min L,0,-0.4: 57%
--score_min L,0,-0.6: 66%

So something in your library preparation might also be introducing errors... If possible I would recommend using a more straightforward (directional) kit, but if you have very low starting material you might be limited in your choices...
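
For context on what those settings mean: as far as I understand it, --score_min is passed on to Bowtie 2, where a value such as L,0,-0.4 describes a linear function of the read length x, f(x) = 0 + (-0.4 * x), giving the lowest alignment score that still counts as valid in end-to-end mode. A rough Python sketch of that arithmetic follows; the helper name and the 100 bp read length are my own choices for illustration.

```python
def min_alignment_score(score_min, read_length):
    """Evaluate a linear minimum-score setting such as 'L,0,-0.4'.

    Returns intercept + slope * read_length. Only the linear ('L')
    function type is handled in this sketch.
    """
    kind, intercept, slope = score_min.split(",")
    if kind != "L":
        raise ValueError("this sketch only handles linear (L) functions")
    return float(intercept) + float(slope) * read_length


# Assumed read length of 100 bp, purely for illustration
for setting in ("L,0,-0.2", "L,0,-0.4", "L,0,-0.6"):
    print(setting, min_alignment_score(setting, 100))
```

A more negative threshold tolerates more mismatches and gaps per read, which fits the rising mapping efficiency in the test above, at the cost of lower stringency.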

VikArz02 commented 3 years ago

Okay, that's a very useful answer. I have confirmed which kit we used: QIAseq Methyl Library (Qiagen). We are analysing HepG2 and aligned it to the human genome. Maybe that is where the problem lies.

FelixKrueger commented 3 years ago

Could be that HepG2 is just a little different from the standard human genome... What does Qiagen say about downstream bioinformatics processing? But yeah, all in all it's not bad!

VikArz02 commented 3 years ago

Felix, hi! I followed your pipeline, but I got only 22.7% unique alignments. You had > 51%. Why can't I reproduce your result? Thanks for your help! With best regards, Viktoriia

FelixKrueger commented 3 years ago

Hmm, what did you do exactly, and were there any error messages? Do you have enough system resources available?

VikArz02 commented 3 years ago

What I did:

  1. trim_galore --paired --clip_r1 15 --clip_r2 15 sample_R1.fastq.gz sample_R2.fastq.gz
  2. bismark --genome /GRCh38/ --score_min L,0,-0.4 -1 sample_R1_val_1.fq.gz -2 sample_R2_val_2.fq.gz

These are the system resources I have available: MEM 157G, threads 32, Swp 8G. And I tried only the files that I sent to you.

The only message that might count as an error, if it can be called one, is this: "Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!) Setting parallelization to single-threaded (default)"

VikArz02 commented 3 years ago

And this is how I prepared the genome file:

~/.../Bismark-0.22.3/bismark_genome_preparation --bowtie2 Homo_sapiens.GRCh38.dna.primary_assembly.fa

FelixKrueger commented 3 years ago

In a note further above I mentioned that the data looks non-directional, but I seem to have omitted that option from the command itself (now fixed).

Just repeat the alignments with --non_directional, then the results should be the same.

Apologies, Felix.
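
Why the flag matters so much here: as the message quoted above says, in directional mode alignments to the strands complementary to the original top or bottom strands are not performed. As far as I understand Bismark's approach, paired-end reads are converted in silico before alignment, read 1 C->T and read 2 G->A for the original strands, plus the opposite combination for the complementary strands when --non_directional is set. The toy Python sketch below only illustrates those conversions; the function names and the 8 bp reads are made up for this example.

```python
def c_to_t(seq):
    """Fully C->T convert a read (in-silico bisulfite conversion)."""
    return seq.upper().replace("C", "T")


def g_to_a(seq):
    """Fully G->A convert a read (C->T conversion seen from the opposite strand)."""
    return seq.upper().replace("G", "A")


def converted_pairs(read1, read2, non_directional=False):
    """Return the converted (label, read 1, read 2) combinations to be aligned."""
    # Directional libraries: only the original top/bottom strand combination.
    pairs = [("OT/OB", c_to_t(read1), g_to_a(read2))]
    if non_directional:
        # Non-directional libraries: also try the complementary strands.
        pairs.append(("CTOT/CTOB", g_to_a(read1), c_to_t(read2)))
    return pairs


# Made-up 8 bp reads, purely for illustration
for label, r1, r2 in converted_pairs("ACGTCCGA", "TCGGACGT", non_directional=True):
    print(label, r1, r2)
```

Read pairs that derive from the complementary strands are only tested in the second combination, so a non-directional library aligned in the default directional mode loses them, which would be consistent with the drop from > 51% to 22.7% reported above.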

VikArz02 commented 3 years ago

Thanks so much, I got the same result!
