MarWoes / wg-blimp

wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data
GNU Affero General Public License v3.0
27 stars 12 forks source link

Issue with bwameth.py errno 32 #17

Closed JakeLehle closed 2 years ago

JakeLehle commented 2 years ago

Hello,

I've been trying to adopt this pipeline for some of our labs WGBS data and I thought I had gotten past most of the bugs but I got a new one that has me stumped so I wanted to open an issue,

My samples are in mice so I updated the config file and downloaded my own reference genome and annotation file from Ensembl as well as a CpG island annotation from UCSC as you suggested in some past issue threads. (Thank you for that by the way).

I'll upload my config file so you can see how I have everything set up.

I'm getting a new error message in the results folder under with the name Sample.align.log that seems to indicate there is an issue with the bwameth.py script errno32 broken pipe.

I'm not sure if this is trying to indicate that the script is having issues with aligning my reads or what.

Any help with this would be a big help.

image

image

image

MarWoes commented 2 years ago

Hi, thanks for reaching out!

Could you provide some information on the technical infrastructure you are running the analysis on (e.g. SLURM or a regular root server) and the exact command you used to run the analysis? Additionally it would be helpful to have a subsampled version of your .fq.gz files, if that's possible?

Best, Marius

JakeLehle commented 2 years ago

So to set this all up I'm running this on a ubuntu virtual machine with 32GB of ram. So I'm running this in a regular root server environment for these purposes. Once everything is working and I can check it with the shiny server I was going to push it over to our school's cluster and run it using SLURM.

I run the snake-make from the wgbs2-config.yml file I posted up above and run it with this command.

$wg-blimp run-snakemake-from-config --cores=40 wgbs2-config.yaml

I saw an easy mistake in my previous post on the wgbs2-config.yml file. I have the methylseekr_pmd_chromosome: chr22 However, mice don't have a chr22 haha. So to fix this I downloaded the mouse reference genome for just chr19 from Ensemble last night and changed the value of methylseekr_pmd_chromosome: chr19 as well as the ref: to the new GRCm39_chr19.fa file I downloaded and tried running everything again. I was hoping this would make alignment easier similar to the practice data you provide but I'm still running into the same broken pipe err 32 from bwameth.

Let me work on subsetting the fq.gz data files and see if I can get you that soon so you can set everything up. (I've never actually done that before, any tips on how to do that? Would I just output the top of each file and save the first 50 lines into a new file?) Does it matter if the data is trimmed or not? I've been running trimmed data for these tests, but I have both.

JakeLehle commented 2 years ago

Okay, I figured out how to subset the files. I used this command on each of the files to output the first 50 lines from each of my files into a smaller subset. I renamed the files with a "50" at the end.

$head --lines=50 ES_L001_R1_002.fq.gz > ES_L001_R1_002_50.fq.gz

TF_L001_R1_001_50.fq.gz ES_L001_R2_002_50.fq.gz ES_L001_R2_001_50.fq.gz ES_L001_R1_002_50.fq.gz ES_L001_R1_001_50.fq.gz TF_L001_R2_002_50.fq.gz TF_L001_R2_001_50.fq.gz TF_L001_R1_002_50.fq.gz

I'm going to attach an updated wgb2-config.yml file of what the job looks like as well to this comment that also updated the regular expression to include the new files with the "50" at the end of the file name.

image

Let me know if you need anything else to help you run my data set. Thank you so much for your help with this.

MarWoes commented 2 years ago

Thanks for providing the subsampled files, that was exactly what I was l looking for :) I'll see if I can find the source of the error!

Also for the segmentation: Make sure to use the entire reference .fasta reference including all chromosomes in your analysis. The MethylSeekR chromosome will be extracted from the .fasta file using the identifier provided and is only used for calibration. You might as well select any chromosome here. The .fai file contains the name and length of each chromosome sequence, so you can just select one of those.

I assume you downloaded the reference from Ensembl? Then chromosome 19 of the mice genome would carry the name CM001012.3 .

Best, Marius

MarWoes commented 2 years ago

Small update: The Ensembl fasta is also labeled via "human readable" chromosome names: Setting '19' as calibration chromosome would be correct here.

JakeLehle commented 2 years ago

Okay, I'll make that change and change the name to '19' see if that helps!

I've been playing around with running those smaller subsets of data (I took the first 10000 reads of my fastq.gz sample files) and this actually fixed my problem with bwameth.py errno 32. I'm pretty sure my issue was due to running out of RAM and crashing the script since I'm running this on a virtual machine, but running the smaller sets of data seemed to fix this, but I hit a new issue at the 71% point in the pipeline where the dmrs are annotated by bsseq.

Looks like bsseq I think is having issues with the csv annotation file that I downloaded from the UCSC table browser and the names on the unlocalized genomic contig regions such as chrUn_GL456370v1 aren't matching up to either the GTF annotation file or the refrence_genome.fq.fai file produced by BWA which has these regions listed in this format GL456210.1 Do I need to manually change these names to try and get the formats to match? I can do a quick pattern search and replace with vi.

I've already tried just removing the bsseq step from the config file rerunning everything and it just gets stuck at 71% on metilene step with a similar err output.

However, since this is a now new issue I don't know if you want to close out this old issue and open a new one or edit the issue name so it doesn't get too convoluted. Let me know if you hit this issue as well running the data sets with only 50 lines of reads and I can send you the screenshots of the new error code either here or in a new issue thread depending on how you want to organize this.

I was rereading the issues and I think this new issue is linked to #5

Best, Jake

JakeLehle commented 2 years ago

Haha okay, I take back a few things in my previous comment. I've been learning more about your pipeline and I think I know why I'm getting hung up.

There is nothing wrong with either of the .csv.gz cgi or repeat masker files from UCSC. I compared those to the ones included with the sample data blood and sperm sample and the format match up. Same for the reference genome and annotation gtf file, those are fine as well.

My issue is coming after bwa finishes up doing the alignment and makes the .bam and .bai files. Those files then get used by MethylDackle which makes the first .bedGraph files and then you have that transformBedGraph.R R script that should combine the information from the sample_CpG.bedGraph to make the _CpG_ratio.bedGraph file for each group. I can see in the snakemake script file it looks like either the .bedGraph or _CpG_ratio.bedGraph files are then compared by bsseq and metilene but here is my issue. Ther are CpG information that are unique to each of my groups. One of my groups has rows such as MU069434.1, GL456372.1, GL456378.1, GL456381.1 which are not in the other second group. While the second group also has rows GL456354.1, GL456368.1, GL456370.1, GL456387.1, JH584296.1, JH584297.1 which are not in the first group. Because of this the program sees these unique scaffold regions in each of the two groups, thinks I used a different reference genome to do the annotation and that's why they aren't matching up so it crashes.

Here is the error message from the bsseq.log file so you can see a little of this issue. image

This is likely due to running the small sample data sets. I'll move everything I have over to our school's cluster and try running the full pipeline. (I think everything is working fine I'm just getting errors by trying to run it on my personal computer). However, without this, I wouldn't have had a chance to really dig into your pipeline. It's a really cool resource the way you strung all this different software together thank you again for putting this up.

I'll keep playing around with it and try just removing all the unique row names as the last troubleshooting measure from the .bedGraph files and see if I can get the rest of the pipeline to run or just try running the rest of the pipeline manually following along with your snakemake file.

Best, Jake

MarWoes commented 2 years ago

Hi Jake,

Thanks for the update! From what I can tell, the data should be fine for analysis (at least I did not encounter any errors with my test setup).

It is possible that bsseq is failing due to RAM issues, it is kind of known to consume a lot of memory, so for your test runs you might want to consider to simply drop bsseq from the list of DMR callers and use only metilene for test purposes. The screenshot you posted does not indicate any error. This is only a warning generated by the R packages used, and would not break execution, so that would be another hint that it may be too few RAM.

I'm also thinking of just adding the mouse genome to the built-in genomes (currently only human and macaca genomes are supported), then the analysis would be more straightforward :)

I'm happy to help, don't hesitate to post further updates if anything unexpectantly breaks during your analyses runs ;)

Best, Marius