StevenWingett / FastQ-Screen

Detecting contamination in NGS data and multi-species analysis
https://stevenwingett.github.io/FastQ-Screen/
GNU General Public License v3.0

Filtering bisulfite fastq files #63


amsparks commented 1 year ago

Hello,

I'm trying to run FastQ-Screen on bisulfite data from sheep. I've run FastQ-Screen successfully before as a QC check on these data (with 80 GB of memory), but I'm now having memory trouble (currently trying 150 GB) when creating a tag file to filter out the controls (pUC19 and lambda) that were spiked into my samples to test bisulfite conversion efficiency.

I am running this on an HPC and have noticed that a number of 'core.#####' files, each ~23 GB, are created in the working directory, and I wonder if this is related to the problem. I never noticed these files before when running fastq_screen without --tag (or for any other jobs on the HPC). Are these files created by FastQ-Screen, and are they needed? Any advice on how to reduce the memory needed for the job would be really appreciated; I am currently only working on a pilot dataset and will be working with a much larger dataset in the future.

The code I'm using is:

fastq_screen --tag --conf fastq_screen.conf --bisulfite --outdir tagged_data $datapath/*.gz
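
A minimal fastq_screen.conf for this kind of setup might look like the sketch below. The paths and thread count are placeholder assumptions, not the actual values used here; in --bisulfite mode each DATABASE entry should point at a genome folder prepared with Bismark.

# Aligner locations and thread count (placeholders - adjust to your installation)
BOWTIE2   /usr/local/bin/bowtie2
BISMARK   /usr/local/bin/bismark
THREADS   8

# One DATABASE line per genome; in --bisulfite mode each path must be
# a genome folder prepared with bismark_genome_preparation
DATABASE  Sheep   /path/to/genomes/sheep
DATABASE  pUC19   /path/to/genomes/puc19
DATABASE  Lambda  /path/to/genomes/lambda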

Thanks so much in advance, Alex

StevenWingett commented 1 year ago

Hi Alex,

That is a huge amount of memory, and I'm surprised it isn't working. I don't remember FastQ Screen making core.##### files either; perhaps they are generated by Bowtie2 or Bismark.

What is the file size of this pilot dataset?

Does FastQ Screen work without the --tag option? That command would be:

fastq_screen --conf fastq_screen.conf --bisulfite --outdir tagged_data $datapath/*.gz

Best, Steven

amsparks commented 1 year ago

Hi Steven,

Thanks so much for your quick reply!

The dataset I'm working with is 542 GB in total: 88 files, each between 4 and 10 GB.

FastQ-Screen works fine without --tag - the job finished successfully in under 15 hours and only needed 15 GB of memory.

In contrast, when I include --tag, I am using 150 GB of memory and it takes 168 hours to process just over 50% of the data. In further tests I've noticed that the core files were produced for a job where I had requested less memory and usage was very close to the limit (99.75%), but not for another job where I requested more memory and usage was lower (97.42%). I'm running this in Snakemake, so I can easily see which files were incomplete when the job timed out on the HPC, but I have also tried running it outside Snakemake as a batch job. I will keep digging, but let me know if you have any suggestions. How many cores/threads would you usually suggest for a dataset of my size?
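
If these core.##### files are Linux core dumps (a guess based on the naming; core dumps typically appear when a process crashes or is killed close to a memory limit), standard shell commands can confirm this and suppress them. The PID suffix below is illustrative:

# Identify the program that produced a core file (the numeric suffix varies)
file core.12345

# Disable core dump files for the rest of this shell session or job script
ulimit -c 0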

Also, just to double check: I'm trying to filter out the control reads (pUC19 and lambda) from my sheep reads. My conf file contains only the sheep, pUC19 and lambda databases, as I assumed I would later filter out reads from the tagged file that match pUC19 or lambda but not sheep. Is that right, or do I only need the pUC19 and lambda databases in my conf file, without sheep? I thought I would check, as perhaps this would help speed things up a bit.

Thanks a lot, Alex

StevenWingett commented 1 year ago

Hi Alex,

It's hard for me to assess what is causing this. FastQ Screen silently tags every read as part of its normal processing, so I'm not sure why the --tag option is causing a problem.

Just a simple suggestion - have you tried processing the files sequentially rather than parallelising (say, just use 8 threads)? Maybe the parallelisation is not working as expected for some reason.
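
A sequential run along those lines might look like this sketch (the thread count and paths are assumptions carried over from the commands above):

# Process one file at a time, each with a fixed thread count
for fq in $datapath/*.gz; do
    fastq_screen --tag --threads 8 --conf fastq_screen.conf \
        --bisulfite --outdir tagged_data "$fq"
done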

Yes, you could try tagging with just pUC19 and lambda; that is all you need in order to remove those reads.
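
One possible way to do the removal in a single step is FastQ Screen's --filter option, which writes out a FASTQ file of reads matching a per-database pattern; each character in the pattern corresponds to one DATABASE line in the conf file, in order, and 0 means the read does not map to that genome. A sketch, assuming a conf file listing only pUC19 and lambda (in that order) and that --filter behaves the same way in --bisulfite mode:

# Keep only reads that map to neither control genome
fastq_screen --filter 00 --conf fastq_screen.conf --bisulfite \
    --outdir filtered_data $datapath/*.gz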

I hope that helps.

All the best, Steven