Just wanted to report a possible issue. I recently tried to run pychopper on a relatively large FASTQ file (~12 GB) and it crashed at the tuning step. Based on the screen output, the program appears to autotune on the total number of reads in the file rather than respecting the default of 10000 or a manually set -Y value. I'm running on a machine with 32 GB of memory, and it is completely consumed when I run pychopper on the full file, crashing the system. I also tested a smaller file sampled from the larger one (10000 reads) and confirmed (at least based on what's reported on screen) that it still attempts to autotune on the total number of reads, though in that case it finishes successfully.
Here's the command / output for the full file (total reads in the FASTQ: 4391012) -
pychopper -r pychopper/SRR16770671_pychopperReport.pdf -u pychopper/SRR16770671_pychopperUnclassified.fastq -w pychopper/SRR16770671_pychopperRescued.fastq SRR16770671.fastq SRR16770671_pychopperFiltered.fastq
Using kit: PCS109
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Counting fastq records in input file: SRR16770671.fastq
Total fastq records in input file: 0
Tuning the cutoff parameter (q) on 4233345 sampled reads (100.0%) passing quality filters (Q >= 7.0).
Same command (with -t 11 added) and -Y manually set to 10000 -
pychopper -t 11 -Y 10000 -r pychopper/SRR16770671_pychopperReport.pdf -u pychopper/SRR16770671_pychopperUnclassified.fastq -w pychopper/SRR16770671_pychopperRescued.fastq SRR16770671.fastq SRR16770671_pychopperFiltered.fastq
Using kit: PCS109
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Counting fastq records in input file: SRR16770671.fastq
Total fastq records in input file: 0
Tuning the cutoff parameter (q) on 4233345 sampled reads (100.0%) passing quality filters (Q >= 7.0).
Attempt on subsample of 10000 reads -
fastq-sample SRR16770671.fastq
pychopper -t 11 -Y 10000 -r pychopper/SRR16770671_pychopperReport.pdf -u pychopper/SRR16770671_pychopperUnclassified.fastq -w pychopper/SRR16770671_pychopperRescued.fastq SRR16770671_sample.fastq SRR16770671_pychopperFiltered.fastq
Using kit: PCS109
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Counting fastq records in input file: SRR16770671_sample.fastq
Total fastq records in input file: 0
Tuning the cutoff parameter (q) on 9643 sampled reads (100.0%) passing quality filters (Q >= 7.0).
The top two runs crash my system by exhausting memory, while the bottom one completes fine. I ended up manually setting -q for the full file based on the value tuned on the subsample, so I was able to get it working, but I thought it was worth reporting anyway.
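For reference, the workaround looked roughly like this (the -q value and the subsample output filenames below are placeholders, not the actual values from my run):

# 1. Tune on the 10000-read subsample and note the q cutoff it reports
pychopper -t 11 -r pychopper/sample_report.pdf -u pychopper/sample_unclassified.fastq -w pychopper/sample_rescued.fastq SRR16770671_sample.fastq sample_filtered.fastq
# 2. Re-run on the full file, passing that cutoff via -q so the tuning step (and the memory blow-up) is skipped
pychopper -t 11 -q 0.25 -r pychopper/SRR16770671_pychopperReport.pdf -u pychopper/SRR16770671_pychopperUnclassified.fastq -w pychopper/SRR16770671_pychopperRescued.fastq SRR16770671.fastq SRR16770671_pychopperFiltered.fastq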
Also, just FYI, here are some Python warnings I noticed while testing (running Python 3.8.12), in case they're helpful for this issue or future releases -
/home/Miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py:6982: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
/home/Miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py:6982: FutureWarning: In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)
return Index(sequences[0], name=names)
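In case it's useful for a future release, the deprecated pattern the first warning refers to can typically be replaced like this (an illustrative sketch only, not pychopper's actual code; the frames and column names are made up):

import pandas as pd

# Hypothetical frames, just to show the replacement
stats = pd.DataFrame({"read": ["r1"], "q": [0.25]})
new_row = pd.DataFrame({"read": ["r2"], "q": [0.31]})

# Deprecated: stats = stats.append(new_row, ignore_index=True)
# Forward-compatible equivalent:
stats = pd.concat([stats, new_row], ignore_index=True)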