epi2me-labs / pychopper

cDNA read preprocessing

Auto tuning of q parameter not responding to -Y or defaults #50

Closed lindehell closed 3 months ago

lindehell commented 8 months ago

Hi

I've used pychopper a number of times in the past, both standalone and as part of the wf-transcriptome pipeline, but in my latest project I've been having problems with the autotuning of the q cutoff parameter.

I've tried running pychopper with and without the -Y parameter, and every time the software tries to tune q using 100% of my reads (~40 million per sample), resulting in an unreasonable runtime. I have tried changing other parameters too, but to no effect.

I've gone through the issues and can see that this is a recurring problem and also that it has been fixed. But I installed using the recommended conda command and am using v2.7.9, and I still get it. Is there a workaround, a specific old version, or a dev branch I could try?

I would be delighted if you could shed some light on this issue.

I include a typical pychopper command and the start of its output below:

pychopper -r 2_pychopper/report.pdf -Y 10000 0_raw/sample1.fastq.gz 2_pychopper/sample1.pychopped.fq

Using kit: /home/henrik2/miniconda3/envs/minimap2/lib/python3.8/site-packages/pychopper/primer_data/cDNA_SSP_VNP.fas
Configurations to consider: "+:SSP,-VNP|-:VNP,-SSP"
Counting fastq records in input file:0_raw/sample1.fastq.gz
Total fastq records in input file: 24
Tuning the cutoff parameter (q) on 40809168 sampled reads (100.0%) passing quality filters (Q >= 7.0).
Optimizing over 30 cutoff values.
10%|██████████▌ | 3/30 [8:56:45<80:34:02, 10742.33s/it]

nrhorner commented 8 months ago

Hi @lindehell

Thanks for submitting your issue. It looks like your issue may be the same as https://github.com/epi2me-labs/pychopper/issues/48. I'll take a look ASAP.

lindehell commented 8 months ago

I just tried running a containerized version of pychopper (2.7.4) with -Y 10000 and got the same outcome.

Thank you for looking into it and for the fast response.

nrhorner commented 3 months ago

Hi @lindehell sorry for the very late response.

Looking at it again, I can see that the issue is due to only 24 reads being identified at the start of the workflow. This number is used to calculate the proportion of reads to use in the tuning step. Because it comes out at 24, the proportion of reads to use is set to 1.0, i.e. the whole lot. The number is calculated here: https://github.com/epi2me-labs/pychopper/blob/master/pychopper/utils.py#L53. I'm not sure why this is happening, but it might be an issue with the file?
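To make the arithmetic concrete, here is a minimal sketch of how a sampling proportion could be derived from the requested number of tuning reads (-Y) and the counted records. The helper name `tuning_fraction` and the exact formula are assumptions for illustration, not pychopper's actual code; only the capping at 1.0 is taken from the behaviour described above.

```python
def tuning_fraction(requested_reads: int, counted_records: int) -> float:
    """Hypothetical helper: proportion of reads to sample for tuning.

    Assumes the proportion is requested/counted, capped at 1.0.
    """
    if counted_records <= 0:
        # Nothing counted: fall back to using every read.
        return 1.0
    return min(1.0, requested_reads / counted_records)


# With the miscounted total from the log, the cap kicks in:
print(tuning_fraction(10000, 24))        # 1.0
# With the real number of reads, only a small fraction would be sampled:
print(tuning_fraction(10000, 40809168))  # roughly 0.000245
```

Under this assumption, any record count below the -Y value forces the proportion to 1.0, which would match the "100.0%" in the log.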

If this is still an issue for you, would it be possible for you to share a FASTQ file or snippet? I can send you a link to upload if necessary.

Thanks

nrhorner commented 3 months ago

This might be due to incorrectly formatted FASTQ inputs. Closing the issue now due to lack of response.
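For anyone hitting the same symptom, a quick independent check of the input may help. The following is a rough sketch, not part of pychopper: `summarize_fastq` is a hypothetical helper that assumes strict four-line FASTQ records and reports a few common formatting irregularities. Which of these, if any, would explain the count of 24 is not established in this thread.

```python
import gzip
from itertools import islice


def summarize_fastq(path, max_records=None):
    """Hypothetical check: count 4-line FASTQ records and flag oddities."""
    opener = gzip.open if str(path).endswith(".gz") else open
    stats = {
        "records": 0,
        "bad_header": 0,       # header line does not start with '@'
        "non_bare_plus": 0,    # separator line is not exactly '+'
        "length_mismatch": 0,  # sequence and quality lengths differ
        "truncated": False,    # file ends in the middle of a record
    }
    with opener(path, "rt") as fh:
        while max_records is None or stats["records"] < max_records:
            block = list(islice(fh, 4))
            # Stop at end of file or on trailing blank lines.
            if not block or not any(line.strip() for line in block):
                break
            if len(block) < 4:
                stats["truncated"] = True
                break
            header, seq, sep, qual = (line.rstrip("\r\n") for line in block)
            stats["records"] += 1
            if not header.startswith("@"):
                stats["bad_header"] += 1
            if sep != "+":
                stats["non_bare_plus"] += 1
            if len(seq) != len(qual):
                stats["length_mismatch"] += 1
    return stats
```

A call such as `summarize_fastq("0_raw/sample1.fastq.gz", max_records=100000)` would inspect only the first 100,000 records. If the reported record count is far above the "Total fastq records" line that pychopper prints, the file itself is a likely suspect.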