AdnanAbouelela closed this issue 1 year ago
Hi @AdnanAbouelela
Thanks for raising this issue. I am looking into it now.
I am able to replicate this problem. Thanks for the test data!
Hi @AdnanAbouelela
Version 2.7.7 has been released, which should fix the issue you reported. Please get in touch again if you have any more problems.
Thanks,
Neil
Hi everyone,
Thank you for providing this useful tool! I noticed that some of the UMIs detected by Pychopper contain N nucleotides, and I was wondering whether this is intentional or an artifact. The Ns appear at the end of the UMI and occur at a very low frequency (e.g. 50 reads in roughly 1M).
Specifically, the library has been prepared using the ONT PCB111.24 (PCR-cDNA Barcoding) kit and was sequenced on a FLO-MIN106D (R9.4.1) with an MK1C. The reads were manually basecalled and demultiplexed using guppy (dna_r9.4.1_450bps_hac.cfg). A minimal fastq file containing 4 reads from this sequencing run is uploaded here:
https://drive.google.com/drive/folders/17iotOe6jDRXF7-5a-36SkWT1m1vBj8dg?usp=sharing
When running Pychopper (version 2.7.2, environment yaml attached) using
singularity exec ../../singularity/pychopper_ubuntu-22.04.sif pychopper -r out/report.pdf -u out/unclassified.fq -w out/rescued.fq -S out/stats.tsv -k PCS111 -q 1e-05 -U mre.fastq out/mre_out.fq
two reads, one on each of the + and - strands, contain correct UMIs, whereas the other two reads produce the UMIs "TTTAGGCTTGAGCTTACCGTTGGACTTN" (- strand) and "NNNCCCCTTAAGATTCGGGTTGGGATTT" (+ strand). All output files are accessible in the drive. I also noticed that in the latter case the detected UMI lies within the reported sequence, and it is unclear to me why such cases would be included. The cut-off parameter of 1e-05 was determined based on the full dataset (the report PDF is attached as well).
Currently, I am just discarding those "artifactual" reads, but I am curious as to why this happens. Please let me know if I need to provide any additional information.
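For reference, the discarding step could look roughly like the sketch below. It is only a minimal illustration, not how Pychopper itself handles these reads. It assumes unwrapped 4-line FASTQ records, and it assumes that the UMI can be recovered as the last underscore-separated token of the read ID; that header layout is an assumption for illustration, so `umi_from_header` (a hypothetical helper) should be adapted to the actual output. The read IDs in the usage example are made up.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumes unwrapped records)."""
    buf: List[str] = []
    for line in lines:
        buf.append(line.rstrip("\n"))
        if len(buf) == 4:
            yield (buf[0], buf[1], buf[2], buf[3])
            buf = []


def umi_from_header(header: str) -> str:
    """Hypothetical helper: take the UMI from the read ID.

    ASSUMPTION: the UMI is the last underscore-separated token of the
    read ID (the first whitespace-delimited field of the header).
    Adjust this to match the real header layout.
    """
    fields = header[1:].split()
    read_id = fields[0] if fields else ""
    return read_id.rsplit("_", 1)[-1]


def split_by_umi(
    records: Iterable[Record],
    get_umi: Callable[[str], str] = umi_from_header,
) -> Tuple[List[Record], List[Record]]:
    """Separate records into (kept, discarded) by whether the UMI has an N."""
    kept: List[Record] = []
    discarded: List[Record] = []
    for record in records:
        umi = get_umi(record[0]).upper()
        (discarded if "N" in umi else kept).append(record)
    return kept, discarded


if __name__ == "__main__":
    # Made-up read IDs; only the first UMI is taken from this report.
    example = [
        "@readA_TTTAGGCTTGAGCTTACCGTTGGACTTN strand=-",
        "ACGT", "+", "IIII",
        "@readB_TTTACGCTTGAGCTTACCGTTGGACTTT strand=+",
        "ACGT", "+", "IIII",
    ]
    kept, discarded = split_by_umi(read_fastq(example))
    print(len(kept), "kept;", len(discarded), "discarded")
```

Passing the extractor as the `get_umi` argument keeps the header-parsing assumption in one place, so only that function needs to change if the UMI is stored differently.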
Thank you so much! Adnan